Class 8…making things FAIR
'if I have seen further it is by
standing on the shoulders of
giants'.
Scott Edmunds, HKU Data Curation MLIM7350
Communicating in-class
• Chat channel:
• http://backchannelchat.com/chat/dw131
• Let me know to slow down/speed up
https://osf.io/cgpzb/
Open Science (Open Access & Open
Data) survey of Hong Kong
Reading/Reflection
Most people mentioned training of librarians:
Tak Hei Lam: “Training should be provided to librarians so that they have adequate
knowledge about data curation and provide professional support and advice for the
researchers to sharing of data. Also, librarians can provide training and workshop to
change the mindset of the researcher not to rely on the impact factor but on other
comprehensive research metrics such as PlumX”
Lijia Yu: At the same time, in big data era, the research will be increasingly migrating
to the cloud, so this should be done in an organized manner.
Lots of talk on incentive systems & policy, but little on infrastructure other than:
NEED FOR A PLAN/LEADERSHIP
HKU Repeatability in HK
Research Experiment
(homework)
Feedback?
What have we found?
HKU Repeatability in HK
Research Experiment
(homework)
Interesting examples
http://hub.hku.hk/handle/10722/208585
Is data in a HKU thesis sufficient?
Interesting examples
Several examples of restrictions with ID data
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165978
Interesting examples
Several examples of restrictions with ID data
http://www.vox.com/2015/6/17/8796225/mers-virus-data-sharing
Interesting examples
Lots of data in Dryad, but 1 H7N9 example isn’t resolving
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148506
Story so far
• HKU publishing a lot of survey based research in PLOS
• 3 examples from “Children of 1997” birth cohort. Access to data
involves emailing DAC
• External databases: 2 examples in Dryad data (one not working),
1 example in OSF, 1 example in scholarhub, lots in figshare
• So far 2 have data with broken URLs, 1/3 are controlled access,
1/4 have summary but not raw data
WHAT EXACTLY IS “RESEARCH DATA"?
Research Data 1665?
Scholarly articles are merely advertisement of scholarship. The
actual scholarly artefacts, i.e. the data and computational
methods, which support the scholarship, remain largely
inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab
and reproducible research, 1995
Esoteric formats, poorly structured,
Tabular, often spreadsheet based
Issues open data community well used to
(data cleaning, scraping, etc.,)
The long tail of scientific data…
Science Data Volumes
Exabytes, Petabytes, 100’s of Petabytes
Sequencing
Mass Spec
Astrophysics HE Physics Biology
Imaging
Square Kilometer Array
Large Hadron Collider
Big Data in Healthcare
http://dx.doi.org/10.1186/s13742-016-0117-6
Big Data in Healthcare: challenges
• 80% of health data unstructured (100’s of
forms/formats)
• Medical Imaging archives increasing 20-40% per year
• Genomics data will increase data volumes exponentially
• Patients expect extra privacy protection if they are
going to fully participate in data driven research
Source: https://www.healthcare.siemens.com/magazine/mso-big-data-and-healthcare-2.html
Open Data in Physics
1961 CERN pre-prints shelf
http://cerncourier.com/cws/article/cern/28654
http://arxiv.org/
1991-date arXiv
Open Data in Earth Sciences
https://pangaea.de/
Established 1987 (online since 1995)…
Open Data in Earth Sciences
#Climategate UEA emails “scandal”
Is it possible to be too open?
Closed Data in Chemistry
Open Data in Biology
1934: newsletter era
1980: database era
1987: online era
2010’s: “bioinformatics bingo” era
BGI HK Chamber O’Illumina’s
The LHC of Biology?
20PB of storage
Post-Human Genome Project
1st Gen
2nd (next) Gen
Source: http://www.genome.gov/sequencingcosts/ (with apologies)
Omes & more omes!
Other Ome(s): mass spectrometry data
https://en.wikipedia.org/wiki/Mass_spectrometry
Nadina Wiórkiewicz
Rise of mass spectrometry data
https://doi.org/10.1093/nar/gkv1352
Challenges: Rise of big imaging data
http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3222.html
Challenges: Rise of big imaging data
https://openi.nlm.nih.gov/detailedresult.php?img=PMC3171117_JCB_201108095_RGB_Fig2&req=4
http://journals.sagepub.com/doi/10.1177/1087057114528537
HCS: High Content Screens
AKA High Throughput
Screening: High volumes,
growing uptake – TBs of
data
New ways of
sharing/publishing data
with OMERO/JCB data
viewer
Imaging Challenges: 100s of formats
http://www.openmicroscopy.org/site/products/bio-formats
Genomics: open-data success story?
Sharing/reproducibility helped by
stability of:
1. Platforms
2. Repositories
3. Standards
1st Gen 2nd Gen
Genomics Data Sharing Policies…
Bermuda Accords 1996/1997/1998:
1. Automatic release of sequence assemblies within 24 hours.
2. Immediate publication of finished annotated sequences.
3. Aim to make the entire sequence freely available in the public domain for
both research and development in order to maximise benefits to society.
Fort Lauderdale Agreement, 2003:
1. Sequence traces from whole genome shotgun projects are to be
deposited in a trace archive within one week of production.
2. Whole genome assemblies are to be deposited in a public nucleotide
sequence database as soon as possible after the assembled sequence
has met a set of quality evaluation criteria.
Toronto International data release workshop, 2009:
The goal was to reaffirm and refine, where needed, the policies related to
the early release of genomic data, and to extend, if possible, similar data
release policies to other types of large biological datasets – whether from
proteomics, biobanking or metabolite research.
https://doi.org/10.1093/gigascience/giw003
Three decades of sharing infrastructure: Genbank
Scaling up of sharing: 1000 genomes
http://www.internationalgenome.org/
Three decades of sharing infrastructure: INSDC
http://www.insdc.org/
Sharing aids individuals
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308
Sharing Detailed Research
Data Is Associated with
Increased Citation Rate.
Every 10 datasets collected contributes to at least 4 papers in the
following 3-years.
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment Nature, 473
(7347), 285-285 DOI: 10.1038/473285a
[Figure: bar chart comparing rice and wheat, axis ticks 0 to 700]
Rice v Wheat: consequences of publicly available
genome data.
Sharing aids fields…
Sharing aids growth of databases…
http://scienceblogs.com/digitalbio/2015/01/30/bio-databases-2015/
Sharing aids growth of standards…
Why do we need standards?
https://xkcd.com/927/
Sharing aids growth of standards…
Why do we need standards?
http://www.biochemsoctrans.org/content/36/1/33
Checklists aid the growth of sharing…
http://www.equator-network.org/
There are over 860 databases &
675 standards in the life sciences
Formats Terminologies Guidelines
Guidelines = Minimum information
reporting requirements, checklists
o Report the same core, essential
information
o e.g. ARRIVE guidelines
Terminologies = Controlled
vocabularies, taxonomies,
thesauri, ontologies etc.
o Unambiguously refer to
an entity
o e.g. Gene Ontology
Models/Formats = Conceptual
model, conceptual schema,
exchange formats
o Allow data to flow from
one system to another
o e.g. FASTA
Enablers: to better describe,
share and query data
Formats Terminologies Guidelines
https://biosharing.org/
Need for databases of databases
Exercise: Use Biosharing to answer the following?
To share your work are there standards you should follow?
Are there specialized curated databases you can use?
A. You work in the area of functional MRI
imaging and are producing 100’s of GBs
of fMRI brain scan data.
B. You are an immunologist using flow
cytometry to sort cells.
C. You are a chemist looking at the 3D
crystal structure of proteins using NMR
https://biosharing.org/
Potential collaborators would like to use your data.
Sabban, Sari
SharingOpen Data
Methods
Answer
Metadata
softwareAnalysis
(Pipelines)
Workflows/
Environments
Idea
Study
Rewarding the
DOI, etc.
Publication
Publication
Publication
Data
gigagalaxy.net
Workflows
Reward Sharing of Workflows
Visualisations
& DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy
Facilitate reproducibility, reuse & sharing & publish outputs of:
Knitr, Sweave, Jupyter/iPython Notebook, etc.
Open Documents
Reward Open/Dynamic Workbooks
Virtual Machines/containers
http://dx.doi.org/10.1186/s13742-015-0087-0
:standardised containers
https://opensource.org/licenses
Open Source v Open Data Licenses
Same ethos (open source begat open data), different contexts
• OSS designed for continuing development, OD for making
objects available
• IP issues. Software can be patented, data (generally) can’t
• More business models for software than data (so far…)
• Wider selection of OSS licenses, and more options to fine-
tune access (Linking, Distribution, Modification, Sublicensing,
Patents/Trademarks, etc.)
• Now researchers are producing such large &
heterogeneous datasets, what do you think the
challenges are for producers and users?
• What are the legal implications of mixing data and
software?
• What do you think the security issues of accessing
these complex combined research objects are?
Questions to ask?
Questions? | 15 minute break
Research Data: Pop Quiz
What was #climategate?
What is the INSDC, and who are the three INSDC partners?
What is the estimated yearly growth of medical imaging data?
What are bioboxes?
How many databases are currently listed in biosharing?
Which of the reporting guidelines/checklists are for A) animals, B)
biological science, and C) clinical research: MIBBI, ARRIVE and
Equator
ETHICS & DATA SECURITY ISSUES
Ethics: needs approval
http://www.rss.hku.hk/integrity/ethics-compliance
Ethics: clinical trials need registration
http://www.hkuctr.com/
Ethics: need informed consent
http://www.med.hku.hk/images/document/04research/institution/5QMH_IRB_GUIDANCE_NOTES_FOR_THE_PREPARATION_OF_PATIENT_CONSENT.pdf
Where does data sharing fit into this?
HKU Guideline Notes - for Preparation of Subject Information
Sheet & Informed Consent Form:
WILL MY TAKING PART IN THIS STUDY BE KEPT CONFIDENTIAL?
You will need to obtain the patient’s permission to allow restricted access to
their medical records and to the information collected about them in the
course of the study. You should explain that all information collected about
them will be kept strictly confidential. A suggested form of words is:
“All information which is collected about you during the course of the research will
be kept strictly confidential. Any information about you which leaves the
hospital/surgery will have your name and address removed so that you cannot be
recognised from it.”
Ethics: includes animal research
http://www.med.hku.hk/research/research-ethics/animal-ethics-culatr
Ethics: includes animal research
https://www.nc3rs.org.uk/arrive-guidelines
Lots of tools available: anonymisation
https://www.ukdataservice.ac.uk/manage-data/tools-and-templates
Lots of tools available: encryption
https://www.brookes.ac.uk/Research/Research-ethics/Encrypting-files/
Lots of tools available: DAC & brokering
https://blog.repositive.io/getting-data-out-of-the-ega/
Lots of tools available: DAC & brokering
http://www.ckbiobank.org/site/
Kinds of identifying information
• Direct identifiers
– Names, addresses, postcode information,
telephone numbers or pictures
• Indirect identifiers
– In combination with other information, would
identify e.g. information on workplace, occupation
or exceptional values of characteristics like salary
or age
http://www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation
De-identification #101
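The split between direct and indirect identifiers can be sketched in a few lines of Python. This is an illustration only, not a vetted anonymisation tool: the field names (`name`, `phone`, `age`, `salary` and so on), the ten-year age bands and the salary cap are all assumptions made up for the example.

```python
# Hypothetical field names; adjust to the dataset at hand.
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "phone", "photo"}


def age_band(age, width=10):
    """Coarsen an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(record, salary_cap=100_000):
    """Return a copy of record with direct identifiers dropped and
    two indirect identifiers (age, salary) coarsened."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = age_band(cleaned["age"])
    if "salary" in cleaned:
        # Top-code exceptional values so outliers are harder to single out.
        cleaned["salary"] = min(cleaned["salary"], salary_cap)
    return cleaned


# Made-up participant record.
participant = {
    "name": "Example Person",
    "phone": "0000 0000",
    "age": 34,
    "salary": 250_000,
    "occupation": "nurse",
}
print(deidentify(participant))
```

Note that `occupation` is left untouched here, although in combination with other fields it may still identify someone; deciding which indirect identifiers to coarsen or drop remains a judgement call for each dataset.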
Anonymising audio-visual data
• Anonymisation of audio-visual data, such as editing of digital
images or audio recordings, should be done sensitively. Bleeping
out real names or place names is acceptable, but disguising voices
by altering the pitch in a recording, or obscuring faces by pixellating
sections of a video image significantly reduces the usefulness of
data. These processes are also highly labour intensive and
expensive.
• If confidentiality of audio-visual data is an issue, it is better to
obtain the participant's consent to use and share the data
unaltered. Where anonymisation would result in too much loss of
data content, regulating access to data can be considered as a
better strategy.
• We urge researchers to consider and judge at an early stage the
implications of depositing materials containing confidential
information and to get in touch to consult on any potential issues.
https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative
Considerations for medical imaging
https://openfmri.org/de-identification/
https://sourceforge.net/projects/privacyguard/
Need to also ensure
DICOM (Digital Imaging
and Communications in
Medicine) metadata also
passes through de-
identification toolkit
MRI brain scans first
undergo skull stripping
Automated Defacing Tools
required beyond this
Considerations for medical images
https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-15-21
https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-11-26
• Sharing of clinical images crucial in understanding “phenotypes”
• Require “consent to publish”, but challenges doing this with ill
people, children, elderly, and disadvantaged
• Further challenges in era of social media, open access and wikipedia
• Security issues protecting signed consent forms
Not just a metadata problem…
http://science.sciencemag.org/content/339/6117/321
Extra considerations for HK
Hospital Authority restrictions on data
• Have to apply to Hospital Authority to access public health data
• Only approved 14 data requests (as of May 2016)
• If approved requires data recovery charges (collect $250,000
HKD a year from this)
• Can publish aggregate/summary data in journals, but not share
data
• Only approves academic use, not citizens or industry/pharma
Via FOI request: https://accessinfo.hk/en/request/request_for_statistics_on_data_c
Extra considerations for China
Human genetic data needs MOST approval
Article 2: The term "human genetic resources" in the Measures refers to the genetic materials such
as human organs, tissues, cells, blood specimens, preparations of any types or recombinant DNA
constructs, which contain human genome, genes or gene products as well as to the information
related to such materials.
Second, any international collaborative project involving Chinese human genetic resources, for
example international research cooperation and exporting human genetic resources or taking such
resources outside of the territory of China shall apply to MOST for examination and approval
prior to entering into an official contract. And Chinese collaborating party shall be responsible for
going through the due formalities of application for approval. (See Article 11)
http://www.chinadaily.com.cn/china/2010-08/12/content_11141879.htm
Foshan, 2010
Extra considerations for China
Can this data be easily de-identified & shared?
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152381
“The individual in this manuscript has given written informed consent
(as outlined in PLOS consent form) to publish their images. Following
approval by the Institutional Review Board (IRB) of The University of
Hong Kong and Hospital Authority Hong Kong West Cluster (UW 14–
159); 20 individuals, 10 male and 10 female volunteers, were properly
instructed and gave consent to participate in this study by signing the
appropriate informed consent paperwork.”
FAIR or unfair? Principled publishing for data. 公平
或者不公平? 数据发表的原则
What is FAIR (公平的)?
Adverb
Without cheating or trying to achieve unjust advantage.
‘no one could say he played fair’
Adjective
Treating people equally without favouritism or discrimination.
‘the group has achieved fair and equal representation for all its
members’
‘a fairer distribution of wealth’
fair /fɛː/
475, 267 (2011)
http://www.nature.com/news/2011/110720/full/475267a.html
“Wide distribution of information is key to scientific progress, yet
traditionally, Chinese scientists have not systematically released
data or research findings, even after publication.”
“There have been widespread complaints from scientists inside and
outside China about this lack of transparency. ”
“Usually incomplete and unsystematic, [what little supporting data
released] are of little value to researchers and there is evidence that
this drives down a paper's citation numbers.”
Is this FAIR? 这是FAIR?
FAIR questions to ask?
Is the raw data publicly available?
Are the reagents (plasmids, cells,
antibodies, etc.) available?
Are detailed protocols available?
Can I access the processed data &
results (supporting the figures)?
Was this all available BEFORE
publication to the peer reviewers?
Can I inspect the peer reviews?
Can I publish/link +/-ve replication
experiments to this?
A more FAIR approach: Open Data?
Research Objects: a concept & model
http://www.researchobject.org/
• Supporting publication of more than just PDFs, making data, code, & other resources first class citizens
of scholarship.
• Recognizing that there is often a need to publish collections of these resources together as one
shareable, cite-able resource.
• Enriching these resources and collections with any & all additional information required to make
research reusable, & reproducible!
Importance of metadata: context (& discoverability)
https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
https://twitter.com/AlisonMcNab/status/751375987624009728/photo/1
?
Novel tools/formats for data interoperability/handling: ISA
Importance of metadata: context (& discoverability)
Where do you set it?
Experiment
(e.g. International
Cancer Genome
Consortium)
Datasets
(e.g. cancer type)
Sample
(e.g. specimen xyz)
e.g. doi:10.5524/100001
e.g. doi:10.5524/100001-2
e.g. doi:10.5524/100001-2000
or doi:10.5524/100001_xyz
Smaller still?
Importance of granularity
Papers
Data/
Micropubs
Nanopubs
Facts/Assertions (~10^13 in literature)
Importance of granularity
http://www.nature.com/ng/journal/v43/n4/full/ng.785.html
Importance of granularity
http://www.nature.com/ng/journal/v43/n4/full/ng.785.html
• Nanopublication URL: unique, persistent and resolvable identifier
• Integrity Key: guarantee immutability after publication
• Assertion: an individual association between concepts
– statement or declaration
– measurement
– hypothetical inference
– quantitative or qualitative
• Provenance: how this assertion came to be, methods, evidence, context, etc.
• PublicationInfo
– Detailed attribution for authors, institutions, lab technicians, curators
– License info
– Publication date
[Diagram: example nanopublication graph. The assertion node “association” is a sio:statisticalAssociation with sio:has-measurement-value Association_1_p_value, which is a sio:probability-value with sio:has-value 6.56e-5^^xsd:float; other predicates shown are sio:refers-to, opm:wasDerivedFrom, opm:wasGeneratedBy, dcterms:created, pav:authoredBy and dcterms:DOI …]
A nanopub represents structured data
along with its provenance in a single
publishable & citable entity.
http://nanopub.org/
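As a rough sketch of that idea, the three parts of a nanopublication can be modelled as small lists of subject-predicate-object triples held under one identifier. This is not the nanopub.org reference implementation; the `Nanopub` class, the `ex:` identifiers, the `http://example.org/np1` URI and the placeholder date are hypothetical, and the prefixed predicate names are simply copied from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Nanopub:
    """A toy nanopublication: three small graphs under one identifier."""
    uri: str  # stands in for the unique, persistent, resolvable identifier
    assertion: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    pubinfo: list = field(default_factory=list)

    def is_complete(self):
        """True when all three parts contain at least one triple."""
        return all((self.assertion, self.provenance, self.pubinfo))


# Hypothetical identifiers and placeholder values throughout.
np1 = Nanopub(
    uri="http://example.org/np1",
    assertion=[
        ("ex:association_1", "a", "sio:statisticalAssociation"),
        ("ex:association_1", "sio:has-measurement-value", "ex:association_1_p_value"),
        ("ex:association_1_p_value", "a", "sio:probability-value"),
        ("ex:association_1_p_value", "sio:has-value", '"6.56e-5"^^xsd:float'),
    ],
    provenance=[
        ("ex:np1_assertion", "opm:wasDerivedFrom", "ex:some_dataset"),
    ],
    pubinfo=[
        ("ex:np1", "pav:authoredBy", "ex:some_author"),
        ("ex:np1", "dcterms:created", '"2017-01-01"^^xsd:date'),
    ],
)
print(np1.is_complete())
```

Keeping the assertion separate from its provenance and publication info is what lets a single fact be cited and attributed on its own.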
Lots of models/standards/guidelines
Where does that leave us?
?
5★ open data
A mnemonic to remember: FAIR
一个帮助记忆的词语:FAIR
http://www.nature.com/articles/sdata201618
http://www.datafairport.org/
Findable 可发现的
Accessible 可得到的
Interoperable 能共同使用的
Reusable 可以再度使用的
A mnemonic to remember: FAIR
http://www.nature.com/articles/sdata201618
http://www.datafairport.org/
To be Findable:
F1. (meta)data are assigned a globally unique and persistent
identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the
data it describes
F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
A1. (meta)data are retrievable by their identifier using a
standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization
procedure, where necessary
A2. metadata are accessible, even when the data are no longer
available
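For DOIs, principle A1 amounts in practice to plain HTTP(S) against a resolver. The sketch below only builds the request and does not send it. The helper names `doi_to_url` and `metadata_request` are made up for this example, and whether a given DOI's registration agency returns metadata for the `Accept` media type used here is an assumption to check against that agency's documentation.

```python
import urllib.parse
import urllib.request

RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn 'doi:10.5524/100001' or '10.5524/100001' into a resolver URL."""
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    return RESOLVER + urllib.parse.quote(bare, safe="/")


def metadata_request(doi, media_type="application/vnd.citationstyles.csl+json"):
    """Build (but do not send) an HTTP request asking for metadata about a DOI."""
    return urllib.request.Request(doi_to_url(doi), headers={"Accept": media_type})


request = metadata_request("doi:10.5524/100001")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the lookup. Because the identifier and the protocol are both open standards, the same lookup works from any language or tool.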
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and
broadly applicable language for knowledge
representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other
(meta)data
To be Reusable:
R1. (meta)data are richly described with a plurality of accurate and
relevant attributes
R1.1. (meta)data are released with a clear and accessible data
usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
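A few of these principles can be turned into a crude self-check on a metadata record. The sketch below is not an official FAIR metric: the field names (`identifier`, `keywords`, `indexed_in`, `access_url`, `license`, `provenance`), the three-keyword threshold and the example record are all assumptions chosen for illustration, and several principles (I1 to I3, R1.3) are not covered because they need human or domain judgement.

```python
def fair_checklist(record):
    """Map a few FAIR principle codes to True/False for one metadata record."""
    return {
        "F1": bool(record.get("identifier")),        # persistent identifier present
        "F2": len(record.get("keywords", [])) >= 3,  # crude proxy for rich metadata
        "F4": bool(record.get("indexed_in")),        # registered in a searchable resource
        "A1": str(record.get("access_url", "")).startswith(("http://", "https://")),
        "R1.1": bool(record.get("license")),         # clear usage license
        "R1.2": bool(record.get("provenance")),      # provenance recorded
    }


# Made-up record for illustration.
record = {
    "identifier": "doi:10.1234/example",
    "keywords": ["genomics"],
    "access_url": "https://doi.org/10.1234/example",
    "license": "CC0",
}
result = fair_checklist(record)
missing = [code for code, ok in result.items() if not ok]
print(missing)
```

A list of unmet codes like this is most useful before publication, when gaps are still cheap to fix.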
Beyond a mnemonic: FAIR ecosystems
FAIRifier tool
Beyond a mnemonic: FAIR ecosystems
• A particular class of FAIR Data System to provide support for
data interoperability
• Supports publication, search and access to FAIR data
• Fosters an ecosystem of applications and services
• Federated architecture: different FAIRports (and other FAIR Data
Systems) are interconnectable
• Supports citations of datasets and data items
• Provides metrics for data usage and citation
A ‘FAIRpoint or FAIRport’ can be any specific data instance following FAIR data
principles.
http://www.datafairport.org/
Beyond a mnemonic: FAIR ecosystems
http://www.datafairport.org/
?
Beyond a mnemonic: FAIR ecosystems
https://www.fair-access.net.au/fair-statement
“By 2020, Australian publicly funded researchers and research organisations
will have in place policies, standards and practices to make publicly funded
research outputs findable, accessible, interoperable and reusable.”
DTL/ELIXIR-NL
“Bring Your Own Data Party”
GigaScience/BGI HK
Metabolomics ISA-TAB athon v
More FAIR mnemonics: “BYODs”
FAIR Data in the wild
Taking a microscope to the
publication process
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612
How FAIR can we get?
如何获取FAIR?
Data sets
Analyses
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>50,000 accesses
& 885 citations
Open-Code
7 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
>40,000 downloads
Enabled code to be picked apart by bloggers in wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
The SOAPdenovo2 Case study
Subject to and test with 3 models:
Types of resources in an RO: Data; Method/Experimental protocol; Findings
Models to describe each resource type: ISA-TAB/ISA2OWL; Nanopublication; Wfdesc/ISA-TAB/ISA2OWL
1. While there are huge improvements to the quality of the resulting
assemblies, other than the tables it was not stressed in the text that
the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo
v1.
2. In the testing and assessment section (page 3), based on the correct
results in table 2, where we say the scaffold N50 metric is an order of
magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was
actually 45 times longer.
3. Also in the testing and assessment section, based on the correct
results in table 2, where we say SOAPdenovo2 produced a contig N50
1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length
produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1,
this should be 3-64 fold longer.
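Several of the corrections above concern N50 values, so here is a short sketch of how that metric is usually computed. It follows the common textbook definition (the length of the shortest sequence among the longest ones that together cover at least half of the assembly) and is not the code used in the SOAPdenovo2 study; the lengths in the example are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Made-up assemblies: the same total length, split into fewer or more pieces.
old_assembly = [20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
new_assembly = [90, 60, 30, 20]
fold_change = n50(new_assembly) / n50(old_assembly)
print(n50(old_assembly), n50(new_assembly), fold_change)
```

Recomputing fold changes like this directly from the deposited data is the kind of check that surfaced the discrepancies listed above.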
Lessons Learned 经验教训
• Most published research findings are false. Or at
least have errors
• With enough effort it is possible to push button(s) &
recreate a result from a paper with current tools
• Being FAIR can be COSTLY. How much are you willing
to spend? Who will build FAIR infrastructure?
• Much easier to make things FAIR before rather than
after publication. BYODs useful intermediate here
http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html
“The question to ask in order to be a data steward,
to handle data or to simplify a set of standards is
the same: “is it FAIR”?”
http://content.iospress.com/articles/information-services-and-use/isu824
Levels of FAIRness: A-F of FAIR data
In class activity: How FAIR is this data?
1. Data from: Live poultry exposure and public response to influenza
A(H7N9) in urban and rural China during two epidemic waves in
2013-2014 http://hub.hku.hk/cris/dataset/dataset93128
2. Supporting data for “Genomic analyses reveal FAM84B and the
NOTCH pathway are associated with the progression of esophageal
squamous cell carcinoma” http://dx.doi.org/10.5524/100181
3. Linked Drug-Drug Interactions (LIDDI)
https://datahub.io/dataset/linked-drug-drug-interactions-liddi
http://content.iospress.com/articles/information-services-and-use/isu824
Reflection: how fair is FAIR?
Read the FAIR principles paper.
Do you think they are applicable and
feasible for HK? If it is feasible, what is
needed to implement them?
http://www.nature.com/articles/sdata201618
Any questions?
Does anyone have BYO data for the
curation/cleaning workshop?
Final Project
• For the final project for this course, you can
choose from 3 assignment options.
• The assignment is due on the 15th May and it is
worth 40% of your grade.
• Time will be set aside for presenting on this
during the final class on the 24th April: covering
why you chose the option, what
discipline/dataset/topic you are covering, and
what work you've done so far (5 mins per student
including any group feedback)
Final Project: Option 1
Write an Annotated Bibliography about data curation practices in an academic
discipline of your choosing.
• Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of
“open data.”
• Summarize data practices in your chosen discipline or topic. (5-7 sentences)
• Find 7-10 sources that relate that discipline or topic to data creation, management,
and/or curation.
• Provide a citation for the source in APA style.
• Write a short annotation that summarizes the content of the source. You may
include quotes from the source sparingly, but the annotations should be mostly, if
not entirely, in your own words. (3-5 sentences)
• Explain the relevance of the source with relation to the data practices of your
chosen discipline or topic. (1-2 sentences)
• Find a few example public datasets to demonstrate the above points. Cite the data
in the relevant places in the Bibliography according to the Data Citation Principles.
• Refer to this guide for more information about annotated bibliographies:
http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation
should be in the “Descriptive” style.
Final Project: Option 2
Using a relevant dataset (this can either be from the literature curation
exercise, a BYO dataset, or one given to you), write a report that includes a
description of the dataset, a Data Management Plan, and a guidelines
document for the researcher(s).
• Description of the dataset that explains the form of the data and the academic discipline in which it
was created. This paragraph should provide context for the DMP and guidelines document. (3-5 sentences)
• 1-2 page Data Management Plan following the guidelines from HKU or a granting body such as NSF.
• 1 page guidelines document that could be presented to the researcher(s) that provides
guidelines for their data (extant and forthcoming):
– Preservation
– Appraisal
– Documentation
• For the DMP and the guidelines document, you can extrapolate from your dataset to
imagine additional details about the research practices that created the dataset and will create
more data in the future.
• Look for suitable data repositories that can host this data (institutional, general purpose, or
subject specific), and if there is one relevant then publish the data if you have permission, and
correctly cite the data in the relevant places in your report. [disclaimer: only if you have permission]
Final Project: Option 3
Prepare a 30 minute data curation workshop that you could teach to
researchers that would provide them the necessary details to understand why
data curation is relevant to them and best practices they should follow.
• Slide deck that introduces data curation for a researcher audience. (No
more than 40 slides.)
• Presenter outline that describes the important points for each slide.
• Topics that might be addressed in your workshop: the value of data
management, writing a data management plan, data repository options.
You can assume your audience is researchers at HKU.
• Make sure all of the content is copyright free, and share the final material
openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient
metadata to make it discoverable.
Looking ahead…
• Submit 1 paragraph reflection on FAIR principles
through moodle forum
• Next class (22nd April) is hands-on curation
workshop with Dr Chris Hunter
– Bring laptops and any data you may have for a data
cleaning exercise
• Final project due 15th May
– Need to present preliminary version on 26th April to
get feedback before completion. Send me slides by
the 25th April so I can get them ready for the class

More Related Content

What's hot

Reproducible research: First steps.
Reproducible research: First steps. Reproducible research: First steps.
Reproducible research: First steps. Richard Layton
 
Recently uploaded (20)

WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
THE BEST IPTV in GERMANY for 2024: IPTVreel
THE BEST IPTV in  GERMANY for 2024: IPTVreelTHE BEST IPTV in  GERMANY for 2024: IPTVreel
THE BEST IPTV in GERMANY for 2024: IPTVreel
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Agentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdfAgentic RAG What it is its types applications and implementation.pdf
Agentic RAG What it is its types applications and implementation.pdf
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Buy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptxBuy Epson EcoTank L3210 Colour Printer Online.pptx
Buy Epson EcoTank L3210 Colour Printer Online.pptx
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 

HKU Data Curation MLIM7350 Class 8

  • 1. Class 8…making things FAIR 'if I have seen further it is by standing on the shoulders of giants'. Scott Edmunds, HKU Data Curation MLIM7350
  • 2. Communicating in-class • Chat channel: • http://backchannelchat.com/chat/dw131 • Let me know to slow down/speed up
  • 3. https://osf.io/cgpzb/ Open Science (Open Access & Open Data) survey of Hong Kong Reading/Reflection Most people mentioned training of librarians: Tak Hei Lam: “Training should be provided to librarians so that they have adequate knowledge about data curation and provide professional support and advice for the researchers to sharing of data. Also, librarians can provide training and workshop to change the mindset of the researcher not to rely on the impact factor but on other to other comprehensive research metrics such as PlumX” Lijia Yu: At the same time, in big data era, the research will be increasingly migrating to the cloud, so this should be done in an organized manner. Lots of talk on incentive systems & policy, but little on infrastructure other than: NEED FOR A PLAN/LEADERSHIP
  • 4. HKU Repeatability in HK Research Experiment (homework) Feedback? What have we found?
  • 5. HKU Repeatability in HK Research Experiment (homework)
  • 7. Interesting examples Several examples of restrictions with ID data http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165978
  • 8. Interesting examples Several examples of restrictions with ID data http://www.vox.com/2015/6/17/8796225/mers-virus-data-sharing
  • 9. Interesting examples Lots of data in Dryad, but 1 H7N9 example isn’t resolving http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148506
  • 10. Story so far • HKU publishing a lot of survey based research in PLOS • 3 examples from “Children of 1997” birth cohort. Access to data involves emailing DAC • External databases: 2 examples in Dryad data (one not working), 1 example in OSF, 1 example in scholarhub, lots in figshare • So far 2 have data with broken URLs, 1/3 are controlled access, 1/4 have summary but not raw data
  • 11. WHAT EXACTLY IS “RESEARCH DATA”?
  • 12. Research Data 1665? Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
  • 13. The long tail of scientific data… Esoteric formats, poorly structured, tabular, often spreadsheet-based. Issues the open data community is well used to (data cleaning, scraping, etc.)
  • 14. Science Data Volumes. Scale: Petabytes, 100’s of Petabytes, Exabytes. Sources shown: Sequencing, Mass Spec, Astrophysics, HE Physics, Biology Imaging, Square Kilometer Array, Large Hadron Collider
  • 15. Big Data in Healthcare http://dx.doi.org/10.1186/s13742-016-0117-6
  • 16. Big Data in Healthcare: challenges • 80% of health data unstructured (100’s of forms/formats) • Medical Imaging archives increasing 20-40% per year • Genomics data will increase data volumes exponentially • Patients expect extra privacy protection if they are going to fully participate in data driven research Source: https://www.healthcare.siemens.com/magazine/mso-big-data-and-healthcare-2.html
  • 17. Open Data in Physics 1961: CERN pre-prints shelf http://cerncourier.com/cws/article/cern/28654 1991-date: arXiv http://arxiv.org/
  • 18. Open Data in Earth Sciences https://pangaea.de/ Established 1987 (online since 1995)…
  • 19. Open Data in Earth Sciences #Climategate UEA emails “scandal” Is it possible to be too open?
  • 20. Closed Data in Chemistry
  • 21. Open Data in Biology 1934: newsletter era 1980: database era 1987: online era 2010’s: “bioinformatics bingo” era
  • 22. BGI HK Chamber O’Illumina’s The LHC of Biology? 20PB of storage
  • 23. Post-Human Genome Project 1st Gen 2nd (next) Gen Source: http://www.genome.gov/sequencingcosts/ (with apologies)
  • 24. Omes & more omes!
  • 25. Other Ome(s): mass spectrometry data https://en.wikipedia.org/wiki/Mass_spectrometry Nadina Wiórkiewicz
  • 26. Rise of mass spectrometry data https://doi.org/10.1093/nar/gkv1352
  • 27. Challenges: Rise of big imaging data http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3222.html
  • 28. Challenges: Rise of big imaging data https://openi.nlm.nih.gov/detailedresult.php?img=PMC3171117_JCB_201108095_RGB_Fig2&req=4 http://journals.sagepub.com/doi/10.1177/1087057114528537 HCS: High Content Screens AKA High Throughput Screening: High volumes, growing uptake – TBs of data New ways of sharing/publishing data with OMERO/JCB data viewer
  • 29. Imaging Challenges: 100s of formats http://www.openmicroscopy.org/site/products/bio-formats
  • 31. Sharing/reproducibility helped by stability of: 1. Platforms 2. Repositories 3. Standards (figure labels: 1st Gen, 2nd Gen)
  • 32. Genomics Data Sharing Policies… Bermuda Accords 1996/1997/1998: 1. Automatic release of sequence assemblies within 24 hours. 2. Immediate publication of finished annotated sequences. 3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society. Fort Lauderdale Agreement, 2003: 1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production. 2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria. Toronto International data release workshop, 2009: The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
  • 34. Three decades of sharing infrastructure: Genbank
  • 35. Scaling up of sharing: 1000 genomes http://www.internationalgenome.org/
  • 36. Three decades of sharing infrastructure: INSDC http://www.insdc.org/
  • 37. Sharing aids individuals Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308 Every 10 datasets collected contributes to at least 4 papers in the following 3 years. Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285 DOI: 10.1038/473285a
  • 38. Sharing aids fields… Rice v Wheat: consequences of publicly available genome data. [Chart: rice vs wheat, axis 0 to 700]
  • 39. Sharing aids growth of databases… http://scienceblogs.com/digitalbio/2015/01/30/bio-databases-2015/
  • 40. Sharing aids growth of standards… Why do we need standards? https://xkcd.com/927/
  • 41. Sharing aids growth of standards… Why do we need standards? http://www.biochemsoctrans.org/content/36/1/33
  • 42. Checklists aid the growth of sharing… http://www.equator-network.org/
  • 43. There are over 860 databases & 675 standards in the life sciences Formats Terminologies Guidelines
  • 44. Enablers: to better describe, share and query data. Guidelines = Minimum information reporting requirements, checklists o Report the same core, essential information o e.g. ARRIVE guidelines. Terminologies = Controlled vocabularies, taxonomies, thesauri, ontologies etc. o Unambiguously refer to an entity o e.g. Gene Ontology. Models/Formats = Conceptual model, conceptual schema, exchange formats o Allow data to flow from one system to another o e.g. FASTA
  • 46. Exercise: Use Biosharing (https://biosharing.org/) to answer the following. Potential collaborators would like to use your data. To share your work, are there standards you should follow? Are there specialized curated databases you can use? A. You work in the area of functional MRI imaging and are producing 100’s of GBs of fMRI brain scan data. B. You are an immunologist using flow cytometry to sort cells. C. You are a chemist looking at the 3D crystal structure of proteins using NMR. (Image credit: Sabban, Sari)
  • 50. Visualisations & DOIs for workflows http://www.gigasciencejournal.com/series/Galaxy
  • 51. Reward Open/Dynamic Workbooks (Open Documents): facilitate reproducibility, reuse & sharing, & publish outputs of Knitr, Sweave, Jupyter/iPython Notebook, etc.
  • 55. Open Source v Open Data Licenses https://opensource.org/licenses Same ethos (open source begat open data), different contexts • OSS designed for continuing development, OD for making objects available • IP issues. Software can be patented, data (generally) can’t • More business models for software than data (so far…) • Wider selection of OSS licenses, and more options to fine-tune access (Linking, Distribution, Modification, Sublicensing, Patents/Trademarks, etc.)
  • 56. Questions to ask? • Now that researchers are producing such large & heterogeneous datasets, what do you think the challenges are for producers and users? • What are the legal implications of mixing data and software? • What do you think the security issues of accessing these complex combined research objects are?
  • 57. Questions? | 15 minute break
  • 58. Research Data: Pop Quiz What was #climategate? What is the INSDC, and who are the three INSDC partners? What is the estimated yearly growth of medical imaging data? What are bioboxes? How many databases are currently listed in biosharing? Which of the reporting guidelines/checklists are for A) animals, B) biological science, and C) clinical research: MIBBI, ARRIVE and Equator
  • 59. ETHICS & DATA SECURITY ISSUES
  • 61. Ethics: clinical trials need registration http://www.hkuctr.com/
  • 62. Ethics: need informed consent http://www.med.hku.hk/images/document/04research/institution/5QMH_IRB_GUIDANCE_NOTES_FOR_THE_PREPARATION_OF_PATIENT_CONSENT.pdf HKU Guideline Notes - for Preparation of Subject Information Sheet & Informed Consent Form: WILL MY TAKING PART IN THIS STUDY BE KEPT CONFIDENTIAL? You will need to obtain the patient’s permission to allow restricted access to their medical records and to the information collected about them in the course of the study. You should explain that all information collected about them will be kept strictly confidential. A suggested form of words is: “All information which is collected about you during the course of the research will be kept strictly confidential. Any information about you which leaves the hospital/surgery will have your name and address removed so that you cannot be recognised from it.” Where does data sharing fit into this?
  • 63. Ethics: includes animal research http://www.med.hku.hk/research/research-ethics/animal-ethics-culatr
  • 64. Ethics: includes animal research https://www.nc3rs.org.uk/arrive-guidelines
  • 66. Lots of tools available: anonymisation https://www.ukdataservice.ac.uk/manage-data/tools-and-templates
  • 67. Lots of tools available: encryption https://www.brookes.ac.uk/Research/Research-ethics/Encrypting-files/
  • 68. Lots of tools available: DAC & brokering https://blog.repositive.io/getting-data-out-of-the-ega/
  • 69. Lots of tools available: DAC & brokering http://www.ckbiobank.org/site/
  • 70. Lots of tools available: DAC & brokering http://www.ckbiobank.org/site/
  • 71. Kinds of identifying information • Direct identifiers – Names, addresses, postcode information, telephone numbers or pictures • Indirect identifiers – In combination with other information, would identify e.g. information on workplace, occupation or exceptional values of characteristics like salary or age http://www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation
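The distinction between direct and indirect identifiers can be sketched in a few lines of Python. This is a minimal illustration, not a complete anonymisation procedure: the column names, the example record and the ten-year age bands are all hypothetical, and removing or coarsening fields in this way does not by itself guarantee that a person cannot be re-identified.

```python
# Hypothetical column names for direct identifiers; adjust to the real dataset.
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "telephone", "photo"}


def band_age(age, width=10):
    """Coarsen an exact age (an indirect identifier) into a band, e.g. 47 -> '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymise(record):
    """Drop direct identifiers and coarsen the 'age' field of one record."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = band_age(cleaned["age"])
    return cleaned


# Made-up survey record, for illustration only.
record = {
    "name": "Chan Tai Man",
    "telephone": "2345 6789",
    "age": 47,
    "occupation": "nurse",
    "response": 4,
}
print(anonymise(record))
```

Note that "occupation" is left untouched here even though, combined with other fields, it may still identify someone; deciding which indirect identifiers to coarsen or suppress is a judgement call for each dataset.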
  • 73. Anonymising audio-visual data • Anonymisation of audio-visual data, such as editing of digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch in a recording, or obscuring faces by pixellating sections of a video image significantly reduces the usefulness of data. These processes are also highly labour intensive and expensive. • If confidentiality of audio-visual data is an issue, it is better to obtain the participant's consent to use and share the data unaltered. Where anonymisation would result in too much loss of data content, regulating access to data can be considered as a better strategy. • We urge researchers to consider and judge at an early stage the implications of depositing materials containing confidential information and to get in touch to consult on any potential issues. https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative
  • 74. Considerations for medical imaging https://openfmri.org/de-identification/ https://sourceforge.net/projects/privacyguard/ MRI brain scans first undergo skull stripping. Automated defacing tools are required beyond this. Also need to ensure DICOM (Digital Imaging and Communications in Medicine) metadata passes through a de-identification toolkit
  • 75. Considerations for medical images https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-15-21 https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-11-26 • Sharing of clinical images crucial in understanding “phenotypes” • Require “consent to publish”, but challenges doing this with ill people, children, elderly, and disadvantaged • Further challenges in era of social media, open access and Wikipedia • Security issues protecting signed consent forms
  • 76. Not just a metadata problem… http://science.sciencemag.org/content/339/6117/321
  • 77. Extra considerations for HK Hospital Authority restrictions on data • Have to apply to Hospital Authority to access public health data • Only approved 14 data requests (as of May 2016) • If approved requires data recovery charges (collect $250,000 HKD a year from this) • Can publish aggregate/summary data in journals, but not share data • Only approves academic use, not citizens or industry/pharma Via FOI request: https://accessinfo.hk/en/request/request_for_statistics_on_data_c
  • 78. Extra considerations for China Human genetic data needs MOST approval Article 2: The term "human genetic resources" in the Measures refers to the genetic materials such as human organs, tissues, cells, blood specimens, preparations of any types or recombinant DNA constructs, which contain human genome, genes or gene products as well as to the information related to such materials. Second, any international collaborative project involving Chinese human genetic resources, for example international research cooperation and exporting human genetic resources or taking such resources outside of the territory of China, shall apply to MOST for examination and approval prior to entering into an official contract. And Chinese collaborating party shall be responsible for going through the due formalities of application for approval. (See Article 11)
  • 80. Can this data be easily de-identified & shared? http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152381 “The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish their images. Following approval by the Institutional Review Board (IRB) of The University of Hong Kong and Hospital Authority Hong Kong West Cluster (UW 14–159); 20 individuals, 10 male and 10 female volunteers, were properly instructed and gave consent to participate in this study by signing the appropriate informed consent paperwork.”
  • 81. FAIR or unfair? Principled publishing for data. 公平 或者不公平? 数据发表的原则
  • 82. What is FAIR (公平的)? Adverb Without cheating or trying to achieve unjust advantage. ‘no one could say he played fair’ Adjective Treating people equally without favouritism or discrimination. ‘the group has achieved fair and equal representation for all its members’ ‘a fairer distribution of wealth’ fair /fɛː/
  • 83. Nature 475, 267 (2011) http://www.nature.com/news/2011/110720/full/475267a.html “Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.” “There have been widespread complaints from scientists inside and outside China about this lack of transparency.” “Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.” Is this FAIR? 这是FAIR?
  • 84. FAIR questions to ask? Is the raw data publicly available? Are the reagents (plasmids, cells, antibodies, etc.) available? Are detailed protocols available? Can I access the processed data & results (supporting the figures)? Was this all available BEFORE publication to the peer reviewers? Can I inspect the peer reviews? Can I publish/link +/-ve replication experiments to this?
  • 85. A more FAIR approach: Open Data?
  • 86. Research Objects: a concept & model http://www.researchobject.org/ • Supporting publication of more than just PDFs, making data, code, & other resources first class citizens of scholarship. • Recognizing that there is often a need to publish collections of these resources together as one shareable, cite-able resource. • Enriching these resources and collections with any & all additional information required to make research reusable, & reproducible!
  • 87. Importance of metadata: context (& discoverability) https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming https://twitter.com/AlisonMcNab/status/751375987624009728/photo/1 ?
  • 88. Novel tools/formats for data interoperability/handling: ISA Importance of metadata: context (& discoverability)
  • 89. Importance of granularity: where do you set it? Experiment (e.g. International Cancer Genome Consortium): e.g. doi:10.5524/100001. Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2. Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz. Smaller still? Papers, Data/Micropubs, Nanopubs, Facts/Assertions (~10^13 in literature)
  • 92. A nanopub represents structured data along with its provenance in a single publishable & citable entity. http://nanopub.org/ Nanopublication parts: Assertion: an individual association between concepts (statement or declaration; measurement; hypothetical inference; quantitative or qualitative). Provenance: how this assertion came to be, methods, evidence, context, etc. PublicationInfo: detailed attribution for authors, institutions, lab technicians, curators; license info; publication date. URL: unique, persistent and resolvable identifier. Integrity Key: guarantee immutability after publication. [Diagram terms: association a sio:statisticalAssociation; sio:has-measurementValue Association_1_p_value, a sio:probability-value with sio:has-value 6.56e-5^^xsd:float; sio:refers-to …; opm:wasDerivedFrom; opm:wasGeneratedBy; dcterms:created; pav:authoredBy; dcterms:DOI]
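The three-part shape of a nanopublication (assertion, provenance, publication info) plus an integrity key can be sketched in plain Python. This is only an illustration of the idea: real nanopublications are expressed as RDF graphs, and the class name, field names, SHA-256 hashing scheme and example values below are assumptions made for this sketch rather than the nanopub.org format. Only the p-value 6.56e-5 is taken from the slide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Nanopub:
    """Hypothetical plain-Python stand-in for a nanopublication."""

    assertion: dict         # the single claim being made
    provenance: dict        # how the assertion came to be
    publication_info: dict  # attribution, license, publication date

    def integrity_key(self) -> str:
        """Hash a canonical JSON form so that any later change is detectable."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example values are invented, apart from the p-value shown on the slide.
example = Nanopub(
    assertion={"type": "statistical association", "p_value": 6.56e-5},
    provenance={"derived_from": "hypothetical-source-dataset"},
    publication_info={
        "created": "2017-01-01",
        "authored_by": "hypothetical-author",
        "license": "CC0",
    },
)

# Changing the assertion after publication produces a different key.
tampered = replace(example, assertion={"type": "statistical association", "p_value": 0.05})

print(example.integrity_key())
print(example.integrity_key() == tampered.integrity_key())
```

Publishing the key alongside the nanopub lets a reader recompute the hash and confirm that the assertion, provenance and publication info are unchanged.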
  • 93. Lots of models/standards/guidelines (e.g. 5★ open data). Where does that leave us?
  • 94. A mnemonic to remember: FAIR 一个帮助记忆的词语:FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ Findable 可发现的 Accessible 可得到的 Interoperable 能共同使用的 Reusable 可以再度使用的 Lots of models/standards/guidelines Where does that leave us?
  • 95. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/
  • 96. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Findable: F1. (meta)data are assigned a globally unique and persistent identifier F2. data are described with rich metadata (defined by R1 below) F3. metadata clearly and explicitly include the identifier of the data it describes F4. (meta)data are registered or indexed in a searchable resource
  • 97. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Accessible: A1. (meta)data are retrievable by their identifier using a standardized communications protocol A1.1 the protocol is open, free, and universally implementable A1.2 the protocol allows for an authentication and authorization procedure, where necessary A2. metadata are accessible, even when the data are no longer available
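Principle A1 can be made concrete with DOIs, which are retrievable over HTTP(S) through the doi.org resolver. The sketch below only normalises the identifier styles that appear in these slides into a resolver URL; the function name and the list of accepted prefixes are assumptions for illustration, no network request is made, and special characters in the DOI are not URL-encoded.

```python
def doi_to_url(identifier: str) -> str:
    """Turn 'doi:10.x/y', 'DOI:10.x/y' or an older resolver URL into https://doi.org/10.x/y."""
    doi = identifier.strip()
    # Strip the prefix styles seen in the slides (hypothetical, non-exhaustive list).
    for prefix in ("doi:", "DOI:", "https://doi.org/", "http://dx.doi.org/"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # Every DOI starts with the directory indicator '10.' and has a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {identifier!r}")
    return "https://doi.org/" + doi


for raw in ("doi:10.5524/100001", "http://dx.doi.org/10.5524/100181", "DOI:10.1186/2047-217X-1-18"):
    print(doi_to_url(raw))
```

Fetching the resulting URL with any HTTP client would then exercise the "standardized communications protocol" part of A1; whether the target still resolves is exactly what the homework exercise on broken links was testing.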
  • 98. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Interoperable: I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. I2. (meta)data use vocabularies that follow FAIR principles I3. (meta)data include qualified references to other (meta)data
  • 99. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Reusable: R1. (meta)data are richly described with a plurality of accurate and relevant attributes R1.1. (meta)data are released with a clear and accessible data usage license R1.2. (meta)data are associated with detailed provenance R1.3. (meta)data meet domain-relevant community standards
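The fifteen principle identifiers listed on the four slides above can be used as a simple checklist when judging a dataset. The sketch below is one hypothetical way to tally them per letter; the tally is not an official FAIR score, and the set of principles marked as met is invented for illustration.

```python
# Principle identifiers as listed on the Findable/Accessible/Interoperable/Reusable slides.
FAIR_PRINCIPLES = {
    "F": ["F1", "F2", "F3", "F4"],
    "A": ["A1", "A1.1", "A1.2", "A2"],
    "I": ["I1", "I2", "I3"],
    "R": ["R1", "R1.1", "R1.2", "R1.3"],
}


def summarise(met):
    """Return 'met/total' per letter for a set of principle identifiers judged as met."""
    known = {p for group in FAIR_PRINCIPLES.values() for p in group}
    unknown = set(met) - known
    if unknown:
        raise ValueError(f"unknown principle(s): {sorted(unknown)}")
    return {
        letter: f"{sum(p in met for p in group)}/{len(group)}"
        for letter, group in FAIR_PRINCIPLES.items()
    }


# Invented assessment: a dataset with a DOI, indexed, retrievable over HTTP, with a license.
met = {"F1", "F4", "A1", "A1.1", "R1.1"}
print(summarise(met))
```

A tally like this is mainly useful for comparing datasets against each other or for spotting which letter is weakest; deciding whether a given principle is actually met still needs human judgement.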
  • 100. Beyond a mnemonic: FAIR ecosystems FAIRifier tool
  • 101. Beyond a mnemonic: FAIR ecosystems • A particular class of FAIR Data System to provide support for data interoperability; • Supports publication, search and access to FAIR data; • Fosters an ecosystem of applications and services; • Federated architecture: different FAIRports (and other FAIR Data Systems) are interconnectable; • Supports citations of datasets and data items; • Provides metrics for data usage and citation. A ‘FAIRpoint or FAIRport’ can be any specific data instance following FAIR data principles. http://www.datafairport.org/
  • 102. Beyond a mnemonic: FAIR ecosystems http://www.datafairport.org/ ?
  • 103. Beyond a mnemonic: FAIR ecosystems https://www.fair-access.net.au/fair-statement “By 2020, Australian publicly funded researchers and research organisations will have in place policies, standards and practices to make publicly funded research outputs findable, accessible, interoperable and reusable.”
  • 104. More FAIR mnemonics: “BYODs” DTL/ELIXIR-NL “Bring Your Own Data Party” v GigaScience/BGI HK Metabolomics ISA-TAB-athon
  • 105. FAIR Data in the wild Taking a microscope to the publication process
  • 107. How FAIR can we get? 如何获取FAIR? Data sets Analyses Open-Paper Open-Review DOI:10.1186/2047-217X-1-18 >50,000 accesses & 885 citations Open-Code 7 reviewers tested data in ftp server & named reports published DOI:10.5524/100044 Open-Pipelines Open-Workflows DOI:10.5524/100038 Open-Data 78GB CC0 data Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/ >40,000 downloads Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2
  • 108. Can we reproduce results? SOAPdenovo2 S. aureus pipeline
  • 109. The SOAPdenovo2 Case study Subject to and test with 3 models: Data Method/Experimental protocol Findings Types of resources in an RO ISA-TAB/ISA2OWL Nanopublication Wfdesc/ISA-TAB/ISA2OWL Models to describe each resource type
  • 111. 1. While there are huge improvements to the quality of the resulting assemblies, other than the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1. 2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer. 3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer. 4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
  • 112. Lessons Learned 经验教训 • Most published research findings are false. Or at least have errors • With enough effort it is possible to push button(s) & recreate a result from a paper with current tools • Being FAIR can be COSTLY. How much are you willing to spend? Who will build FAIR infrastructure? • Much easier to make things FAIR before rather than after publication. BYODs useful intermediate here
  • 113. http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html “The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: “is it FAIR”?”
  • 115. Levels of FAIRness: A-F of FAIR data In class activity: How FAIR is this data? 1. Data from: Live poultry exposure and public response to influenza A(H7N9) in urban and rural China during two epidemic waves in 2013-2014 http://hub.hku.hk/cris/dataset/dataset93128 2. Supporting data for “Genomic analyses reveal FAM84B and the NOTCH pathway are associated with the progression of esophageal squamous cell carcinoma” http://dx.doi.org/10.5524/100181 3. Linked Drug-Drug Interactions (LIDDI) https://datahub.io/dataset/linked-drug-drug-interactions-liddi http://content.iospress.com/articles/information-services-and-use/isu824
  • 116. Reflection: how fair is FAIR? Read the FAIR principles paper. Do you think they are applicable and feasible for HK? If it is feasible, what is needed to implement them? http://www.nature.com/articles/sdata201618
  • 117. Any questions? Does anyone have BYO data for the curation/cleaning workshop?
  • 118. Final Project • For the final project for this course, you can choose from 3 assignment options. • The assignment is due on the 15th May and it is worth 40% of your grade. • Time will be set aside for presenting on this during the final class on the 24th April: covering why you chose the option, what discipline/dataset/topic you are covering, and what work you've done so far (5 mins per student including any group feedback)
  • 119. Final Project: Option 1 Write an Annotated Bibliography about data curation practices in an academic discipline of your choosing. • Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of “open data.” • Summarize data practices in your chosen discipline or topic. (5-7 sentences) • Find 7-10 sources that relate that discipline or topic to data creation, management, and/or curation. • Provide a citation for the source in APA style. • Write a short annotation that summarizes the content of the source. You may include quotes from the source sparingly, but the annotations should be mostly, if not entirely, in your own words. (3-5 sentences) • Explain the relevance of the source with relation to the data practices of your chosen discipline or topic. (1-2 sentences) • Find a few example public datasets to demonstrate the above points. Cite the data in the relevant places in the Bibliography according to the Data Citation Principles. • Refer to this guide for more information about annotated bibliographies: http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation should be in the “Descriptive” style.
  • 120. Final Project: Option 2 Using a relevant dataset (this can either be from the literature curation exercise, a BYO dataset, or one given to you), write a report that includes a description of the dataset, a Data Management Plan, and a guidelines document for the researcher(s). • Describe the dataset in a paragraph that explains the form of the data and the academic discipline in which it was created. This paragraph should provide context for the Data Management Plan and the guidelines document. (3-5 sentences) • 1-2 page Data Management Plan following the guidelines from HKU or a granting body such as NSF. • 1 page guidelines document that could be presented to the researcher(s) that provides guidelines for their data (extant and forthcoming): – Preservation – Appraisal – Documentation • For the DMP and the guidelines document, you can extrapolate from your dataset to imagine additional details about the research practices that created the dataset and will create more data in the future. • Look for suitable data repositories that can host this data (institutional, general purpose, or subject specific). If a relevant one exists, publish the data there, and correctly cite the data in the relevant places in your report. [disclaimer: publish only if you have permission]
  • 121. Final Project: Option 3 Prepare a 30-minute data curation workshop that you could teach to researchers, giving them the details they need to understand why data curation is relevant to them and which best practices they should follow. • Slide deck that introduces data curation for a researcher audience. (No more than 40 slides.) • Presenter outline that describes the important points for each slide. • Topics that might be addressed in your workshop: the value of data management, writing a data management plan, data repository options. You can assume your audience is researchers at HKU. • Make sure all of the content is copyright free, and share the final material openly (e.g. figshare, scholarhub, OER commons, etc.) with sufficient metadata to make it discoverable.
  • 122. Looking ahead… • Submit a 1 paragraph reflection on the FAIR principles through the Moodle forum • Next class (22nd April) is a hands-on curation workshop with Dr Chris Hunter – Bring laptops and any data you may have for a data cleaning exercise • Final project due 15th May – Need to present a preliminary version on 26th April to get feedback before completion. Send me slides by the 25th April so I can get them ready for the class