HKU Data Curation MLIM7350 Class 8


Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.

  1. 1. Class 8…making things FAIR 'if I have seen further it is by standing on the shoulders of giants'. Scott Edmunds, HKU Data Curation MLIM7350
  2. 2. Communicating in-class • Chat channel: • http://backchannelchat.com/chat/dw131 • Let me know to slow down/speed up
  3. 3. https://osf.io/cgpzb/ Open Science (Open Access & Open Data) survey of Hong Kong Reading/Reflection Most people mentioned training of librarians: Tak Hei Lam: “Training should be provided to librarians so that they have adequate knowledge about data curation and provide professional support and advice for the researchers to sharing of data. Also, librarians can provide training and workshop to change the mindset of the researcher not to rely on the impact factor but on other comprehensive research metrics such as PlumX” Lijia Yu: At the same time, in big data era, the research will be increasingly migrating to the cloud, so this should be done in an organized manner. Lots of talk on incentive systems & policy, but little on infrastructure other than: NEED FOR A PLAN/LEADERSHIP
  4. 4. HKU Repeatability in HK Research Experiment (homework) Feedback? What have we found?
  5. 5. HKU Repeatability in HK Research Experiment (homework)
  6. 6. Interesting examples http://hub.hku.hk/handle/10722/208585 Is data in a HKU thesis sufficient?
  7. 7. Interesting examples Several examples of restrictions with ID data http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165978
  8. 8. Interesting examples Several examples of restrictions with ID data http://www.vox.com/2015/6/17/8796225/mers-virus-data-sharing
  9. 9. Interesting examples Lots of data in Dryad, but 1 H7N9 example isn’t resolving http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148506
  10. 10. Story so far • HKU publishing a lot of survey based research in PLOS • 3 examples from “Children of 1997” birth cohort. Access to data involves emailing DAC • External databases: 2 examples in Dryad data (one not working), 1 example in OSF, 1 example in scholarhub, lots in figshare • So far 2 have data with broken URLs, 1/3 are controlled access, 1/4 have summary but not raw data
  11. 11. WHAT EXACTLY IS “RESEARCH DATA”?
  12. 12. Research Data 1665? Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
  13. 13. The long tail of scientific data… Esoteric formats, poorly structured. Tabular, often spreadsheet based. Issues the open data community is well used to (data cleaning, scraping, etc.)
  14. 14. Science Data Volumes [Figure: data volumes by field on a scale of Petabytes, 100’s of Petabytes and Exabytes. Biology: Sequencing, Mass Spec, Imaging. HE Physics: Large Hadron Collider. Astrophysics: Square Kilometer Array.]
  15. 15. Big Data in Healthcare http://dx.doi.org/10.1186/s13742-016-0117-6
  16. 16. Big Data in Healthcare: challenges • 80% of health data unstructured (100’s of forms/formats) • Medical Imaging archives increasing 20-40% per year • Genomics data will increase data volumes exponentially • Patients expect extra privacy protection if they are going to fully participate in data driven research Source: https://www.healthcare.siemens.com/magazine/mso-big-data-and-healthcare-2.html
  17. 17. Open Data in Physics 1961 CERN pre-prints shelf http://cerncourier.com/cws/article/cern/28654 http://arxiv.org/ 1991-date arXiv
  18. 18. Open Data in Earth Sciences https://pangaea.de/ Established 1987 (online since 1995)…
  19. 19. Open Data in Earth Sciences #Climategate UEA emails “scandal” Is it possible to be too open?
  20. 20. Closed Data in Chemistry
  21. 21. Open Data in Biology 1934: newsletter era 1980: database era 1987: online era 2010’s: “bioinformatics bingo” era
  22. 22. BGI HK Chamber O’Illumina’s The LHC of Biology? 20PB of storage
  23. 23. Post-Human Genome Project 1st Gen 2nd (next) Gen Source: http://www.genome.gov/sequencingcosts/ (with apologies)
  24. 24. Omes & more omes!
  25. 25. Other Ome(s): mass spectrometry data https://en.wikipedia.org/wiki/Mass_spectrometry Nadina Wiórkiewicz
  26. 26. Rise of mass spectrometry data https://doi.org/10.1093/nar/gkv1352
  27. 27. Challenges: Rise of big imaging data http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3222.html
  28. 28. Challenges: Rise of big imaging data https://openi.nlm.nih.gov/detailedresult.php?img=PMC3171117_JCB_201108095_RGB_Fig2&req=4 http://journals.sagepub.com/doi/10.1177/1087057114528537 HCS: High Content Screens AKA High Throughput Screening: High volumes, growing uptake – TBs of data New ways of sharing/publishing data with OMERO/JCB data viewer
  29. 29. Imaging Challenges: 100s of formats http://www.openmicroscopy.org/site/products/bio-formats
  30. 30. V Genomics: open-data success story?
  31. 31. Sharing/reproducibility helped by stability of: 1. Platforms 2. Repositories 3. Standards 1st Gen 2nd Gen
  32. 32. Genomics Data Sharing Policies… Bermuda Accords 1996/1997/1998: 1. Automatic release of sequence assemblies within 24 hours. 2. Immediate publication of finished annotated sequences. 3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society. Fort Lauderdale Agreement, 2003: 1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production. 2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria. Toronto International data release workshop, 2009: The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
  33. 33. https://doi.org/10.1093/gigascience/giw003
  34. 34. Three decades of sharing infrastructure: Genbank
  35. 35. Scaling up of sharing: 1000 genomes http://www.internationalgenome.org/
  36. 36. Three decades of sharing infrastructure: INSDC http://www.insdc.org/
  37. 37. Sharing aids individuals Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308 Sharing Detailed Research Data Is Associated with Increased Citation Rate. Every 10 datasets collected contributes to at least 4 papers in the following 3-years. Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment Nature, 473 (7347), 285-285 DOI: 10.1038/473285a
  38. 38. Sharing aids fields… Rice v Wheat: consequences of publicly available genome data. [Bar chart: rice v wheat, axis 0 to 700]
  39. 39. Sharing aids growth of databases… http://scienceblogs.com/digitalbio/2015/01/30/bio-databases-2015/
  40. 40. Sharing aids growth of standards… Why do we need standards? https://xkcd.com/927/
  41. 41. Sharing aids growth of standards… Why do we need standards? http://www.biochemsoctrans.org/content/36/1/33
  42. 42. Checklists aid the growth of sharing… http://www.equator-network.org/
  43. 43. There are over 860 databases & 675 standards in the life sciences Formats Terminologies Guidelines
  44. 44. Guidelines = Minimum information reporting requirements, checklists o Report the same core, essential information o e.g. ARRIVE guidelines Terminologies = Controlled vocabularies, taxonomies, thesauri, ontologies etc. o Unambiguously refer to an entity o e.g. Gene Ontology Models/Formats = Conceptual model, conceptual schema, exchange formats o Allow data to flow from one system to another o e.g. FASTA Enablers: to better describe, share and query data Formats Terminologies Guidelines
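
To make the "Models/Formats" point above concrete, here is a minimal sketch of reading FASTA text with plain Python. The function name `parse_fasta` and the sample records are made up for illustration; in practice an established bioinformatics library would normally be used instead.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the record ID
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[current_id] += line
    return records


# Hypothetical two-record example, not taken from any real dataset
example = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2
GGCC
"""
parsed = parse_fasta(example.splitlines())
```

Because the format is this simple and widely agreed, the same file can move between systems without custom converters, which is the interoperability benefit the slide describes.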
  45. 45. https://biosharing.org/ Need for databases of databases
  46. 46. Exercise: Use Biosharing to answer the following? To share your work are there standards you should follow? Are there specialized curated databases you can use? A. You work in the area of functional MRI imaging and are producing 100’s of GBs of fMRI brain scan data. B. You are an immunologist using flow cytometry to sort cells. C. You are a chemist looking at the 3D crystal structure of proteins using NMR https://biosharing.org/ Potential collaborators would like to use your data. Sabban, Sari
  47. 47. Sharing Open Data
  48. 48. [Figure: the research cycle from Idea and Study through Data, Metadata, Methods, software, Analysis (Pipelines) and Workflows/Environments to Answer and Publication. Rewarding each output with the DOI, etc.]
  49. 49. gigagalaxy.net Workflows Reward Sharing of Workflows
  50. 50. Visualisations & DOIs for workflows http://www.gigasciencejournal.com/series/Galaxy
  51. 51. Facilitate reproducibility, reuse & sharing & publish outputs of: Knitr, Sweave, Jupyter/iPython Notebook, etc. Open Documents Reward Open/Dynamic Workbooks
  52. 52. Virtual Machines/containers
  53. 53. http://dx.doi.org/10.1186/s13742-015-0087-0 :standardised containers
  54. 54. https://opensource.org/licenses
  55. 55. https://opensource.org/licenses Open Source v Open Data Licenses Same ethos (open source begat open data), different contexts • OSS designed for continuing development, OD for making objects available • IP issues. Software can be patented, data (generally) can’t • More business models for software than data (so far…) • Wider selection of OSS licenses, and more options to fine-tune access (Linking, Distribution, Modification, Sublicensing, Patents/Trademarks, etc.)
  56. 56. • Now researchers are producing such large & heterogeneous datasets, what do you think the challenges are for producers and users? • What are the legal implications of mixing data and software? • What do you think the security issues of accessing these complex combined research objects are? Questions to ask?
  57. 57. Questions? | 15 minute break
  58. 58. Research Data: Pop Quiz What was #climategate? What is the INSDC, and who are the three INSDC partners? What is the estimated yearly growth of medical imaging data? What are bioboxes? How many databases are currently listed in biosharing? Which of the reporting guidelines/checklists are for A) animals, B) biological science, and C) clinical research: MIBBI, ARRIVE and Equator
  59. 59. ETHICS & DATA SECURITY ISSUES
  60. 60. Ethics: needs approval http://www.rss.hku.hk/integrity/ethics-compliance
  61. 61. Ethics: clinical trials need registration http://www.hkuctr.com/
  62. 62. Ethics: need informed consent http://www.med.hku.hk/images/document/04research/institution/5QMH_IRB_GUIDANCE_NOTES_FOR_THE_PREPARATION_OF_PATIENT_CONSENT.pdf HKU Guideline Notes - for Preparation of Subject Information Sheet & Informed Consent Form: WILL MY TAKING PART IN THIS STUDY BE KEPT CONFIDENTIAL? You will need to obtain the patient’s permission to allow restricted access to their medical records and to the information collected about them in the course of the study. You should explain that all information collected about them will be kept strictly confidential. A suggested form of words is: “All information which is collected about you during the course of the research will be kept strictly confidential. Any information about you which leaves the hospital/surgery will have your name and address removed so that you cannot be recognised from it.” Where does data sharing fit into this?
  63. 63. Ethics: includes animal research http://www.med.hku.hk/research/research-ethics/animal-ethics-culatr
  64. 64. Ethics: includes animal research https://www.nc3rs.org.uk/arrive-guidelines
  65. 65. Lots of tools available: anonymisation https://www.ukdataservice.ac.uk/manage-data/tools-and-templates
  66. 66. Lots of tools available: encryption https://www.brookes.ac.uk/Research/Research-ethics/Encrypting-files/
  67. 67. Lots of tools available: DAC & brokering https://blog.repositive.io/getting-data-out-of-the-ega/
  68. 68. Lots of tools available: DAC & brokering http://www.ckbiobank.org/site/
  69. 69. Lots of tools available: DAC & brokering http://www.ckbiobank.org/site/
  70. 70. Kinds of identifying information • Direct identifiers – Names, addresses, postcode information, telephone numbers or pictures • Indirect identifiers – In combination with other information, would identify e.g. information on workplace, occupation or exceptional values of characteristics like salary or age http://www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation
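
The distinction above between direct and indirect identifiers can be sketched in code. This is a minimal illustration only: the field names, the `deidentify` and `age_band` helpers, and the sample record are all assumptions invented for the example, and real anonymisation needs a disclosure-risk review rather than a fixed field list.

```python
from typing import Any, Dict

# Hypothetical field names for direct identifiers, chosen for illustration
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "telephone", "photo"}


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: Dict[str, Any]) -> Dict[str, Any]:
    """Drop direct identifiers and coarsen one indirect identifier (age)."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if isinstance(cleaned.get("age"), int):
        cleaned["age"] = age_band(cleaned["age"])
    return cleaned


# Made-up participant record
participant = {
    "name": "A. Example",
    "postcode": "XX1 1XX",
    "age": 47,
    "occupation": "nurse",
    "response": 3,
}
safe = deidentify(participant)
```

Note that `occupation` survives untouched here. As the slide says, indirect identifiers only identify someone in combination with other information, so whether to keep, coarsen or drop them is a judgement call per dataset.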
  71. 71. De-identification #101
  72. 72. Anonymising audio-visual data • Anonymisation of audio-visual data, such as editing of digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch in a recording, or obscuring faces by pixellating sections of a video image significantly reduces the usefulness of data. These processes are also highly labour intensive and expensive. • If confidentiality of audio-visual data is an issue, it is better to obtain the participant's consent to use and share the data unaltered. Where anonymisation would result in too much loss of data content, regulating access to data can be considered as a better strategy. • We urge researchers to consider and judge at an early stage the implications of depositing materials containing confidential information and to get in touch to consult on any potential issues. https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative
  73. 73. Considerations for medical imaging https://openfmri.org/de-identification/ https://sourceforge.net/projects/privacyguard/ Need to ensure DICOM (Digital Imaging and Communications in Medicine) metadata also passes through a de-identification toolkit. MRI brain scans first undergo skull stripping. Automated Defacing Tools required beyond this
  74. 74. Considerations for medical images https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-15-21 https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-11-26 • Sharing of clinical images crucial in understanding “phenotypes” • Require “consent to publish”, but challenges doing this with ill people, children, elderly, and disadvantaged • Further challenges in era of social media, open access and Wikipedia • Security issues protecting signed consent forms
  75. 75. Not just a metadata problem… http://science.sciencemag.org/content/339/6117/321
  76. 76. Extra considerations for HK Hospital Authority restrictions on data • Have to apply to Hospital Authority to access public health data • Only approved 14 data requests (as of May 2016) • If approved requires data recovery charges (collect $250,000 HKD a year from this) • Can publish aggregate/summary data in journals, but not share data • Only approves academic use, not citizens or industry/pharma Via FOI request: https://accessinfo.hk/en/request/request_for_statistics_on_data_c
  77. 77. Extra considerations for China Human genetic data needs MOST approval Article 2: The term "human genetic resources" in the Measures refers to the genetic materials such as human organs, tissues, cells, blood specimens, preparations of any types or recombinant DNA constructs, which contain human genome, genes or gene products as well as to the information related to such materials. Second, any international collaborative project involving Chinese human genetic resources, for example international research cooperation and exporting human genetic resources or taking such resources outside of the territory of China shall apply to MOST for examination and approval prior to entering into an official contract. And Chinese collaborating party shall be responsible for going through the due formalities of application for approval. (See Article 11)
  78. 78. http://www.chinadaily.com.cn/china/2010-08/12/content_11141879.htm Foshan, 2010 Extra considerations for China
  79. 79. Can this data be easily de-identified & shared? http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152381 “The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish their images. Following approval by the Institutional Review Board (IRB) of The University of Hong Kong and Hospital Authority Hong Kong West Cluster (UW 14– 159); 20 individuals, 10 male and 10 female volunteers, were properly instructed and gave consent to participate in this study by signing the appropriate informed consent paperwork. “
  80. 80. FAIR or unfair? Principled publishing for data. 公平 或者不公平? 数据发表的原则
  81. 81. What is FAIR (公平的)? fair /fɛː/ Adjective: Treating people equally without favouritism or discrimination. ‘the group has achieved fair and equal representation for all its members’ ‘a fairer distribution of wealth’ Adverb: Without cheating or trying to achieve unjust advantage. ‘no one could say he played fair’
  82. 82. Is this FAIR? 这是FAIR? Nature 475, 267 (2011) http://www.nature.com/news/2011/110720/full/475267a.html “Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.” “There have been widespread complaints from scientists inside and outside China about this lack of transparency.” “Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”
  83. 83. FAIR questions to ask? Is the raw data publicly available? Are the reagents (plasmids, cells, antibodies, etc.) available? Are detailed protocols available? Can I access the processed data & results (supporting the figures)? Was this all available BEFORE publication to the peer reviewers? Can I inspect the peer reviews? Can I publish/link +/-ve replication experiments to this?
  84. 84. A more FAIR approach: Open Data?
  85. 85. Research Objects: a concept & model http://www.researchobject.org/ • Supporting publication of more than just PDFs, making data, code, & other resources first class citizens of scholarship. • Recognizing that there is often a need to publish collections of these resources together as one shareable, cite-able resource. • Enriching these resources and collections with any & all additional information required to make research reusable, & reproducible!
  86. 86. Importance of metadata: context (& discoverability) https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming https://twitter.com/AlisonMcNab/status/751375987624009728/photo/1 ?
  87. 87. Novel tools/formats for data interoperability/handling: ISA Importance of metadata: context (& discoverability)
  88. 88. Importance of granularity Where do you set it? Experiment (e.g. International Cancer Genome Consortium): e.g. doi:10.5524/100001 Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2 Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz Smaller still? Papers, Data/Micropubs, Nanopubs, Facts/Assertions (~10^13 in literature)
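
Whatever granularity is chosen, each identifier in the examples above is an ordinary DOI that resolves through the same proxy. A minimal sketch, assuming a hypothetical `parse_doi` helper, that splits a DOI into its registrant prefix and item suffix and builds the `https://doi.org/` resolver URL (no network request is made):

```python
from typing import NamedTuple


class Doi(NamedTuple):
    prefix: str  # registrant part, e.g. "10.5524"
    suffix: str  # item part; its internal structure is up to the registrant

    @property
    def url(self) -> str:
        """Resolver URL for this DOI."""
        return f"https://doi.org/{self.prefix}/{self.suffix}"


def parse_doi(text: str) -> Doi:
    """Accept 'doi:10.x/...' or '10.x/...' and split at the first slash."""
    value = text.strip()
    if value.lower().startswith("doi:"):
        value = value[4:]
    prefix, sep, suffix = value.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {text!r}")
    return Doi(prefix, suffix)


# DOI strings copied from the slide's examples
experiment = parse_doi("doi:10.5524/100001")
sample = parse_doi("doi:10.5524/100001_xyz")
```

The suffix is opaque to the DOI system itself, so a scheme such as appending `-2` or `_xyz` for finer-grained items is a convention the publisher has to define and document.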
  89. 89. Importance of granularity http://www.nature.com/ng/journal/v43/n4/full/ng.785.html
  90. 90. Importance of granularity http://www.nature.com/ng/journal/v43/n4/full/ng.785.html
  91. 91. Nanopublication http://nanopub.org/ A nanopub represents structured data along with its provenance in a single publishable & citable entity. Assertion: an individual association between concepts (statement or declaration; measurement; hypothetical inference; quantitative or qualitative). Provenance: how this assertion came to be, methods, evidence, context, etc. PublicationInfo: detailed attribution for authors, institutions, lab technicians, curators; license info; publication date. URL: unique, persistent and resolvable identifier. Integrity Key: guarantee immutability after publication. Terms shown in the diagram: assertion opm:wasDerivedFrom, opm:wasGeneratedBy; this nanopub dcterms:created, pav:authoredBy; association a sio:statisticalAssociation, sio:has-measurementValue Association_1_p_value; Association_1_p_value a sio:probability-value, sio:has-value 6.56e-5^^xsd:float, sio:refers-to, dcterms:DOI …
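
The three-part structure and the integrity key can be illustrated with a small sketch. This is a simplification under stated assumptions: real nanopublications are RDF graphs, whereas the `Nanopub` class below uses plain dictionaries with invented field names, and the SHA-256 digest merely stands in for whatever integrity mechanism a real implementation uses. Only the p-value comes from the slide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class Nanopub:
    """Simplified stand-in for a nanopublication (real ones are RDF graphs)."""

    assertion: Dict[str, str]
    provenance: Dict[str, str]
    publication_info: Dict[str, str]

    def integrity_key(self) -> str:
        """SHA-256 over a canonical JSON dump of all three parts."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# All values except the p-value are hypothetical placeholders
original = Nanopub(
    assertion={"type": "statistical association", "p_value": "6.56e-5"},
    provenance={"derived_from": "hypothetical-dataset-id"},
    publication_info={"authored_by": "hypothetical author", "license": "CC0"},
)

# Changing any part of the content changes the key
tampered = Nanopub(
    assertion={**original.assertion, "p_value": "6.56e-3"},
    provenance=original.provenance,
    publication_info=original.publication_info,
)
```

Comparing `original.integrity_key()` with `tampered.integrity_key()` shows the idea behind "guarantee immutability after publication": any edit to the assertion, provenance or publication info yields a different key.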
  92. 92. Lots of models/standards/guidelines Where does that leave us? ? 5★ open data
  93. 93. Lots of models/standards/guidelines Where does that leave us? A mnemonic to remember: FAIR 一个帮助记忆的词语:FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ Findable 可发现的 Accessible 可得到的 Interoperable 能共同使用的 Reusable 可以再度使用的
  94. 94. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/
  95. 95. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Findable: F1. (meta)data are assigned a globally unique and persistent identifier F2. data are described with rich metadata (defined by R1 below) F3. metadata clearly and explicitly include the identifier of the data it describes F4. (meta)data are registered or indexed in a searchable resource
  96. 96. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Accessible: A1. (meta)data are retrievable by their identifier using a standardized communications protocol A1.1 the protocol is open, free, and universally implementable A1.2 the protocol allows for an authentication and authorization procedure, where necessary A2. metadata are accessible, even when the data are no longer available
  97. 97. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Interoperable: I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. I2. (meta)data use vocabularies that follow FAIR principles I3. (meta)data include qualified references to other (meta)data
  98. 98. A mnemonic to remember: FAIR http://www.nature.com/articles/sdata201618 http://www.datafairport.org/ To be Reusable: R1. (meta)data are richly described with a plurality of accurate and relevant attributes R1.1. (meta)data are released with a clear and accessible data usage license R1.2. (meta)data are associated with detailed provenance R1.3. (meta)data meet domain-relevant community standards
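
Some of the principles listed on the last four slides lend themselves to a simple first-pass check of a metadata record. The sketch below is an assumption-laden illustration, not an official FAIR assessment: the field names in `CHECKS`, the `missing_principles` helper and the sample record are all hypothetical, and the presence of a field is only weak evidence that a principle is met.

```python
from typing import Dict, List

# Hypothetical metadata field names mapped to the principle they help evidence
CHECKS = {
    "F1": "identifier",    # globally unique and persistent identifier
    "F4": "indexed_in",    # registered or indexed in a searchable resource
    "A1": "access_url",    # retrievable via a standardized protocol
    "R1.1": "license",     # clear and accessible data usage license
    "R1.2": "provenance",  # detailed provenance
}


def missing_principles(metadata: Dict[str, str]) -> List[str]:
    """Return the principles whose supporting field is absent or empty."""
    return [code for code, field in CHECKS.items() if not metadata.get(field)]


# Made-up record for illustration
record = {
    "identifier": "doi:10.1234/example",
    "access_url": "https://example.org/data",
    "license": "CC0",
}
gaps = missing_principles(record)
```

A check like this can flag obvious omissions before deposit, but principles such as I1 to I3 and R1.3 depend on the vocabularies and community standards used, which need human or domain-specific review.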
  99. 99. Beyond a mnemonic: FAIR ecosystems FAIRifier tool
  100. 100. Beyond a mnemonic: FAIR ecosystems • A particular class of FAIR Data System to provide support for data interoperability; • Supports publication, search and access to FAIR data. • Fosters an ecosystem of applications and services; • Federated architecture: different FAIRports (and other FAIR Data Systems) are interconnectable; • Supports citations of datasets and data items; • Provides metrics for data usage and citation; A ‘FAIRpoint or FAIRport’ can be any specific data instance following FAIR data principles. http://www.datafairport.org/
  101. 101. Beyond a mnemonic: FAIR ecosystems http://www.datafairport.org/ ?
  102. 102. Beyond a mnemonic: FAIR ecosystems https://www.fair-access.net.au/fair-statement “By 2020, Australian publicly funded researchers and research organisations will have in place policies, standards and practices to make publicly funded research outputs findable, accessible, interoperable and reusable.”
  103. 103. More FAIR mnemonics: “BYODs” DTL/ELIXIR-NL “Bring Your Own Data Party” GigaScience/BGI HK Metabolomics ISA-TAB-athon
  104. 104. FAIR Data in the wild Taking a microscope to the publication process
  105. 105. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612
  106. 106. How FAIR can we get? 如何获取FAIR? Data sets Analyses Open-Paper Open-Review DOI:10.1186/2047-217X-1-18 >50,000 accesses & 885 citations Open-Code 7 reviewers tested data in ftp server & named reports published DOI:10.5524/100044 Open-Pipelines Open-Workflows DOI:10.5524/100038 Open-Data 78GB CC0 data Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/ >40,000 downloads Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2
  107. 107. Can we reproduce results? SOAPdenovo2 S. aureus pipeline
  108. 108. The SOAPdenovo2 Case study Subject to and test with 3 models: Data Method/Experimental protocol Findings Types of resources in an RO ISA-TAB/ISA2OWL Nanopublication Wfdesc/ISA-TAB/ISA2OWL Models to describe each resource type
  109. 109. 1. While there are huge improvements to the quality of the resulting assemblies, other than the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1. 2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer 3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer. 4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
  110. 110. Lessons Learned 经验教训 • Most published research findings are false. Or at least have errors • With enough effort it is possible to push button(s) & recreate a result from a paper with current tools • Being FAIR can be COSTLY. How much are you willing to spend? Who will build FAIR infrastructure? • Much easier to make things FAIR before rather than after publication. BYODs useful intermediate here
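
The corrections above all concern ratios of assembly metrics such as N50. For readers unfamiliar with the metric, here is a minimal sketch of the standard N50 calculation: the length of the contig (or scaffold) at which the running total, counted from the longest downwards, first reaches half of the total assembly length. The `n50` function and the contig lengths are made up for illustration and are not the values from the SOAPdenovo2 paper.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length at which the cumulative sum (longest first) reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # not reached for non-empty positive input


# Hypothetical contig lengths for two assemblies of the same genome
assembly_a = [2, 3, 4, 5, 6, 7, 8, 9, 10]
assembly_b = [1] * 54

# Fold difference of the kind recomputed in the corrections above
fold = n50(assembly_a) / n50(assembly_b)
```

Publishing the underlying tables and a small script like this alongside a paper is what lets a reader recompute a claimed fold difference, which is how discrepancies of the kind listed on this slide come to light.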
  111. 111. http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html “The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: “is it FAIR”?”
  112. 112. http://content.iospress.com/articles/information-services-and-use/isu824
  113. 113. Levels of FAIRness: A-F of FAIR data http://content.iospress.com/articles/information-services-and-use/isu824 In class activity: How FAIR is this data? 1. Data from: Live poultry exposure and public response to influenza A(H7N9) in urban and rural China during two epidemic waves in 2013-2014 http://hub.hku.hk/cris/dataset/dataset93128 2. Supporting data for “Genomic analyses reveal FAM84B and the NOTCH pathway are associated with the progression of esophageal squamous cell carcinoma” http://dx.doi.org/10.5524/100181 3. Linked Drug-Drug Interactions (LIDDI) https://datahub.io/dataset/linked-drug-drug-interactions-liddi
  114. 114. Reflection: how fair is FAIR? Read the FAIR principles paper. Do you think they are applicable and feasible for HK? If it is feasible, what is needed to implement them? http://www.nature.com/articles/sdata201618
  115. 115. Any questions? Does anyone have BYO data for the curation/cleaning workshop?
  116. 116. Final Project • For the final project for this course, you can choose from 3 assignment options. • The assignment is due on the 15th May and it is worth 40% of your grade. • Time will be set aside for presenting on this during the final class on the 24th April: covering why you chose the option, what discipline/dataset/topic you are covering, and what work you've done so far (5 mins per student including any group feedback)
  117. 117. Final Project: Option 1 Write an Annotated Bibliography about data curation practices in an academic discipline of your choosing. • Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of “open data.” • Summarize data practices in your chosen discipline or topic. (5-7 sentences) • Find 7-10 sources that relate that discipline or topic to data creation, management, and/or curation. • Provide a citation for the source in APA style. • Write a short annotation that summarizes the content of the source. You may include quotes from the source sparingly, but the annotations should be mostly, if not entirely, in your own words. (3-5 sentences) • Explain the relevance of the source with relation to the data practices of your chosen discipline or topic. (1-2 sentences) • Find a few example public datasets to demonstrate the above points. Cite the data in the relevant places in the Bibliography according to the Data Citation Principles. • Refer to this guide for more information about annotated bibliographies: http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm. Your annotation should be in the “Descriptive” style.
  118. 118. Final Project: Option 2 Using a relevant dataset (this can either be from the literature curation exercise, a BYO dataset, or one given to you), write a report that includes a description of the dataset, a Data Management Plan, and a guidelines document for the researcher(s). • Describe the dataset, explaining the form of the data and the academic discipline in which it was created. This paragraph should provide context for the Data Management Plan and guidelines document. (3-5 sentences) • 1-2 page Data Management Plan following the guidelines from HKU or granting body such as NSF. • 1 page guidelines document that could be presented to the researcher(s) that provides guidelines for their data (extant and forthcoming): – Preservation – Appraisal – Documentation • For the DMP and the guidelines document, you can extrapolate from your dataset to imagine additional details about the research practices that created the dataset and will create more data in the future. • Look for suitable data repositories that can host this data (institutional, general purpose, or subject specific), and if there is one relevant then publish the data if you have permission, and correctly cite the data in the relevant places in your report. [disclaimer: only if you have permission]
  119. 119. Final Project: Option 3 Prepare a 30 minute data curation workshop that you could teach to researchers that would provide them the necessary details to understand why data curation is relevant to them and best practices they should follow. • Slide deck that introduces data curation for a researcher audience. (No more than 40 slides.) • Presenter outline that describes the important points for each slide. • Topics that might be addressed in your workshop: the value of data management, writing a data management plan, data repository options. You can assume your audience is researchers at HKU. • Make sure all of the content is copyright free, and share the final material openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient metadata to make it discoverable.
  120. 120. Looking ahead… • Submit 1 paragraph reflection on FAIR principles through moodle forum • Next class (22nd April) is hands-on curation workshop with Dr Chris Hunter – Bring laptops and any data you may have for a data cleaning exercise • Final project due 15th May – Need to present preliminary version on 26th April to get feedback before completion. Send me slides by the 25th April so I can get them ready for the class
