HKU Data Curation MLIM7350 Class 9


Chris Hunter's slides from Class 9 of the HKU Data Curation course (MLIM7350), giving a biocurator's perspective of data curation.

Published in: Technology

  1. 1. Data Curation: A BioCurator’s perspective. Chris Hunter 21 April 2017
  2. 2. Session structure • Introductions: – A bit about me, a bit about you, Housekeeping, What is GigaDB • (Meta)Data Handling – Curation, BioCuration, Sharing data • BioCuration Life Cycle and tools – Dictionaries, CVs, spreadsheets, standards and checklists • OpenRefine practical
  4. 4. Communicating in-class • Chat channel: • Feel free to ask questions, requests to speed up/slow down • The example files & slides available here: Also feel free to email:
  5. 5. This is me • LinkedIn: • ORCID ID:
  6. 6. My background • 95-99: Applied Biology Degree (Nottingham, UK) • 99-03: Genetics/Genomics PhD (Cambridge, UK) • 03-04: Postdoc – function of small DNA motifs • 04-07: Postdoc – Cancer Genome Project • 07-09: EBI – Curator for SRA • 09-12: EBI – Bioinformatician/Curator on Metagenomics portal • 13-present: GigaScience Database – Lead BioCurator
  7. 7. Why tell you about me? • An indication of what qualifies me to be teaching you about curation! • The sort of person that you might meet in the role of BioCurator • To show that you don’t need to know your end goal to make a career, just make the most of opportunities.
  8. 8. Who are you? • I would like to take a few minutes to hear from each of you (~30 secs each) • Name • Background • Scientific/academic interests • Any idea what’s next for your career?
  9. 9. Questions?
  10. 10. WHAT IS GIGADB?
  11. 11. GigaScience journal • GigaScience is an OPEN access publisher of Life Science articles • Highly reproducible articles • Focus on Big data • Peer reviewed for reliability • Provide open access free to all • Run as a not-for-profit to best benefit researchers
  12. 12. What makes us different? • GigaDB
  13. 13. What is GigaDB? • Open access database • Data organized into datasets • Datasets associated to GigaScience articles • Manually curated • Indexed and searchable metadata enabling discoverability and reuse.
  14. 14. • Currently >300 datasets available • Genomic datasets represent the majority of data (~55%) • ~75% of all data from BGI (or collaborators) • ~20 different data types represented • All manually curated
  15. 15. Data types • Nucleotide: – Genomic, Transcriptomic, Metagenomic • Mass spectrometry: – Proteomics, Metabolomics, MS-Imaging • Software & Workflows • Other – Imaging, Neuroscience, Network analysis
  17. 17. Anatomy of a GigaDB entry • All relevant information is held together in packets called Datasets • Each dataset has a stable DOI page • If required there can be a hierarchy of datasets
  18. 18. • Title • Study type(s) • Image • Citation • Description • Funders • Links to Google scholar and EuroPMC to see who has cited this dataset • Email submitter • Link to manuscript • Links to external resources Cont.
  19. 19. • Samples used in the study • Files listed as part of the study • History of dataset changes • Social media links • Links to other datasets of similar nature
  20. 20. Downloading the data FTP • Conventional/easy to use • Can pull individually from web page • 1 or multiple files using the Unix command line • Speed = up to 1 Mb/sec
  21. 21. Questions?
  22. 22. Session structure • Introductions: – A bit about me, a bit about you, Housekeeping, What is GigaDB • (Meta)Data Handling – Curation, BioCuration, Sharing data • BioCuration Life Cycle and tools – Dictionaries, CVs, spreadsheets, standards and checklists • OpenRefine practical
  24. 24. What is data? • “Data may exist only in the eye of the beholder: The recognition that an observation, artifact, or record constitutes data is itself a scholarly act.” (Borgman, 2012) Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
  25. 25. What is data? • We use the term “data” to be broadly inclusive. It includes – digital manifestations of literature – laboratory data: including spectrographic, genomic sequencing, and electron microscopy data – observational data: remote sensing, geospatial, and socioeconomic data – other forms of data either generated or compiled, by humans or machines: software, scripts, intermediary data, tabular data used to generate charts
  26. 26. How data is created • Gathered or produced by researchers – Observations, experiments, or models – Survey results – Records (census, economic, etc.) – Digitized/born digital text and images
  27. 27. So what is metadata? • Data ABOUT data • A set of data that describes and gives information about other data. – • It’s not a new concept, think about old catalog cards WikiData: Tomwsulcer
  28. 28. Curate the data • To classify and catalog data • Metadata is the classification and cataloging of data to aid discoverability and reuse. • Strongly reliant on controlled vocabularies and ontological terms
  29. 29. Data curation is… • “the active and ongoing management of data throughout its entire lifecycle of interest and usefulness to scholarship” Cragin et al., 2007 • I would also add: “the cataloging of data to increase its usefulness”
  30. 30. Data curation… • Is a dynamic process – Not a one time, or one step activity • Happens in a lifecycle – Creation, management, preservation • Aims to maintain the utility of the data
  31. 31. What gets curated? • Data – At various stages • Methods (sometimes) – Algorithms, code • Metadata – Information about the data • Links – metadata can form networks of linked data to help knowledge acquisition
  32. 32. Data curation or BioCuration? • Distinct, but related • Data curation is broader • BioCuration is more specific to Biological data curation – “Biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets. ”
  33. 33. BioCuration • The process of curating biological data • International Society of BioCuration (ISB) – Yearly meetings – Society website ( – Discussion forum – Job adverts
  34. 34. BioCuration2018 - Shanghai
  35. 35. SHARING DATA
  36. 36. Why share data? • Concepts related to the scientific method • Reproducibility: – Experiment can be replicated by the original researcher or another researcher • Reliability: – Similar results can be achieved in other experiments • Re-use – Others can make use of data in other ways than originally intended
  37. 37. What’s important? • An attractive, tabular layout in a spreadsheet for presentational purposes? • An accessible version that is suitable for re-use with minimal editing? • Both of the above? – Consider releasing multiple formats of your data
  38. 38. Manuscripts • The traditional publication is a “presentational” version of the data, – often lurking in supplemental files as PDFs
  39. 39. Data Journals • Publication option for datasets – Often discipline-specific – Can be peer-reviewed • Sometimes provide a means of usable data release, or sometimes just an independently citable version of supplemental files.
  40. 40. Data Repositories • Where data is stored for the long term • Computer accessible • Some repositories are discipline-specific – Genomic data: GenBank / ENA • Some repositories are built for an organization – For a university / institute – For a funder – Not-for-profit (Dryad, Figshare, GigaDB, Zenodo)
  41. 41. FYI: GigaScience is… • Combination of – Peer reviewed Manuscript publication linked to a – Manually curated Data repository
  42. 42. Session structure • Introductions: – A bit about me, a bit about you, Housekeeping, What is GigaDB • (Meta)Data Handling – Curation, BioCuration, Sharing data • BioCuration Life Cycle and tools – Dictionaries, CVs, spreadsheets, standards and checklists • OpenRefine practical
  44. 44. (Primary) BioCuration activities • Documentation – Keeping track of how the data was: • Generated; used; analyzed • Annotation – Addition of structured information to accompany data/files • Connection – Linking of files/data to related items both within dataset and to external items
  45. 45. (Ancillary) BioCuration activities • Collection and aggregation – Files in directories; databases • Storage and archiving – Saving data (on digital media) – Providing consistent and permanent identifiers (DOI) • Migration – Active preservation of data to keep it readable • Repeat the process on an ongoing basis
  46. 46. The BioCurator’s tools • Ontologies / CVs / Dictionaries • key:value pairs, RDF/triplestores
  47. 47. Dictionary • An alphabetical reference list of terms or names important to a particular subject or activity along with discussion of their meanings and applications • Casrai – Particularly IRIDIUM (Research Data Management): • – Many other dictionaries maintained by Casrai
  48. 48. Controlled Vocabularies • A controlled vocabulary is an organized arrangement of words and phrases used to index content • Can be a subset of a dictionary
  49. 49. Key:value pairs • A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. • Structured pairing of particular terms • One or both can be from CVs • Particularly used for computer readable metadata
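A minimal sketch of the idea, in Python: a sample is described by key:value pairs, and one value is checked against a controlled vocabulary. The vocabulary, the sample record and the function name `check_against_cv` are all invented for illustration and are not taken from GigaDB.

```python
# Hypothetical controlled vocabulary for the "sex" attribute (illustrative only).
SEX_CV = {"male", "female", "hermaphrodite", "not determined"}

# A sample described as key:value pairs; keys and values are made up.
sample = {
    "sample_id": "S001",
    "sex": "female",
    "collection_date": "2006-05-17",
}


def check_against_cv(record, key, vocabulary):
    """Return True if record[key] is present and drawn from the vocabulary."""
    return record.get(key) in vocabulary


print(check_against_cv(sample, "sex", SEX_CV))
```

A value such as "F" would fail the same check, which is the point of pairing keys with a CV: free-text variants are caught at entry rather than during later analysis.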
  50. 50. Ontologies • A set of concepts and categories in a subject area or domain that shows their properties and the relations between them. • More complex than CVs: includes relationship information and inherited concepts • Most ontologies in common use in BioCuration are in fact hierarchical CVs • Much work is being done to integrate, merge and unify many of these into a true ontology which will enable semantic web applications.
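To illustrate what "relationship information and inherited concepts" adds over a flat CV, here is a small sketch of a hierarchical vocabulary with a single is_a relation. The terms and the helper names are hypothetical and do not come from any published ontology.

```python
# Hypothetical is_a hierarchy: each term maps to its parent (None marks the root).
IS_A = {
    "tissue": None,
    "blood": "tissue",
    "leaf": "tissue",
    "whole blood": "blood",
    "plasma": "blood",
}


def ancestors(term, is_a):
    """Walk up the hierarchy and return all ancestors, nearest first."""
    chain = []
    parent = is_a[term]
    while parent is not None:
        chain.append(parent)
        parent = is_a[parent]
    return chain


def is_kind_of(term, candidate, is_a):
    """True if term is the candidate itself or inherits from it."""
    return term == candidate or candidate in ancestors(term, is_a)


print(ancestors("plasma", IS_A))
print(is_kind_of("plasma", "tissue", IS_A))
```

With the hierarchy in place, a search for "tissue" can also return samples annotated with the narrower term "plasma", which a flat list of terms cannot do.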
  51. 51. RDF (Resource Description Framework) • A model for encoding semantic relationships between items of data so that these relationships can be interpreted computationally. • A complete extrapolation of all ontologies to include all CVs with dictionary definitions and links to all related terms • Entirely computer readable using URIs
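RDF statements are subject-predicate-object triples identified by URIs. The sketch below models triples as plain Python tuples with a wildcard query, as a stand-in for a real triplestore. The example.org URIs and predicate names are placeholders invented here; the dataset numbers echo the relationships given later in the DataCite practical.

```python
# Each triple is (subject, predicate, object). The example.org URIs are
# placeholders invented for this sketch, not real identifiers.
EX = "http://example.org/"
triples = [
    (EX + "dataset/100038", EX + "vocab/isNewVersionOf", EX + "dataset/100015"),
    (EX + "dataset/100038", EX + "vocab/isCompiledBy", EX + "dataset/100044"),
    (EX + "dataset/101041", EX + "vocab/continues", EX + "dataset/101000"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything asserted about dataset 100038.
for triple in match(triples, s=EX + "dataset/100038"):
    print(triple)
```

Because every part of a statement is a URI, independent sources that use the same URIs can be merged by simply concatenating their triple lists.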
  52. 52. Questions? Reminder for Chris: It’s probably about time for a break!
  53. 53. The BioCurator’s tools (2) • Ontologies / CVs / Dictionaries • key:value pairs, RDF/triplestores • Tools for handling metadata (Excel, CSV, OpenRefine)
  54. 54. What’s good about spreadsheets? • Most people are familiar with them • No programming skills required • Can be used to make data look pretty (highlighting, different fonts, etc.) • Are forgiving of non-data cells (e.g. comments)
  55. 55. What’s bad about spreadsheets? • They allow merging of cells & other odd formatting to appeal to the eye. • Dates (reformatted) • Spreadsheet programs are not appropriate for analysis/statistics. • Incompatible (native) file formats with command line software such as R • Size limitations (requires a lot of RAM to open files with millions of rows)
  56. 56. • Most people still use spreadsheets to organize their own data • Good practices with data collection can aid downstream processes
  57. 57. Using spreadsheets wisely • Useful reference – Be consistent – Write dates as YYYY-MM-DD – Fill in all of the cells – Put just one thing in a cell – Create a data dictionary (like a CV) – No calculations in the raw data files – Don’t use font colour or highlighting as data – Choose good names for things – Make backups – Save the data in plain text files
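Two of the rules above ("Write dates as YYYY-MM-DD" and "Fill in all of the cells") are easy to check automatically once the data is saved as plain text. The following is a rough sketch using only the Python standard library; the column names and the example rows are made up.

```python
import csv
import io
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_rows(csv_text, date_column):
    """Return a list of (row_number, problem) tuples for a CSV string."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for column, value in row.items():
            if column is None:
                # DictReader files surplus cells under the key None.
                problems.append((row_number, "more cells than headers"))
                continue
            if value is None or value.strip() == "":
                problems.append((row_number, f"empty cell in {column}"))
        date_value = (row.get(date_column) or "").strip()
        if date_value and not ISO_DATE.fullmatch(date_value):
            problems.append((row_number, f"date not YYYY-MM-DD: {date_value}"))
    return problems


# Made-up example: the second data row has a non-ISO date and a missing age.
EXAMPLE = "sample_id,collection_date,age\nS1,2006-05-17,12\nS2,17/05/2006,\n"
for problem in check_rows(EXAMPLE, "collection_date"):
    print(problem)
```

The pattern only checks the shape of the date, not whether it is a real calendar date; `datetime.date.fromisoformat` could be used for a stricter check.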
  59. 59. Hands-on part 1 (Excel) • First of three quick practical examples of BioCuration – Using Excel wisely – Exploring the DataCite XML schema – Rationalising data using OpenRefine
  60. 60. Excel • Keep in mind: • Using this file as a starting point: • ntations/MLIM.dir/sample_attribute_spreadsheet-example.csv • It contains 10,000 rows of the GigaDB sample attributes table
  61. 61. Questions • Are the dates affected by being manipulated via Excel? • Do the ages all have units? • What has happened with some of the text in the first few rows?! • Are all latitude and longitude values consistent and appropriate?
  62. 62. Answers • Some dates appear as serial dates (i.e. the number of days after (or before) 1900-Jan-01), e.g. 37074 = 2001-Jul-02 • Null dates have been converted to 0 or 1900-Jan-00 • Only 403 / 928 age values have units • The hyphen has been converted to – which is UTF8 code: – • Only 2 lat-long values in this subset, and they are in different formats: “29.097221 -83.067351” and “44.000306N, 16.01625E”
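Two of these problems can be repaired in code. The sketch below converts an Excel serial date back to YYYY-MM-DD and parses both lat-long styles into signed decimal degrees. It assumes Excel's default 1900 date system; the function names are invented for this example.

```python
import re
from datetime import date, timedelta

# In Excel's 1900 date system serial 1 is 1900-01-01, and Excel also counts a
# non-existent 1900-02-29. Using 1899-12-30 as the origin therefore gives the
# correct date for serials from 61 (1900-03-01) onwards.
EXCEL_ORIGIN = date(1899, 12, 30)


def excel_serial_to_iso(serial):
    """Convert an Excel serial day number to an ISO 8601 date string."""
    if serial < 61:
        raise ValueError("serials below 61 are not handled by this sketch")
    return (EXCEL_ORIGIN + timedelta(days=serial)).isoformat()


def parse_lat_long(text):
    """Parse a latitude/longitude pair in signed or N/S/E/W-suffixed form."""
    parts = re.findall(r"([+-]?\d+(?:\.\d+)?)\s*([NSEW]?)", text.upper())
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates in {text!r}")
    values = []
    for number, hemisphere in parts:
        value = float(number)
        if hemisphere in ("S", "W"):
            value = -value  # southern and western hemispheres are negative
        values.append(value)
    return tuple(values)


print(excel_serial_to_iso(37074))
print(parse_lat_long("29.097221 -83.067351"))
print(parse_lat_long("44.000306N, 16.01625E"))
```

The date conversion cannot recover a value that Excel replaced with 0, so null dates still need to be handled separately.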
  63. 63. The BioCurator’s tools (3) • Ontologies / CVs / Dictionaries • key:value pairs, RDF/triplestores • Tools for handling metadata (Excel, CSV, OpenRefine) • Database (SQL/MySQL etc.) • Structured computational formats (XML, JSON) • Standards
  65. 65. Standards • Examples: – Dublin core – GSC • Resources: – • Results of the use of standards: –
  66. 66. Dublin Core • “The Dublin Core metadata standard is a simple yet effective element set for describing a wide range of networked resources.” Contributor Coverage Creator Date Description Format Identifier Language Publisher Relation Rights Source Subject Title Type
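As a rough illustration of how such a small element set can be used, the sketch below holds the fifteen elements listed above in a Python set and flags any key in a record that is not one of them. The record reuses values from the DataCite example shown later in the slides; mapping them onto Dublin Core elements this way is an assumption made for the example.

```python
# The fifteen Dublin Core elements, as listed on the slide.
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description", "format",
    "identifier", "language", "publisher", "relation", "rights", "source",
    "subject", "title", "type",
}

# Illustrative record; "keywords" is deliberately not a Dublin Core element.
record = {
    "title": "Critical Engineering Literacy Test (CELT)",
    "creator": ["Fosmire, Michael", "Wertz, Ruth", "Purzer, Senay"],
    "publisher": "Purdue University Research Repository (PURR)",
    "date": "2013",
    "type": "Dataset",
    "keywords": ["Assessment"],
}


def unknown_elements(record):
    """Return the keys of the record that are not Dublin Core elements."""
    return sorted(key for key in record if key.lower() not in DUBLIN_CORE_ELEMENTS)


print(unknown_elements(record))
```

Here the stray "keywords" key would be reported, suggesting it should be renamed to the standard element "subject".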
  67. 67. Genomic Standards Consortium • Minimum Information about any (x) Sequence – “MIxS” * • Covers a variety of different “environmental packages” • Each recommends terms from a list of ~700 defined attributes • Each has ~10-20 mandatory attributes • MIxS is effectively a dictionary of attributes * Yilmaz, P et al. Nature Biotechnology 29, 415-420 (2011) doi:10.1038/nbt.1823
  68. 68. Example of MIxS compliant sample (Standards in Genomic Sciences 2016 11:91, DOI: 10.1186/s40793-016-0213-3) Attributes • Description: Actinoalloteichus hymeniacidonis DSM 45092, an actinomycete isolated from the marine sponge Hymeniacidon perleve • BioProject: PRJNA273752 • strain: HPA177(T) (=DSM 45092(T)) • host: Hymeniacidon perleve • isolation source: intertidal marine sponge from the beach of Dalian • collection date: 2006 • geographic location: China: beach of Dalian • sample type: pure culture • biomaterial provider: DSM 45092 • culture collection: DSM:45092 • environment biome: intertidal zone • host tissue sampled: washed sponge • latitude and longitude: 38.8667 N 121.6833 E • Publication
  69. 69. Effective standards and checklists • Make extensive use of CVs, Ontologies and KVPs • Uptake of new standards is usually slow and requires incentives for users
  70. 70. Application Programming Interface • While webpages are human-readable, machines require structured data • Application Programming Interface (API)
  71. 71. Schema design • In order for machines to understand data and its relationships they need to follow a set structure (schema). • GigaDB has a fairly complex structure as a relational database
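To give a feel for what a relational schema looks like, here is a deliberately tiny sketch using Python's built-in `sqlite3` module. It is not GigaDB's actual schema: the three tables, their columns and the inserted rows are invented for illustration, with only the `attribute_name` and `value` column names borrowed from the sample attributes table used in the practicals.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical three-table schema: a dataset has samples, a sample has
# attributes stored as key:value rows.
connection.executescript("""
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    doi TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    name TEXT NOT NULL
);
CREATE TABLE sample_attribute (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    attribute_name TEXT NOT NULL,
    value TEXT NOT NULL
);
""")

# Made-up rows for the example.
connection.execute(
    "INSERT INTO dataset (id, doi, title) VALUES (1, '10.5072/example', 'Example dataset')"
)
connection.execute("INSERT INTO sample (id, dataset_id, name) VALUES (1, 1, 'sample-1')")
connection.executemany(
    "INSERT INTO sample_attribute (sample_id, attribute_name, value) VALUES (?, ?, ?)",
    [(1, "sex", "female"), (1, "collection date", "2006")],
)

# Follow the relationships from dataset to sample to attribute.
rows = connection.execute("""
    SELECT d.doi, s.name, a.attribute_name, a.value
    FROM dataset AS d
    JOIN sample AS s ON s.dataset_id = d.id
    JOIN sample_attribute AS a ON a.sample_id = s.id
    ORDER BY a.attribute_name
""").fetchall()

for row in rows:
    print(row)
```

The foreign keys are what let a machine follow the links between records without any human interpretation.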
  72. 72. partially expressed in 785 lines of XSD schema for beta API
  73. 73. Schema design • In order for machines to understand data and its relationships they need to follow a set structure (schema). • GigaDB is complex • DataCite is less complicated, it’s stored in XML (the comprehensive XSD to describe it is ~500 lines)
  74. 74. DataCite • The XSD is available here: – 4.0/metadata.xsd • And described here: – 4.0/doc/DataCite-MetadataKernel_v4.0.pdf • Examples are provided –
  75. 75. A simple DataCite example
      <resource xmlns:xsi="" xmlns="" xsi:schemaLocation="">
        <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
        <creators>
          <creator> <creatorName>Fosmire, Michael</creatorName> </creator>
          <creator> <creatorName>Wertz, Ruth</creatorName> </creator>
          <creator> <creatorName>Purzer, Senay</creatorName> </creator>
        </creators>
        <titles>
          <title>Critical Engineering Literacy Test (CELT)</title>
        </titles>
        <publisher>Purdue University Research Repository (PURR)</publisher>
        <publicationYear>2013</publicationYear>
        <subjects>
          <subject>Assessment</subject>
          <subject>Information Literacy</subject>
          <subject>Engineering</subject>
          <subject>Undergraduate Students</subject>
          <subject>CELT</subject>
          <subject>Purdue University</subject>
        </subjects>
        <language>eng</language>
        <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
        <version>1</version>
        <descriptions>
          <description descriptionType="Abstract"> We developed an instrument, Critical Engineering Literacy Test (CELT), which is a multiple choice instrument designed to measure undergraduate students’ scientific and information literacy skills. It requires students to first read a technical memo and, based on the memo’s arguments, answer eight multiple choice and six open-ended response questions. We collected data from 143 first-year engineering students and conducted an item analysis. The KR-20 reliability of the instrument was .39. Item difficulties ranged between .17 to .83. The results indicate low reliability index but acceptable levels of item difficulties and item discrimination indices. Students were most challenged when answering items measuring scientific and mathematical literacy (i.e., identifying incorrect information). </description>
        </descriptions>
      </resource>
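Because the record is structured, a few lines of code can pull out the fields a citation needs. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the record. The namespace declarations are left out to keep the lookups simple, which is a simplification: a real record with a default namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example record, without namespace declarations.
RECORD = """<resource>
  <identifier identifierType="DOI">10.5072/D3P26Q35R-Test</identifier>
  <creators>
    <creator><creatorName>Fosmire, Michael</creatorName></creator>
    <creator><creatorName>Wertz, Ruth</creatorName></creator>
    <creator><creatorName>Purzer, Senay</creatorName></creator>
  </creators>
  <titles>
    <title>Critical Engineering Literacy Test (CELT)</title>
  </titles>
  <publisher>Purdue University Research Repository (PURR)</publisher>
  <publicationYear>2013</publicationYear>
</resource>"""


def summarise(xml_text):
    """Extract the core citation fields from a namespace-free record."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext("identifier"),
        "creators": [name.text for name in root.iter("creatorName")],
        "title": root.findtext("titles/title"),
        "publisher": root.findtext("publisher"),
        "year": root.findtext("publicationYear"),
    }


summary = summarise(RECORD)
for field, value in summary.items():
    print(field, "=", value)
```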
  77. 77. Hands-on part 2 (DataCite) • Looking at the DataCite schema – Description: • 4.0/doc/DataCite-MetadataKernel_v4.0.pdf • What relationships do these two DataCite records show?: • MLIM.dir/example_datacite_100038.xml • MLIM.dir/example_datacite_101041.xml
  78. 78. Answers • 100038.xml Is a New Version Of dataset doi:10.5524/100015 • 100038.xml Is Compiled By dataset doi:10.5524/100044 • 10.5524/101041 Continues dataset doi:10.5524/101000
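In DataCite XML these relationships are carried by `relatedIdentifier` elements with a `relationType` attribute. The example files themselves are not reproduced in these slides, so the fragment below is a reconstruction from the answers above, with namespaces omitted for simplicity; treat it as an assumption about what the 100038 record contains.

```python
import xml.etree.ElementTree as ET

# Reconstructed fragment (an assumption, not a copy of the course file).
FRAGMENT = """<resource>
  <identifier identifierType="DOI">10.5524/100038</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsNewVersionOf">10.5524/100015</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCompiledBy">10.5524/100044</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def relationships(xml_text):
    """Return (subject DOI, relation type, related DOI) for each relation."""
    root = ET.fromstring(xml_text)
    subject = root.findtext("identifier")
    return [
        (subject, related.get("relationType"), related.text)
        for related in root.iter("relatedIdentifier")
    ]


for relation in relationships(FRAGMENT):
    print(relation)
```

Each tuple reads as a sentence, for example "10.5524/100038 IsNewVersionOf 10.5524/100015", which is how the answers above were derived from the records.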
  79. 79. BioCuration Life Cycle Summary • As lead BioCurator for GigaDB, I am involved in the schema design and data capture of all types of life science data behind GigaScience publications. • We receive, appraise and ingest data into GigaDB • We preserve and store data • We provide access for re-use of data • All the while attempting to maintain consistency
  80. 80. BioCuration Life Cycle Summary Helping build knowledge from data
  81. 81. THE FINAL PART
  82. 82. OpenRefine • According to “OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another” • Very useful for Curators to enable exploration (and cleaning/curation) of vast tables of metadata
  83. 83. PRACTICAL EXAMPLE 3 Rationalizing data using OpenRefine
  84. 84. OpenRefine • Download: – • Install: (for Windows, just unzip it) • Run: open file “openrefine.exe” • Download example file: – ions/MLIM.dir/sample_attribute_spreadsheet-example.csv
  85. 85. Some things to try • Watch the 7 minute demo video: – WM • Common transformations – Cells to numbers – Remove trailing white space • Text Facet – Look for attribute name = “analyte” • Merge clusters – Text facet on “attribute_name”
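For readers who want to see what the two common transformations do, here is a plain-Python analogue. It is not how OpenRefine implements them, and the function names are invented for this sketch.

```python
def trim(value):
    """Remove leading and trailing whitespace from a cell."""
    return value.strip()


def to_number(value):
    """Return an int or float when the cell looks numeric, else the trimmed text."""
    text = value.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return text


cells = ["42 ", " 3.5", "12 years", "  n/a "]
print([to_number(cell) for cell in cells])
```

Cells such as "12 years" stay as text, which makes values with embedded units easy to spot after the conversion.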
  86. 86. Quick test • Can you find 5 problems in the “attribute_name” column? • Put some answers in the backchannel
  87. 87. There may be others! • Alternative name = alternative names • Height = Height or length = hight = high or length • Patient = patient ID • Pool details = pooling details • Specimen voucher = specimen_voucher • Tissue = tissue type • Life stage = life stageseed
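Some of these duplicates can be found automatically by reducing each name to a normalised key and grouping names that share a key. The sketch below is a simplified key-collision approach written for this example; it is inspired by, but not identical to, the clustering OpenRefine offers.

```python
import re
from collections import defaultdict


def key(name):
    """Lower-case, turn punctuation into spaces, then sort the unique words."""
    words = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(sorted(set(words)))


def cluster(names):
    """Group names whose keys collide; return only groups with 2+ members."""
    groups = defaultdict(set)
    for name in names:
        groups[key(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = [
    "Specimen voucher", "specimen_voucher", "Tissue", "tissue type",
    "Alternative name", "alternative names",
]
print(cluster(names))
```

Only the "Specimen voucher" pair collides here. Pairs such as "Tissue" and "tissue type" differ by a whole word, so they still need a curator's judgement or a fuzzier matching method.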
  88. 88. Looking at “value” field • Problem: there are >10,000 unique terms • Solution: first facet on attribute_name • E.g. attribute_name = sex – The number of different values is 21! Can that be refined? (I got down to 9)
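The same facet-then-refine step can be sketched in Python: count the distinct values for one attribute_name, then map known variants onto a preferred term. The rows and the synonym table below are made up and are not taken from the course file.

```python
from collections import Counter

# Made-up (attribute_name, value) pairs standing in for the spreadsheet rows.
rows = [
    ("sex", "male"), ("sex", "Male"), ("sex", "M"), ("sex", "female"),
    ("sex", "F"), ("sex", " female "), ("age", "12"),
]

# Hypothetical synonym table mapping cleaned variants to a preferred term.
SEX_SYNONYMS = {"m": "male", "male": "male", "f": "female", "female": "female"}


def facet(rows, attribute_name):
    """Count each distinct value recorded for one attribute name."""
    return Counter(value for name, value in rows if name == attribute_name)


def normalise_sex(value):
    """Trim and lower-case a value, then map it to the preferred term."""
    cleaned = value.strip().lower()
    return SEX_SYNONYMS.get(cleaned, cleaned)


before = facet(rows, "sex")
after = Counter(normalise_sex(value) for value in before.elements())
print(len(before), "distinct values before,", len(after), "after")
```

Values that are not in the synonym table pass through unchanged, so anything unexpected remains visible in the facet for manual review.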
  89. 89. WRAP-UP
  90. 90. Summary • I’m a BioCurator using a variety of experiences to help others publish data effectively • GigaScience is a unique publication combining the traditional manuscript with open access to underlying data via GigaDB • Biocuration is a broad field from fine details to high level metadata • The goal of curation is to enable discovery of knowledge • A variety of tools are available
  91. 91. Further reading / useful links • OpenRefine online tutorial • Excel / spreadsheet do’s and don’ts • GSC MIxS • Casrai – dictionary and standards • List of biological standards, checklists and databases
  92. 92. BioCuration2018 - Shanghai
  93. 93. Reflection: how fair is FAIR? Read the FAIR principles paper. Do you think they are applicable and feasible for HK? If it is feasible, what is needed to implement them? Reminder: Please comment in Moodle Forum. Scott will give feedback on Monday
  94. 94. Reminder: Final Project • For the final project for this course you need to choose from 3 assignment options (see Moodle). • The assignment is due on the 15th May and it is worth 40% of your grade. • Time will be set aside for presenting on this during the final class on the 24th April: covering why you chose the option, what discipline/dataset/topic you are covering, and what work you've done so far (5 mins per student including any group feedback). Scott needs your slides by Monday morning for the 5 min presentation.
  95. 95. Looking ahead… • Final project due 10th May – Need to present preliminary version on 26th April to get feedback before completion. Send Scott slides by the 25th April so he can get them ready for the class