Democratising Data Publishing: A Global
Perspective
Dr Chris Armit
Data Scientist, GigaScience
BGI-Hong Kong
Need for FAIR (high quality) Open Data
Enables
• Using networking power of the internet to tackle problems
• Can ask new questions & find hidden patterns & connections
• Build on each other's efforts more quickly & efficiently
• More collaborations across more disciplines
• Harness wisdom of the crowds: crowdsourcing, citizen science,
crowdfunding
Global Challenges
• Quick response to climate change, food security & disease outbreaks
• Cultural & technical hurdles need to be overcome
Global Challenges
https://www.nature.com/articles/d41586-018-07244-w
• How will Open Access (APC) models work here?
• Authors are unlikely to be able to afford article processing charges
http://www.nature.com/news/data-sharing-make-outbreak-research-open-access-1.16966
Cultural Hurdles in Publishing Research Data
Example: Disease outbreaks
• Genome sequences from the West Africa outbreak of Ebola were first made
publicly available in April 2014
• Datasets were released sporadically when this became a hot research topic
• This led to gaps in the data
Democratising Data at GigaScience
• GigaScience integrates and publishes all research objects to
maximise reproducibility, transparency and reuse
• GigaDB enables rapid publication of data associated with a
GigaScience manuscript
• GigaDB DOIs incentivise early release of data/code/etc.
• Data
• Software
• Models
• Pipelines
• Reviews
Democratising Data at GigaScience
Example: Disease outbreaks
• E. coli O104:H4 isolate TY-2482 in
Germany, >50 died, June 2011
• Crisis, mass panic, data needed
• BGI's work with Hamburg University
let us share the data under CC0 with our
first data DOI from GigaDB
• Released via Twitter
• Did not know the consequences of early release of data
• These data were considered of such great importance that we did not wish
to wait for publication
http://dx.doi.org/10.5524/100001
Democratising Data at GigaScience
• From Big Data to usable Data
• Example: WebTools for easy browsing and visualisation
• Pan-and-zoom map browser as a visual aid to allow the end user to
find datasets
Democratising Data at GigaScience
• From Big Data to usable Data
• Example: WebTools for easy browsing and visualisation
• 3D viewer allows users to interact with and explore image data prior to data
download
• 3D models are CC0, can be downloaded, and are printable
Democratising Data at GigaScience
• Widening the target audience
• Bioinformaticians and ‘Big Data’ scientists are a
primary target audience
• Plugins and visualisations make access easier for the
less technically inclined
• Democratises access through educational potential
and ease of use
Democratising Data at GigaScience
Difficulties we have encountered…
• Internet access: limited bandwidth, unstable connections,
US institutions occasionally blocking Chinese IP
addresses, China blocking Google/Dropbox links
• Copying 10GB of data from South Africa took >1 month
because of power cuts
• Email communication difficulties due to spam filters
• Data access agreements (clinical data)
Democratising Data at GigaScience
• Example: Food security
• Rice, Oryza sativa L., is the
staple food for half the
world’s population
• By 2030, rice production
must increase by at least
25% to keep pace with
population growth
Democratising Data at GigaScience
Rice 3K project
• 3,000 rice genomes
• 13.4TB public data
• 6 months to copy
data to Sequence
Read Archive (SRA)
• Data published 4
years before
analysis published
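The transfer times quoted on these slides can be turned into an average throughput. The following is a minimal sketch only: it assumes decimal units (1TB = 10^12 bytes), 30-day months, and that the whole elapsed time counts as transfer time, none of which the slides state. The function name `effective_mbit_per_s` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: decimal units, since the slides do not say TB vs TiB.
GB = 1e9
TB = 1e12


def effective_mbit_per_s(size_bytes: float, days: float) -> float:
    """Average throughput in megabits per second for a transfer of
    size_bytes that took `days` days from start to finish."""
    if days <= 0:
        raise ValueError("days must be positive")
    return size_bytes * 8 / (days * SECONDS_PER_DAY) / 1e6


# Rice 3K: 13.4TB copied to the SRA in 6 months (assumed 6 x 30 days).
rice_3k = effective_mbit_per_s(13.4 * TB, days=6 * 30)

# 10GB from South Africa in more than 1 month (assumed exactly 30 days,
# so this figure is an upper bound on the real average).
south_africa = effective_mbit_per_s(10 * GB, days=30)

print(f"Rice 3K to SRA: {rice_3k:.2f} Mbit/s")
print(f"10GB from South Africa: {south_africa:.3f} Mbit/s")
```

Under these assumptions the Rice 3K copy averaged roughly 6.9 Mbit/s and the South Africa copy at most about 0.03 Mbit/s, both far below what a single domestic broadband line can sustain.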
From Big Data to usable(ish) Data
• Although the 13TB of data in GigaDB was open (CC0), analysis on the
Tianhe supercomputer expanded the processed Rice 3K data to 100TB
• AWS hosts the data for free, but it is expensive to process
https://aws.amazon.com/public-data-sets/3000-rice-genome/
Processed data finally published 1st May 2018, Nature v557, p43–49
https://www.nature.com/articles/s41586-018-0063-9
Democratising Data at GigaScience
• Example: Food security
• The African Orphan Crops
Consortium (AOCC) is
developing genomic
resources for 101 crops that
represent a significant part
of African/Asian diets.
• To date, the AOCC is working
on 69 genomes, 5 of which
are published in GigaDB.
Hyacinth bean
• Stunting: Physical, Neurological, Economic
Growing Africa Out of Stunting, Hunger & Malnutrition:
The African Orphan Crops Consortium
• Provide genomic tools to accelerate breeding in 101 crops
important to African Diets
• Define genetic diversity in 100 lines/species
• Train 150 top African plant breeders to use the latest strategies
and technologies in plant breeding
African Orphan Crops Consortium (AOCC)
Courtesy: AOCC
African Orphan Crops Consortium (AOCC)
Democratising Data at GigaScience
• Each AOCC genome is a single GigaDB dataset (with DOI)
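Because each dataset carries its own DOI, a citation can be turned into a resolvable link mechanically. This is a rough sketch, not GigaDB's own tooling: the helper `doi_url` and its regular expression are illustrative assumptions, and the only DOI used is the one shown earlier in this deck (10.5524/100001).

```python
import re

# Assumption: a DOI is "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"(10\.\d{4,9}/\S+)")


def doi_url(text: str) -> str:
    """Extract a DOI from a bare DOI or a resolver URL and return it
    as an https://doi.org/ link."""
    match = DOI_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no DOI found in {text!r}")
    return "https://doi.org/" + match.group(1)


# The first GigaDB data DOI, as cited on the disease outbreak slide.
print(doi_url("http://dx.doi.org/10.5524/100001"))
```

The same call accepts the bare form `10.5524/100001`, so dataset links stay consistent however the DOI was written in the source.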
Democratising Data at GigaScience
• From Big Data to usable Data
• Example: Easy-to-use plug and play RiceGalaxy
• Processed data and software tools made freely available
• GUI means plant breeders can utilise genetic data without coding skills
• Funded to run at low cost (<100 USD/month) via AWS Singapore & local
servers (2 vCPUs, 8GB RAM, 2 mounted volumes, 200GB total storage)
• CGIAR Excellence in Plant Breeding Platform/model will roll out to other
crops
Courtesy: IRRI
Courtesy: AOCC
Acknowledgements
Laurie Goodman, Editor in Chief
Scott Edmunds, Executive Editor
Chris Hunter, GigaDB Lead BioCurator
Mary Ann Tuli, GigaDB Data Editor
Xiao (Jesse) Si Zhe, Database Developer
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Lead Data Manager
Chen Qi, Shenzhen Office.
@GigaScience
facebook.com/GigaScience
http://gigasciencejournal.com/blog/
www.gigasciencejournal.com
www.gigadb.org
+ Weibo & WeChat

Chris Armit at IDW2018: Democratising Data Publishing: A Global Perspective


Editor's Notes

  • #4 Up to 5,000-6,000 USD
  • #18 Quadrupled data in the public domain. Data publication 4 years before analysis published in Nature
  • #20 Quadrupled data in the public domain. Data publication 4 years before analysis published in Nature