
Open Data HK: open science meets open data. A primer from Scott Edmunds



Talk by Scott Edmunds on open science meets open data at ODHK.MEET.11 on the 21st November 2013 at Delaney's Hong Kong.

Published in: Technology

  1. 1. Open science meets open data: a primer. Scott Edmunds @SCEdmunds @GigaScience
  2. 2. Can this be considered open data?
  3. 3. Does this qualify as open source?
  4. 4. What is Open (Science) Data? • Something very very very geeky • Free & open access to data about the world around us: o Searchable, findable o Machine-readable, app-makeable, Excel-usable o Without restrictions/limitations • This (examples)
  5. 5. About me: • Scott Edmunds • Molecular biology, sci editing & comms • Scientific journal & (big) data publishing • Reproducibility & open science Journal, data-platform and database for large-scale biological data
  6. 6. About me:
  7. 7. About my employer: • BGI, formerly Beijing Genomics Institute • Founded in 1999 (1% of HGP) • China’s 1st citizen-managed not-for-profit research institute, funded by commercial sequencing-as-a-service (BGI Tech) • Now the largest genomics organization in the world • HQ in Shenzhen, most data production in BGI HK (Tai Po)
  8. 8. Standing on the shoulders of giants
  9. 9. Open Data 1665? Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
  10. 10. OKFN: 8 types of open data
  11. 11. Panton Principles =
  12. 12. Science Data Volumes • Astrophysics: exabytes (Square Kilometer Array) • HE physics: 100s of petabytes (Large Hadron Collider) • Biology: petabytes (sequencing, mass spec, imaging)
  13. 13. The long tail of scientific data… • Esoteric formats, poorly structured • Tabular, often spreadsheet-based • Issues the open data community is well used to (data cleaning, scraping, etc.)
  14. 14. Open Data in Physics • 1961: CERN pre-prints shelf • 1991 to date: arXiv
  15. 15. Open Data in Biology • 1934: newsletter era • 1980: database era • 1987: online era • 2010s: “bioinformatics bingo” era
  16. 16. BGI HK Chamber O’Illumina’s The LHC of Biology? 20PB of storage
  17. 17. Open Data in Chemistry
  18. 18. Closed Data in Chemistry
  19. 19. Genomics: open-data success story?
  20. 20. Sharing/reproducibility helped by stability of: 1. Platforms (1st gen, 2nd gen) 2. Repositories 3. Standards
  21. 21. Genomics Data Sharing Policies… Bermuda Accords 1996/1997/1998: 1. Automatic release of sequence assemblies within 24 hours. 2. Immediate publication of finished annotated sequences. 3. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society. Fort Lauderdale Agreement, 2003: 1. Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production. 2. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria. Toronto International data release workshop, 2009: The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets – whether from proteomics, biobanking or metabolite research.
  22. 22. Sharing aids fields… Rice v wheat: consequences of publicly available genome data. [Chart comparing rice and wheat; vertical axis from 0 to 700]
  23. 23. Digitizing the world Can we make everything open data?
  24. 24. NO
  25. 25. The (non-) human centipede: first sequence NO
  27. 27. NO
  28. 28. What is open science? 5 flavours: Benedikt Fecher and Sascha Friesike:
  29. 29. Democratic:
  30. 30. Biggest Challenge: Closed Access WWW.RIGHTTORESEARCH.ORG
  31. 31. Biggest Challenge: Closed Access • A handful of closed-access STM publishers control the market • They force libraries to buy “bundles” • Revenue >$9B • Average cost per article >$5000 USD • Publishers retain copyright • They prevent data mining of content • They withhold information from the 99.9% who need it!
  32. 32. Biggest Challenge: Closed Access
  33. 33. Publishing: better than a gold mine See:
  34. 34. Increasing strain on library budgets. MIT library purchases v inflation, 1986-2006. [Chart: percentage change by year for Consumer Price Index, serial expenditures, number of serials purchased, book expenditures and number of books purchased; curves labelled “Journal expenditure” and “Inflation”]
  35. 35. Too expensive for Harvard…
  36. 36. The good news: the fightback has started…
  37. 37. The Solution: Open Access Budapest Open Access Initiative: “By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” • Maximizes reuse and access • Gives authors control over the integrity of their work and the right to be properly acknowledged and cited. • “Real” OA asks for no restrictions/limitations = CC-BY
  38. 38. Hong Kong: off the map Push the button!
  39. 39. Hong Kong: good with theses…
  40. 40. Hong Kong: still some work to go with OA …Singapore beats us
  41. 41. Pragmatic: Infrastructure:
  42. 42. Pragmatic/Infrastructure: Crowdsourcing, wisdom of the masses Wiki science: GeneWiki • 10,000 distinct gene pages. • 1.42 million words and 78MB data. • 50 million views & 15,000 edits per year. GitHub science: A hypothetical Git workflow for a scientific collaboration involving 3 authors. Karthik Ram:
  43. 43. Open Lab Notebooks
  44. 44. Our crowdsourcing example: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
  45. 45. Downstream consequences: 1. Citations (~180) 2. Therapeutics (primers, antimicrobials) 3. Platform comparisons 4. Example for faster & more open science. “Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the lawyers at his company, Pacific Biosciences, to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”
  46. 46. 1.3 The power of intelligently open data. The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastrointestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.
  47. 47. Pragmatic/Infrastructure: Open Innovation Challenges
  48. 48. Public:
  49. 49. Indie Science Biohacker spaces CoResearch labs Crowdfunding DIYbio Open hardware
  50. 50. Biggest crowdfunding successes
  51. 51. Utilizing students: iGEM iGEM:
  52. 52. The “People’s Parrot”: Puerto Rican Parrot Genome Project (Amazona vittata) • Rarest parrot, national bird of Puerto Rico • Community funded from artworks, fashion shows, beer brands, crowdfunding… • Genome annotated by community college students as part of their bioinformatics education • Paper and data published in GigaScience and GigaDB. Taras K Oleksyk, et al. (2012): A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14. Steven J. O’Brien (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13. Oleksyk et al. (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience.
  53. 53. Public: Citizen Science Galaxy Zoo: Zooniverse: 887,355 “Zooites” and counting
  54. 54. Public: Citizen Science 1987-1997
  55. 55. Easy to get started…
  56. 56. Public: Games with a Purpose
  57. 57.
  58. 58. OpenSciDev
  59. 59. OpenSciDev Questions asked: 1. What value framework is a prerequisite for open science? 2. How can open science support visibility and communication of science outside formal academic structures? 3. How can open science create education? 4. How can the economic and social value of open science be measured? Currently working on: • Writing working paper on these questions • Building networks across Africa, Asia, Latin America and the Caribbean. • Setting up call for funding for OpenSciDev projects ($2-3M)
  60. 60. To summarize: • Open data is more than just government data (although research data is mostly government funded too) • Need for OA advocates & policies in Hong Kong (role for ODHK?) • There is much the science community can still learn about open licensing • There is much the wider open data community can learn about community engagement from Citizen Science, GWAP, etc. • Asia (inc. HK) is behind the US/EU on many of these activities, but can we learn lessons from the success of iGEM and the “Jamboree” model? *…King+