2015 balti-and-bioinformatics


  1. Can we get scientists to share data through self-interest? C. Titus Brown, UC Davis, ctbrown@ucdavis.edu
  2. Thanks, Nick! This is an attempt to explain why I pitched this, http://ivory.idyll.org/blog/2014-moore-ddd-talk.html, and talk about what I’d like to do with the money.
  3. The way data commonly gets published: gather data → analyze data → write paper → publish paper and data.
  4. Many failure modes (gather data → analyze data → write paper → publish paper and data; the chain breaks): lack of expertise; lack of tools; lack of compute; bad experimental design.
  5. Many failure modes (the same chain, broken): the usual reasons.
  6. One failure mode in particular: gather data → analyze data → write paper → publish paper and data, where the analysis also depends on other data that is missing.
  7. One failure mode in particular: the analysis depends on other, missing data. Lots of biological data doesn’t make sense except in the light of other data. This is especially true in two of the fields I work in, environmental metagenomics and non-model mRNAseq.
  8. (For example: gene annotation by homology. Diagram buckets: cephalopod, mollusc, anything else, no similarity.)
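The annotation-by-homology idea above can be sketched in a few lines. This is a toy illustration, not the talk's actual pipeline: the best-hit table, taxon names, and E-value cutoff are all made up for the example.

```python
# Toy sketch of gene annotation by homology: classify each transcript
# into a bucket based on the taxon of its best database hit.

# Hypothetical best-hit table: transcript -> (taxon_of_best_hit, e_value),
# or None when the search returned no hit at all.
BEST_HITS = {
    "t1": ("cephalopod", 1e-40),
    "t2": ("mollusc", 1e-12),
    "t3": ("fungi", 1e-8),
    "t4": None,
}

def annotate(transcript, evalue_cutoff=1e-5):
    """Bucket a transcript as cephalopod, mollusc, anything else, or no similarity."""
    hit = BEST_HITS.get(transcript)
    if hit is None or hit[1] > evalue_cutoff:
        return "no similarity"
    taxon, _evalue = hit
    if taxon in ("cephalopod", "mollusc"):
        return taxon
    return "anything else"

assert annotate("t1") == "cephalopod"
assert annotate("t3") == "anything else"
assert annotate("t4") == "no similarity"
```

The point of the slide follows directly: transcript `t4` can only move out of the "no similarity" bucket if *other* data sets contribute new reference sequences.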
  10. Hmm. (Diagram: data publication ↔ data analysis, each shown twice.)
  11. I believe: there are many interesting and useful data sets immured behind lab walls by lack of:
     • Expertise
     • Tools
     • Compute
     • Well-designed experimental setup
     • Pre-analysis data publication culture in biology
     • Recognition that sometimes hypotheses just get in the way
     • Good editorial judgment
  13. (Side note) The existence of journals that will let you publish virtually anything should have really helped data availability! Sadly, many of them don’t enforce data publication rules.
  14. Data publications! The obvious solution: data pubs! (“Pre-publication data sharing.”) Make your data available so that others can cite it! GigaScience, Data Science, etc. …but we don’t yet reward this culturally in biology. (True story: no one cares, yet.) I’m actually uncertain myself about how much we should reward data and source code pubs, but we can talk later.
  15. Pre-publication data sharing? There is no obvious reason to make data available prior to publication of its analysis. There is no immediate reward for doing so, nor much systematized reward. (Citations and kudos feel good, but are cold comfort.) Worse, there are good reasons not to do so: if you make your data available, others can take advantage of it, but they don’t have to share their data with you in order to do so.
  16. This bears some similarity to the Prisoners’ Dilemma (http://www.acting-man.com/?p=34313). “Confession” here is not sharing your data. Note: I’m not a game theorist (but some of my best friends are).
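The dilemma can be made concrete with a toy payoff matrix. The numbers below are illustrative, not from the talk; they just encode the incentive structure the slide describes: each lab gains when the *other* lab shares, and withholding always looks better individually.

```python
# Toy payoff matrix for data sharing as a Prisoners' Dilemma.
# (my_choice, their_choice) -> (my_payoff, their_payoff); values are made up.
PAYOFFS = {
    ("share", "share"): (3, 3),        # both benefit from pooled data
    ("share", "withhold"): (0, 5),     # they exploit my data, give nothing back
    ("withhold", "share"): (5, 0),     # I exploit their data
    ("withhold", "withhold"): (1, 1),  # status quo: everyone analyzes alone
}

def best_response(their_choice):
    """Return my payoff-maximizing choice given the other lab's choice."""
    return max(("share", "withhold"),
               key=lambda mine: PAYOFFS[(mine, their_choice)][0])

# Withholding dominates under these payoffs...
assert best_response("share") == "withhold"
assert best_response("withhold") == "withhold"
# ...even though mutual sharing (3, 3) beats mutual withholding (1, 1).
```

The rest of the talk is about changing these payoffs: making sharing immediately useful so that "share" becomes the self-interested move.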
  17. So, how do we get academics to share their data? Two successful “systems” (send me more!): 1. oceanographic research; 2. biomedical research.
  18. 1. Research cruises are expensive! In oceanography, individual researchers cannot afford to set up a cruise, so they form scientific consortia. These consortia have data sharing and preprint sharing agreements. (I’m told it works pretty well?)
  19. 2. Some data makes more sense when you have more data. (Omberg et al., Nature Genetics, 2013.) Sage Bionetworks et al.:
     • Organize a consortium to generate data;
     • Standardize data generation;
     • Share via a common platform;
     • Store results, provenance, analysis descriptions, and source code;
     • Run a leaderboard for a subset of analyses;
     • Win!
  20. This “walled garden” model is interesting! “Compete” on analysis, not on data.
  21. Some notes:
     • The Sage model requires ~similar data in a common format;
     • A common analysis platform then becomes immediately useful;
     • Data is ~easily re-usable by participants;
     • Publication of data becomes straightforward;
     • Both models are centralized and coordinated.
  22. The $1.5m question(s):
     • Can we “port” this sharing model over to environmental metagenomics, non-model mRNAseq, and maybe even VetMed and agricultural research?
     • Can we use this model to drive useful pre-publication data sharing?
     • Can we take it from a coordinated and centralized model to a decentralized model?
  23. A slight digression: most data analysis models are based on centralizing data and then computing on it there. This has several failure points:
     • Political: expect lots of biomedical and environmental data to be restricted geopolitically.
     • Computation: in the limit of infinite data…
     • Bandwidth: in the limit of infinite data…
     • Funding: in the limit of infinite data…
  24. Proposal: a distributed graph database server. (Architecture diagram components: compute server (Galaxy? Arvados?); web interface + API; data/info; raw data sets; public servers; “walled garden” server; private server; graph query layer; upload/submit (NCBI, KBase); import (MG-RAST, SRA, EBI).)
  25. Graph queries across public & walled-garden data sets. See Lee, Alekseyenko, Brown, 2009, SciPy Proceedings: the ‘pygr’ project. (Diagram: raw sequence, assembled sequence, nitrite reductase, and ppaZ nodes connected by SIMILAR TO and ALSO CONTAINS edges.)
  26. Graph queries across public & walled-garden data sets:
     • “What data sets contain <this gene>?”
     • “Which reads match to <this gene>, but not in <conserved domain>?”
     • “Give me relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.”
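The first query above can be sketched against a tiny in-memory edge list. This is a conceptual sketch only: the node names, edge labels, and data are hypothetical, and a real system would run such queries against a distributed graph store rather than a Python list.

```python
# Minimal sketch of "What data sets contain <this gene>?" as a graph query.
# Edges are (source, label, target) triples; the tiny graph below is made up.
EDGES = [
    ("dataset_A", "CONTAINS", "read_1"),
    ("dataset_A", "CONTAINS", "read_2"),
    ("dataset_B", "CONTAINS", "read_3"),
    ("read_1", "SIMILAR_TO", "geneX"),
    ("read_3", "SIMILAR_TO", "geneX"),
]

def datasets_containing(gene):
    """Follow SIMILAR_TO edges back from the gene to reads, then
    CONTAINS edges back from those reads to their data sets."""
    reads = {src for src, label, tgt in EDGES
             if label == "SIMILAR_TO" and tgt == gene}
    return sorted({src for src, label, tgt in EDGES
                   if label == "CONTAINS" and tgt in reads})

assert datasets_containing("geneX") == ["dataset_A", "dataset_B"]
```

The other two queries are the same pattern with extra edge types (domain matches, abundance, metadata such as nitrogen exposure) added to the graph.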
  27. Thesis: if we can provide immediate returns for data sharing, researchers will do so, and do so immediately; not to do so would place them at a competitive disadvantage. (All the rest is gravy: open analysis system, reproducibility, standardized data format, etc.)
  28. Puzzle pieces. 1. Inexpensive and widely available cloud computing infrastructure? Yep. See Amazon, Google, Rackspace, etc.
  29. Puzzle pieces. 2. The ability to do many or most sequence analyses inexpensively in the cloud? Yep. This is one reason for khmer & khmer-protocols.
  30. Puzzle pieces. 3. Locations to persist indexed data sets for use in search & retrieval? figshare & Dryad (?)
  31. Puzzle pieces. 4. Distributed data mining approaches? Some literature, but I know little about it.
  32. In summary: how will we do this? I PLAN TO FAIL. A LOT. PUBLICLY. (ht @ethanwhite)
  33. In summary: how will we know if (or when) we’ve “won”? 1. When people use, extend, and remix our software and concepts without talking to us about it first (c.f. khmer!). 2. When the system becomes so useful that people go back and upload old data sets to it.
  34. In summary: the larger vision. Enable and incentivize sharing by providing immediate utility: frictionless sharing. Permissionless innovation for, e.g., new data mining approaches. Plan for poverty with federated infrastructure built on open & cloud. Solve people’s current problems, while remaining agile for the future.
  35. Thanks! References and pointers welcome! https://github.com/ged-lab/buoy (Note: there’s nothing there yet.)

Editor's Notes

  • Analyze data in cloud; import and export important; connect to other databases.
  • Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.
  • Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.
