2014 moore-ddd
  • Squeeze information out of data; speed up downstream analyses; make impossible possible.
  • Applicable to many basic sequence analysis problems: error removal, species sorting, and de novo sequence assembly.
  • Hard to tell how many people are using it because it’s freely available in several locations.
  • The point is to enable biology; volume and velocity of data from sequencers is blocking.
  • Doing computational science with good software engineering approaches is enabling; scientist + soft eng grad students are super capable.
  • 1000s of people want to do what we do, can’t collaborate with them all => open protocols. Forkable, citable, open, tested. This is your methods section for computational analysis.
  • Analyze data in cloud; import and export important; connect to other databases.
  • Set up infrastructure for distributed query; base on graph database concept of standing relationships between data sets.
  • Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.
  • Drive with pilot projects; train domain postdocs in computation; e.g. 20+ sites with multi-omic sampling, clearly the future but no way to analyze the data.
  • Passionate about training; necessary for advancement of field; also deeply self-interested because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly”)
  • Mention moore science fiction project
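The data-mining note above mentions starting with cross validation before moving to in-server implementations. A minimal, library-free sketch of the k-fold splitting that starting point involves (illustrative only, not project code):

```python
# Minimal k-fold cross-validation splitter: partition n samples into k
# roughly equal test folds; everything outside a fold is its train set.
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Early folds absorb the remainder, one extra sample each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop
```

Each sample lands in exactly one test fold, so a model scored across all k folds sees every sample as held-out data exactly once.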
  • Transcript

    • 1. Infrastructure for Data Intensive Biology “Better Science through Superior Software” C. Titus Brown
    • 2. Current research: Compressive algorithms for sequence analysis. [Diagram: raw data (~10-100 GB) → compression (~2 GB) → analysis → “information” (~1 GB each) → database & integration.] Can we enable and accelerate sequence-based inquiry by making all basic analysis easier and some analyses possible?
    • 3. Three super-awesome technologies… 1. Low-memory k-mer counting (Zhang et al., PLoS One, 2014) 2. Compressible assembly graphs (Pell et al., PNAS, 2012) 3. Streaming lossy compression of sequence data (Brown et al., arXiv, 2012)
    • 4. …implemented in one super-awesome software package… github.com/ged-lab/khmer/ BSD licensed. Openly developed using good practice. > 10 external contributors. Thousands of downloads/month. 50 citations in 3 years. We think > 1000 people are using it; have heard from dozens.
    • 5. …enabling super-awesome biology. 1. Assembling soil metagenomes Howe et al., PNAS, 2014 2. Understanding bone-eating worm symbionts Goffredi et al., ISME, 2014. 3. An ultra-deep look at the lamprey transcriptome (in preparation) 4. Understanding derived anural development in Molgulid ascidians (in preparation)
    • 6. Early on, lack of replicability in pubs slowed us down => Strategy: “level up” the field High quality & novel science, done openly, written up in reproducible and remixable papers, using IPython Notebook, and posted to preprint servers. Expression based clustering of 85 lamprey tissue samples (de novo assembly of 3 billion reads) ~ 1 month Camille Scott
    • 7. Open protocols for the cloud: ~$100/analysis Read cleaning Preprocessing Assembly Annotation khmer-protocols.readthedocs.org/ Transcriptome and metagenome assembly protocols
    • 8. The data challenge in biology In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations. Moreover, most data is unavailable until after publication… …which, in practice, means it will be lost.
    • 9. Proposal: distributed graph database server. [Diagram, repeated as a build across slides 9-12: raw data sets and data/info behind a graph query layer spanning public servers, a "walled garden" server, and a private server; a compute server (Galaxy? Arvados?) with web interface + API; upload/submit to NCBI, KBase; import from MG-RAST, SRA, EBI.]
    • 13. Graph queries across public & walled-garden data sets. [Diagram: “SIMILARITY TO” and “ALSO CONTAINS” relationships linking raw sequence, assembled sequence, nitrite reductase, and ppaZ.] See Lee, Alekseyenko, Brown, paper in SciPy 2009: the “pygr” project.
    • 14. The larger vision Enable and incentivize sharing by providing immediate utility; frictionless sharing. Permissionless innovation for e.g. new data mining approaches. Plan for poverty with federated infrastructure built on open & cloud. Solve people’s current problems, while remaining agile for the future.
    • 15. Who needs this? Everyone. Environmental microbiology, evo devo, agriculture, VetMed...
    • 16. How would I start? 1-2 pilot projects w/domain postdocs: drive computational infrastructure with biology problems. Support postdocs with software engineer (infrastructure) and graduate student CS (research). Cross-train postdocs in data-intensive research methods and software engineering. Note: finding existing data is not a problem :) “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab. Via Elizabeth Kujawinski
    • 17. Education and training Biology is underprepared for data-intensive investigation. We must teach and train the next generations. ~5-10 workshops / year, novice -> masterclass; open materials. Deeply self-interested: What problems does everyone have, now? (Assembly) What problems do leading-edge researchers have? (Data integration)
    • 18. Pre-answered Questions Q: What will be open? A: Everything; I succeed & fail publicly. Q: How will you measure success? A: By other people using & extending our “products” without talking to us. Blog: ivory.idyll.org/blog/ - search for “moore”, “satire” @ctitusbrown
    • 19. Graph queries across public & walled-garden data sets: “What data sets contain <this gene>?” “Which reads match to <this gene>, but not in <conserved domain>?” “Give me relative abundance of <gene X> across all data sets, grouped by nitrogen exposure.”
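Slide 3’s low-memory k-mer counting rests on small, fixed-size probabilistic counting tables. A toy Python sketch of that count-min idea (the hash derivation and table sizes here are illustrative stand-ins, not khmer’s actual implementation):

```python
# Toy count-min-style k-mer counter: a few fixed-size byte tables,
# one hash per table; the minimum across tables estimates the count.
import hashlib

class CountMinKmerCounter:
    def __init__(self, k, table_sizes=(999983, 999979, 999961)):
        self.k = k
        self.sizes = table_sizes
        # One saturating 8-bit counter per slot, per table.
        self.tables = [bytearray(size) for size in table_sizes]

    def _hashes(self, kmer):
        # Derive one slot per table from a single digest (toy approach).
        digest = hashlib.sha1(kmer.encode()).digest()
        for i, size in enumerate(self.sizes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield i, chunk % size

    def add(self, kmer):
        for i, h in self._hashes(kmer):
            if self.tables[i][h] < 255:     # saturate instead of wrapping
                self.tables[i][h] += 1

    def count(self, kmer):
        # Count-min estimate: never undercounts the true count.
        return min(self.tables[i][h] for i, h in self._hashes(kmer))

    def consume(self, sequence):
        for j in range(len(sequence) - self.k + 1):
            self.add(sequence[j:j + self.k])
```

Memory use is fixed by the table sizes regardless of how many distinct k-mers stream through, which is the point of the approach: counts may be overestimated on hash collisions, but never underestimated.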
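Slide 3’s third technology, streaming lossy compression of sequence data (often called digital normalization), can be sketched in a few lines: keep a read only while its estimated coverage is still below a cutoff. This toy version uses an exact dict rather than khmer’s probabilistic tables, and the `k` and `cutoff` values are illustrative:

```python
# Toy digital normalization: discard reads whose k-mers are already
# well covered by previously kept reads (single streaming pass).
from collections import Counter

def diginorm(reads, k=4, cutoff=5):
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # The median k-mer count estimates this read's coverage so far.
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if median < cutoff:
            kept.append(read)
            counts.update(kmers)   # only kept reads add to the counts
    return kept
```

Because redundant reads are dropped as they stream past, downstream steps (e.g. assembly) see a much smaller data set while novel sequence is retained — lossy compression in service of the analysis.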
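The graph queries on slides 13 and 19 amount to standing, labeled relationships between data sets and annotations. A hypothetical in-memory sketch of that idea — the node names, relation labels, and API below are invented for illustration and are not a real khmer or pygr interface:

```python
# Hypothetical graph store: (subject, relation) -> set of objects,
# supporting queries like "what data sets CONTAIN this gene?"
from collections import defaultdict

class GraphDB:
    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self.edges[(subject, relation)].add(obj)

    def query(self, subject, relation):
        return self.edges[(subject, relation)]

    def datasets_containing(self, gene):
        # Scan standing CONTAINS edges across all registered data sets.
        return {s for (s, rel), objs in self.edges.items()
                if rel == "CONTAINS" and gene in objs}

db = GraphDB()
db.add("soil_metagenome_1", "CONTAINS", "nitrite reductase")
db.add("lamprey_rnaseq", "CONTAINS", "ppaZ")
db.add("contig_42", "SIMILARITY_TO", "nitrite reductase")
```

In the federated design the slides propose, the same relation-labeled queries would fan out across public, walled-garden, and private servers rather than one in-memory dict.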
