BioMake Chris Mungall Berkeley Drosophila Genome Project [email_address]
build networks <ul><li>Many bioinformaticians spend large amounts of time coding and running  build/make networks </li></u...
Existing approaches <ul><li>Run tasks by hand, or with ad-hoc scripts </li></ul><ul><ul><li>doesn’t scale, leads to insani...
biomake: executable computational protocols <ul><li>A declarative language for specifying build networks </li></ul><ul><ul...
Example Genomic Sequence Analysis Pipeline <ul><li>Prepare and cut assembled sequence into slices </li></ul><ul><li>Downlo...
Target dependencies Assembled Genomic Sequence Chunked Sequence RepeatMasked Sequence Gene Predictions Blast Alignments Fa...
Specifying Targets Chunked Sequence Blast Alignments FastaDB (local) BlastIndexed FastaDB formatdb(D) run: formatdb -i  D ...
BioMake Execution TARGET: blast(blastx, my.seq, nr.fly, -) SUBTARGET: formatdb(nr.fly) RUN: formatdb -i nr -p T OUT: nr.fl...
Targets can be nested bop(S,B) run: apollo -bop -s  S B  -o  target flat:  B .game.xml  store(XML) run: xml2db  XML store(...
Iteration <ul><li>Pipelines frequently involve iterating over collections of data: </li></ul><ul><ul><li>Perform a sequenc...
Iterating over datasets splitfasta(F) run:  splitfasta.pl -d  F -split -md5  F flat:  F -split/ bn(F). pathlist comment:  ...
Controlling the runmode <ul><li>Tasks can be run locally or on a compute farm, synchronously or asynchronously </li></ul><...
runmode example blast(P,S,D,A) req:  formatdb(D) run: blastall -p  P  -i  S  -d  D A flat: S-blast/ bn(S).D . P .out runmo...
Datastores <ul><li>BioMake persists targets in Datastores </li></ul><ul><li>The  flat:  tag flattens targets to unique dat...
Asynchronous Execution local machine running biomake scheduler node NFS For each target T to be built: 1) biomake fetches ...
Specifying rules <ul><li>Pipeline systems often require a rule base </li></ul><ul><ul><ul><li>only do nuc to nuc alignment...
Embedding prolog facts <data relation=“fastadb” cols=“ID,SeqAlphabet,SOType,Org” del=“ws”> </data> <prolog> nucdb(D):- fas...
BioMake module system <ul><li>The biomake core language is generic  </li></ul><ul><ul><li>no bioinformatics-specific code ...
biomake extensibility <ul><li>biomake is a declarative language </li></ul><ul><li>embodies both logical and functional par...
BioMake in use <ul><li>currently we’re using biomake for… </li></ul><ul><ul><li>analysis of repeat families found in ortho...
Running biomake <ul><li>Get distro from  http://skam.sourceforge.net </li></ul><ul><li>Requires XSB Prolog </li></ul><ul><...
Acknowledgements Shengqiang Shu Sima Misra Erwin Frise Eric Smith Mark Yandell George Hartzell Chris Smith Simon Prochnik ...
Problem Specification <ul><li>A build network consist of multiple  Targets </li></ul><ul><ul><li>e.g  the output from a  b...
Upcoming SlideShare
Loading in …5
×

BioMake BOSC 2004

903 views

Published on

http://open-bio.org/bosc2004/accepted_abstracts.html

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
903
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

BioMake BOSC 2004

  1. 1. BioMake Chris Mungall Berkeley Drosophila Genome Project [email_address]
  2. 2. build networks <ul><li>Many bioinformaticians spend large amounts of time coding and running build/make networks </li></ul><ul><li>A build network is a recipe describing the execution of a collection of interdependent heterogeneous tasks </li></ul><ul><ul><li>sequence analysis pipelines </li></ul></ul><ul><ul><li>data ‘compilation’ </li></ul></ul><ul><ul><ul><li>importing, transforming and exporting data </li></ul></ul></ul><ul><ul><li>LIMS </li></ul></ul><ul><li>Error prone, tedious repetitive code and hard to configure </li></ul>
  3. 3. Existing approaches <ul><li>Run tasks by hand, or with ad-hoc scripts </li></ul><ul><ul><li>doesn’t scale, leads to insanity </li></ul></ul><ul><li>Unix/GNU makefiles </li></ul><ul><ul><li>++: concise, generic, high level abstraction </li></ul></ul><ul><ul><li>--: limited expressive power, hacky </li></ul></ul><ul><li>makefile replacements (cons,scons,ant,build…) </li></ul><ul><ul><li>geared towards software development </li></ul></ul><ul><li>Bio compute pipeline software (biopipe, enspipe) </li></ul><ul><ul><li>excellent for certain tasks </li></ul></ul><ul><ul><li>not completely generic </li></ul></ul>
  4. 4. biomake: executable computational protocols <ul><li>A declarative language for specifying build networks </li></ul><ul><ul><li>concise, Turing-complete, highly configurable </li></ul></ul><ul><li>Dependency management </li></ul><ul><ul><li>e.g mask genomic sequence prior to genefinding </li></ul></ul><ul><li>Local and remote job execution </li></ul><ul><ul><li>compute farm job management </li></ul></ul><ul><li>Filesystem or database oriented </li></ul>
  5. 5. Example Genomic Sequence Analysis Pipeline <ul><li>Prepare and cut assembled sequence into slices </li></ul><ul><li>Download latest NR peptide dataset and index it </li></ul><ul><li>Blast genomic slices against NR and other datasets </li></ul><ul><li>RepeatMask genomic slices and run genefinders on masked sequence </li></ul><ul><li>Synthesise gene models and do peptide analysis on results </li></ul><ul><li>Store everything in a relational db, and prepare files for export to public </li></ul>
  6. 6. Target dependencies Assembled Genomic Sequence Chunked Sequence RepeatMasked Sequence Gene Predictions Blast Alignments FastaDB (local) FastaDB (remote) BlastIndexed FastaDB Relational Database Gene Models BlastP & HMM Alignments XML Export XML Import Flatfile Export
  7. 7. Specifying Targets Chunked Sequence Blast Alignments FastaDB (local) BlastIndexed FastaDB formatdb(D) run: formatdb -i D blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/ bn(S).D . P .out A generic target pattern has a name and arguments A target pattern has tags for specifying dependencies, actions and filesystem or database IDs
  8. 8. BioMake Execution TARGET: blast(blastx, my.seq, nr.fly, -) SUBTARGET: formatdb(nr.fly) RUN: formatdb -i nr -p T OUT: nr.fly.{psq,pin,phr} formatdb-nr.fly.OK RUN: blastall -p blastx -i my.seq -d nr.fly OUT: my.seq-blast/my.seq.nr.fly.blastx.out my.seq-blast/my.seq.nr.fly.blastx.out.OK formatdb(D) run: formatdb -i D blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/ bn(S).D . P .out
  9. 9. Targets can be nested bop(S,B) run: apollo -bop -s S B -o target flat: B .game.xml store(XML) run: xml2db XML store(bop(my.seq, genscan(repeatmask(my.seq, drosophila)))) target instantiations can be thought of as a skolem IDs repeatmask(S,Org) run: repeatmask S -a Org flat: S .mask genscan(S) run: genscan S flat: S -pred/ bn(S) .genscan.out
  10. 10. Iteration <ul><li>Pipelines frequently involve iterating over collections of data: </li></ul><ul><ul><li>Perform a sequence analysis on every entry in a multi-fasta format file </li></ul></ul><ul><ul><li>Perform a peptide analysis on every gene prediction in some genscan output </li></ul></ul><ul><ul><li>Query a database for a list of IDs and perform some task on each </li></ul></ul><ul><li>biomake has language constructs for iteration </li></ul>
  11. 11. Iterating over datasets splitfasta(F) run: splitfasta.pl -d F -split -md5 F flat: F -split/ bn(F). pathlist comment: splitfasta.pl is part of the biomake distro analyze_multifasta(F) iterate: analyze_seq(S) where S in splitfasta(F) analyze_seq(S) req: genie(S) blast( blastx ,S, nr ,-) MultiFasta >seq1 TAGGTATTGGTT AGGTGCGTCCTC >seq2 GCGGTATAGCTT TTCCTTCTCTCT >seq3 CAAAGCAGAGAT ATATTTATTCGC >seq1 TAGGTATTGGTT AGGTGCGTCCTC >seq2 GCGGTATAGCTT TTCCTTCTCTCT >seq3 CAAAGCAGAGAT ATATTTATTCGC seq1.genie.out seq1.nr.blastx.out seq2.genie seq2.nr.blastx.out seq3.nr.blastx.out seq3.genie
  12. 12. Controlling the runmode <ul><li>Tasks can be run locally or on a compute farm, synchronously or asynchronously </li></ul><ul><ul><li>wrapper provided for PBS </li></ul></ul><ul><li>runmode: tag states the mode and wrapper for a particular target pattern </li></ul><ul><ul><li>can be set globally and per-pattern </li></ul></ul><ul><li>special status targets provide execution status </li></ul>
  13. 13. runmode example blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/ bn(S).D . P .out runmode: async(qsubwrap) The blast job will be executed on the compute farm via qsubwrap (comes with biomake distro) Upon submission, the status target status_run(blast(P,S,D,A)) will be generated; on completion, the target status_ok (blast(P,S,D,A)) will be generated biomake can automatically handle moving data in and out between user’s filesystem (or db) and local cluster nodes
  14. 14. Datastores <ul><li>BioMake persists targets in Datastores </li></ul><ul><li>The flat: tag flattens targets to unique datastore IDs </li></ul><ul><li>Datastore can be filesystem or relational database </li></ul><ul><ul><li>default is filesystem </li></ul></ul><ul><ul><li>can be set globally or per target </li></ul></ul><ul><ul><ul><li>e.g. analysis result targets can be stored on filesystem, status targets stored in DB </li></ul></ul></ul><ul><li>NFS traffic can be avoided on compute farm by storing targets and status targets in a database </li></ul>
  15. 15. Asynchronous Execution local machine running biomake scheduler node NFS For each target T to be built: 1) biomake fetches status of T skips T if status = ok/run 2) biomake stores status_run(T) 3) biomake creates a runner agent script and submits it to the cluster 4) continue onto next target 1) agent fetches any input data 2) agent runs command (eg blast) synchronously 3) agent stores result 4) agent stores status of T as ‘ ok’ or ‘err’ AGENT AGENT
  16. 16. Specifying rules <ul><li>Pipeline systems often require a rule base </li></ul><ul><ul><ul><li>only do nuc to nuc alignments on one species or two recently diverged species </li></ul></ul></ul><ul><ul><ul><li>use sequence ontology hierarchy to decide analyses or parameters </li></ul></ul></ul><ul><li>biomake protocols can have prolog facts and rules embedded inside them </li></ul><ul><li>biomake distro comes with SO prolog db and rules for graph traversal </li></ul>
  17. 17. Embedding prolog facts <data relation=“fastadb” cols=“ID,SeqAlphabet,SOType,Org” del=“ws”> </data> <prolog> nucdb(D):- fastadb(D,na,_,_). pepdb(D):- fastadb(D,aa,_,_). </prolog> formatdb(D) run: formatdb -i D -p ‘T’ {pepdb(D)} run: formatdb -i D -p ‘F’ {nucdb(D)} na na aa D melanogaster cDNA cdna.fly.fst D melanogaster EST est.fly.fst D melanogaster polypeptide protein.fly.fst
  18. 18. BioMake module system <ul><li>The biomake core language is generic </li></ul><ul><ul><li>no bioinformatics-specific code or tweaks </li></ul></ul><ul><li>biomake uses a module system </li></ul><ul><li>biomake distro comes with </li></ul><ul><ul><li>biosequence_analysis module </li></ul></ul><ul><ul><li>Sequence Ontology prolog db and rules </li></ul></ul><ul><ul><li>scripts for handling bioinformatics data </li></ul></ul>
  19. 19. biomake extensibility <ul><li>biomake is a declarative language </li></ul><ul><li>embodies both logical and functional paradigms </li></ul><ul><li>targets are actually higher order functions </li></ul><ul><li>standard FP functions available in fp module </li></ul><ul><ul><ul><li>cons,map,grep,filter,fold,… </li></ul></ul></ul><ul><li>goal: expressive power, concise specifications and simplicity </li></ul>
  20. 20. BioMake in use <ul><li>currently we’re using biomake for… </li></ul><ul><ul><li>analysis of repeat families found in orthologous and paralogous introns </li></ul></ul><ul><ul><li>Building the Gene Ontology database </li></ul></ul><ul><li>… but we are still dependent on legacy pipeline code for many analyses </li></ul>
  21. 21. Running biomake <ul><li>Get distro from http://skam.sourceforge.net </li></ul><ul><li>Requires XSB Prolog </li></ul><ul><ul><li>http://xsb.sourceforge.net </li></ul></ul><ul><li>Run via command line </li></ul><ul><ul><li>similar to unix make command </li></ul></ul><ul><li>Works on both OS X and Linux </li></ul><ul><li>Relational datastore requires mysql (Pg soon) </li></ul><ul><li>Better docs coming soon, lots of examples </li></ul>
  22. 22. Acknowledgements Shengqiang Shu Sima Misra Erwin Frise Eric Smith Mark Yandell George Hartzell Chris Smith Simon Prochnik Jon Tupy Josh Kaminker Karen Eilbeck Nomi Harris Suzanna Lewis Gerry Rubin
  23. 23. Problem Specification <ul><li>A build network consist of multiple Targets </li></ul><ul><ul><li>e.g the output from a blastx alignment of my.seq to the protein database nr.fly </li></ul></ul><ul><li>Targets have a logical pattern </li></ul><ul><ul><li>e.g blast alignment using P of some seq S vs some db D </li></ul></ul><ul><li>Targets are dependent on other Targets </li></ul><ul><ul><li>e.g blast depends on the indexing of db D using formatdb </li></ul></ul><ul><li>Upstream changes trigger downstream actions </li></ul><ul><li>Targets are built by running actions </li></ul><ul><ul><li>“ formatdb -p T -i nr.fly ” </li></ul></ul><ul><ul><li>“ blastall -p blastx -i my.seq -d nr.fly ” </li></ul></ul>

×