
Experiences with logic programming in bioinformatics

Invited talk at ICLP 2009


  1. 1. Experiences using logic programming in bioinformatics <br />Chris Mungall<br />Berkeley Bioinformatics and Ontologies Group<br /><br />Lawrence Berkeley National Laboratory<br />ICLP 2009<br />
  2. 2. Outline<br />Biology and biological data integration: a brief introduction<br />Obol: First experiences applying LP<br />Blipkit: a reusable bioinformatics developer’s toolkit<br />Modular structure<br />I/O and relational database connectivity<br />Some applications of Blipkit and LP<br />Genes and genomics<br />Phenotype matching<br />Web applications<br />Conclusions<br />Where next? Some recommendations for the LP community<br />
  3. 3. The promise and challenges of biological research<br />Why study biological systems?<br />Because they’re fascinating<br />Improve health<br />Improve the environment<br />BUT: Biology is hard<br />Biological systems are extremely diverse<br />Biology deals with phenomena at multiple levels of granularity<br />There is a deluge of data<br />Bioinformatics<br />Biology as an information science<br />Computational methods vital to understanding <br />
  4. 4. Diversity of biological systems<br />
  5. 5. Biology in the small: Molecules<br />DNA<br />RNA<br />pseudoknot<br />
  6. 6. Cells and organismal biology<br />gastrulation<br />blastula<br />gastrula<br />axon terminal<br />dendrite<br />node of Ranvier<br />soma<br />axon<br />Schwann cell<br />cell nucleus<br />myelin sheath<br />
  7. 7. Ecosystems<br />
  8. 8. bio-databases<br />1200 Biological Databases published in Nucleic Acids Research<br />many more unpublished<br />many of these are database federations (e.g. Ensembl)<br />Heterogeneous systems<br />Storage mechanism:<br />Relational<br />XML<br />Flat files<br />Ad-hoc, semi-structured, natural language<br />Limited APIs<br />lack of standards<br />limited query expressivity<br />Poorly integrated<br />Limited integration beyond identifier cross-references<br />Users must manually integrate<br />Bioinformatics runs on Perl glue<br />metabolic pathways<br />mutants<br />genes<br />fruit flies<br />tumors<br />
  9. 9. Data interrogation and discovery<br />Sample of tasks<br />Find mutations in regions upstream of neurotransmitter-producing genes<br />Find drug targets or animal models for neurodegenerative diseases<br />What biological pathways are enriched in high acidity environments?<br />Answering each of these is difficult<br />Manual aggregation from lots of databases<br />Various kinds of inference required<br />
  10. 10. OBO: Open Biological Ontologies<br />small<br />large<br />
  11. 11. Obol: First experience with LP in bioinformatics<br />Problem<br />Many existing bio-ontologies were in fact more like terminologies<br />Basic axioms, is_a hierarchies<br />Deeper logical structure implicit in terms<br />Long noun phrases, recursively composed<br />“regulation of transcription during G1 phase of mitotic cell cycle”<br />Existing solutions (2004)<br />Take advantage of semi-controlled syntax of terms<br />Parse using ad-hoc regular expressions<br />Influence of perl in bioinformatics!<br />But context-free grammars (at least) were required<br />
  12. 12. A better solution: Definite Clause Grammars<br />Obol: A collection of domain specific DCGs<br />Significant improvement over Perl regular expressions<br />Declarative<br />More expressive<br />Integration with simple reasoning<br />Bi-directional:<br />can be used for term generation from logical expressions<br />
  13. 13. Example process grammar<br />process(P) --&gt; regulation(P) | specification(P) | transcription(P) | ...<br />process(P and during(W)) --&gt; process(P),[during],process(W).<br />process(P and part_of(W)) --&gt; process(P),[of],process(W).<br />regulation( regulates(P) ) --&gt; [regulation,of],process(P).<br />specification( specifies(C) ) --&gt; [specification, of], cell(C).<br />cell(C and part_of(O)) --&gt; organ(O),cell(C).<br />“regulation of transcription during G1 phase of mitotic cell cycle”<br />→<br />regulates(transcription) and during(g1_phase and part_of(mitosis))<br />“regulation of transcription from RNA polymerase II promoter involved in ventral spinal cord interneuron specification”<br />→<br />regulates(transcription and has_signal(rna_pol_ii))<br /> and part_of(specifies(interneuron and part_of(ventral_spinal_cord)))<br />
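The grammar on the slide above is left-recursive (`process` calls `process` first), which is why tabling or chart parsing comes up on the next slide. As a minimal sketch only, assuming SWI-Prolog and a made-up four-word lexicon, the same idea can be written without left recursion by threading an accumulator through the modifiers. The nonterminals `core//1` and `modifiers//2`, the helper `parse/2`, and the priority chosen for the `and` operator are illustrative assumptions, not Obol's actual grammar.

```prolog
% Illustrative DCG sketch (SWI-Prolog); not the Obol grammar itself.
:- op(400, xfy, and).

% process(-Expr): a core phrase followed by zero or more modifiers.
process(E) --> core(C), modifiers(C, E).

% Tiny assumed lexicon.
core(regulates(P))       --> [regulation, of], core(P).
core(transcription)      --> [transcription].
core(g1_phase)           --> [g1, phase].
core(mitotic_cell_cycle) --> [mitotic, cell, cycle].

% modifiers(+Acc, -Expr): fold "during X" / "of X" qualifiers into Acc.
modifiers(Acc, E) --> [during], process(W), modifiers(Acc and during(W), E).
modifiers(Acc, E) --> [of],     process(W), modifiers(Acc and part_of(W), E).
modifiers(E, E)   --> [].

% parse(+Tokens, -Expr): Tokens is a list of lower-case atoms.
parse(Tokens, Expr) :- phrase(process(Expr), Tokens).
```

Calling `parse([regulation,of,transcription,during,g1,phase,of,mitotic,cell,cycle], E)` binds `E` first to the term `regulates(transcription) and during(g1_phase and part_of(mitotic_cell_cycle))`; other attachments of the final "of" phrase are available on backtracking, which is the kind of ambiguity a real grammar has to rank or constrain.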
  14. 14. Implementations<br />Obol v1 : 2005<br />XSB<br />DCGs + tabling Earley / chart parsing<br />Basic ontology reasoning (tabling to avoid cycles)<br />Integration into java editing environment (XSB interprolog)<br />Obol v2 : 2006<br />Port to SWI-Prolog<br />Web interface<br />Earley algorithm implementation<br />Backward chaining for simple reasoning<br />Forward chaining for full reasoning<br />Obol v2.5 : 2007<br />Reversion to plain DCGs<br />careful construction to avoid cycles<br />Current<br />Obol java<br />Obol v3 : 2009<br />In progress<br />OWL-Centric<br />Built on Thea2<br />
  20. 20. Results<br />Obol grammars applied successfully to generate axioms for multiple ontologies<br />particularly the Gene Ontology<br />Still used frequently<br />Lessons learned<br />Small amount of basic LP goes a long way<br />LP techniques not widely known in bioinformatics<br />Different LP systems have different strengths<br />Choosing between them is hard – and frustrating<br />
  21. 21. Could LP prove as successful in the wider bioinformatics arena?<br />Rule-based analysis pipelines<br />prolog &gt; make<br />Integration of ontology reasoning and database queries<br />prolog &gt; datalog &gt; sql<br />Pathways<br />graphs, ASP<br />Genomics<br />Linear transformations, CLP<br />Phylogenetics<br />operations on trees<br />
  22. 22. Toolkit Paradigm: BioPerl<br /><br />Established 1990s<br />Collaborative<br />Open Source, svn repository<br />No funding, all voluntary<br />Modular<br />Namespaces<br />Interrelated<br />Separation of I/O from models<br />Parsers<br />Writers<br />SQL database bindings<br />Publication:<br />The BioPerl toolkit, Stajich et al, Genome Research 2002<br />1044 citations (google scholar)<br />Spinoffs:<br />biojava<br />biopython<br />bioruby<br />bioocaml<br />Parent org:<br />Open Bioinformatics Foundation<br />Issues:<br />object oriented<br />perl!<br />
  23. 23. blipkit: biological programming toolkit<br />A general purpose reusable library<br />Takes care of ‘plumbing’ – parsing, writing, interface<br />Deductive database + application framework<br />Highly modular: one package per domain<br />ontologies<br />genomes<br />structures<br />phylogeny and evolution<br />phenotypes<br />systems biology<br />SWI-Prolog specific<br /><br />
  33. 33. Anatomy of a blip domain package<br />Model(s) of the domain<br />dependencies to other domain modules<br />extensional and intensional predicates<br />I/O<br />parsers/writers for small subset of bioinformatics file formats<br />DCGs or external perl<br />translators for common XML schemas<br />Native prolog serialization of model ‘for free’ <br />Web UI<br />Bridges<br />Relational<br />Other prolog models<br />Ontology models<br />
  34. 34. Domain model modules<br />A model consists of extensional + intensional predicates<br />Extensional predicates<br />Unit clauses / facts - Asserted and/or compiled from fact files<br />Akin to relational tables<br />Intensional predicates<br />Declarative: No I/O side effects<br />Prolog has no built in extensional/intensional distinction<br />All clauses treated equally<br />Facts conventionally declared dynamic/1 and multifile/1<br />Some metamodeling is useful<br />Easy to roll own<br />A standard metamodel module would be useful<br />optional type system + relational DDL style constraints<br />Works as documentation<br />
  35. 35. Example from systems biology model<br />:- module(sb_db,[ reaction_product/2, reaction_reactant/2, reaction_modifier/2, derivation_link/3, …]).<br />:- use_module(bio(dbmeta)). % metamodel<br />%% reaction_product(?R,?P) is nondet<br />% relation between a biochemical reaction and a molecular constituent produced in the reaction<br />:- extensional(reaction_product/2).<br />%% reaction_reactant(?R,?P) is nondet<br />% relation between a biochemical reaction and a molecular constituent that is consumed in the reaction<br />:- extensional(reaction_reactant/2).<br />%% reaction_modifier(?R,?P) is nondet<br />% relation between a biochemical reaction and a molecular constituent that plays a role in the process but is unmodified<br />:- extensional(reaction_modifier/2).<br />% --- INTENSIONAL PREDICATES ---<br />%% derivation_link(?Input,?Output,?Via)<br />% two species directly linked via a connecting<br />% reaction (excludes modifiers)<br />derivation_link(Input,Output,R):-<br />reaction_reactant(R,Input),<br />reaction_product(R,Output).<br />%...[snip]…<br />
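To make the extensional/intensional split concrete, here is a small self-contained sketch that can be loaded into SWI-Prolog without blipkit. It replaces the `extensional/1` directive from `bio(dbmeta)` with plain `dynamic/1`, and the reaction and species names are invented for illustration. The recursive predicate `derives_from/2` is a hypothetical addition, not part of the slide's module, included only to show an intensional rule that needs tabling once the reaction graph may contain cycles.

```prolog
% Extensional predicates: plain facts (illustrative data only).
:- dynamic reaction_reactant/2, reaction_product/2.

reaction_reactant(r1, glucose).
reaction_product(r1, g6p).
reaction_reactant(r2, g6p).
reaction_product(r2, f6p).

% Intensional predicate, as on the slide.
derivation_link(Input, Output, R) :-
    reaction_reactant(R, Input),
    reaction_product(R, Output).

% Hypothetical extra rule: transitive derivation over reactions.
% Tabling keeps the recursion terminating on cyclic pathways.
:- table derives_from/2.
derives_from(Output, Input) :-
    derivation_link(Input, Output, _).
derives_from(Output, Input) :-
    derivation_link(Mid, Output, _),
    derives_from(Mid, Input).
```

With these facts, `derives_from(f6p, X)` enumerates `g6p` and `glucose`; tabling does not guarantee the order of answers, so collect and sort them if order matters.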
  36. 36. Integrating with relational databases<br />Most biological data stored in relational databases<br />Many provide open SQL ports for distributed queries<br />RDBs scale well with large quantities of data<br />…but RDBs lack necessary deductive capabilities<br />Expressivity Hierarchy<br />FOL<br />Pure prolog<br />Datalog<br />Relational Model<br />Using prolog with RDBs should be easy… right?<br />
  37. 37. sql_compiler<br />Given a mapping to a relational schema:<br />rewrites prolog terms as SQL queries<br />Used in conjunction with db connectivity module<br />History<br />Draxler, 1992<br />Source forked, modified versions available with various prologs<br />Blip includes extensions to<br />Rewrite sub-optimal queries<br />Rewrite non-recursive prolog clauses<br />Integrate with SWI ODBC<br />
  38. 38. Example query rewriting<br />program:<br />derivation_link(Input,Output,R):-<br />reaction_reactant(R,Input),<br />reaction_product(R,Output).<br />mapping:<br />reaction_reactant(R,P) &lt;-<br />reac_in(R,P).<br />reaction_product(R,P) &lt;-<br />reac_out(R,P).<br />schema metadata:<br />relation(reac_in,2).<br />attribute(1,reac_in,reac_id,int).<br />attribute(2,reac_in,input_id,int).<br />relation(reac_out,2).<br />attribute(1,reac_out,reac_id,int).<br />attribute(2,reac_out,output_id,int).<br />program rewriting:<br />?- sqlbind(sb_db:all, mydb).<br />call goal:<br />?- derivation_link(X,Y,R).<br />query rewriting:<br />SELECT <br />reac_in.reac_id,<br />reac_in.input_id,<br />reac_out.output_id<br />FROM reac_in, reac_out<br />WHERE reac_in.reac_id=reac_out.reac_id;<br />
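The rewriting step above can be illustrated with a toy translator. This is a sketch under strong assumptions and is not Draxler's sql_compiler or blip's extension of it: it handles only a conjunction of mapped goals with variable arguments, and it returns a structured term `sql(Select, From, Where)` rather than SQL text. The predicates `column/3`, `mapping/2` and `rule_to_sql/3` are hypothetical names introduced here.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% column(Table, ArgPosition, ColumnName): toy schema metadata.
column(reac_in,  1, reac_id).
column(reac_in,  2, input_id).
column(reac_out, 1, reac_id).
column(reac_out, 2, output_id).

% mapping(Predicate, Table): table backing each extensional predicate.
mapping(reaction_reactant, reac_in).
mapping(reaction_product,  reac_out).

% rule_to_sql(+Head, +BodyGoals, -sql(Select, From, Where))
rule_to_sql(Head, Goals, sql(Select, From, Where)) :-
    maplist(goal_bindings, Goals, From, PerGoal),
    append(PerGoal, Bindings),
    Head =.. [_|HeadVars],
    maplist(select_column(Bindings), HeadVars, Select),
    join_conditions(Bindings, Where).

% Pair every argument variable of a goal with its table column.
goal_bindings(Goal, Table, Bindings) :-
    Goal =.. [Name|Args],
    mapping(Name, Table),
    arg_bindings(Args, 1, Table, Bindings).

arg_bindings([], _, _, []).
arg_bindings([V|Vs], N, Table, [V-col(Table, Col)|Bs]) :-
    column(Table, N, Col),
    N1 is N + 1,
    arg_bindings(Vs, N1, Table, Bs).

% A head variable is selected from the first column it is bound to.
select_column(Bindings, Var, Col) :-
    member(V-Col, Bindings),
    V == Var, !.

% A variable shared by two columns becomes an equality (join) condition.
join_conditions([], []).
join_conditions([V-C|Bs], Conds) :-
    (   member(W-D, Bs), W == V
    ->  Conds = [C = D|Rest]
    ;   Conds = Rest
    ),
    join_conditions(Bs, Rest).
```

For the rule on the slide, `rule_to_sql(derivation_link(I,O,R), [reaction_reactant(R,I), reaction_product(R,O)], SQL)` yields a select list over `reac_in.input_id`, `reac_out.output_id` and `reac_in.reac_id`, the two tables in the from list, and the single join condition on `reac_id`. Variable identity is tested with `==/2` rather than unification so that distinct variables are never accidentally merged.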
  39. 39. Obtaining data from web services<br />Many large bioinformatics data providers provide RESTful APIs<br />NCBI<br />caBIG<br />SWI libraries used<br />http_client<br />sgml (for parsing XML payloads)<br />XML -&gt; Models<br />Direct translation of sgml too low level<br />XSLT-inspired prolog template-oriented processing language<br />Application:<br />ontology enhanced search term expansion<br />E.g. “find all genes implicated in neurodegenerative disease”<br /> ‘parkinsons’ OR ‘alzheimers’ OR …<br />
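The search term expansion mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written, acyclic `is_a/2` hierarchy; the class names and the predicates `subclass_of/2` and `expand_term_query/2` are made up here and do not come from blipkit or from any provider's API.

```prolog
% Illustrative ontology fragment (assumed, acyclic).
is_a(parkinsons, neurodegenerative_disease).
is_a(alzheimers, neurodegenerative_disease).
is_a(early_onset_alzheimers, alzheimers).

% subclass_of(?Sub, ?Super): transitive closure of is_a/2.
subclass_of(X, Y) :- is_a(X, Y).
subclass_of(X, Z) :- is_a(X, Y), subclass_of(Y, Z).

% expand_term_query(+Class, -Query): OR together all subclasses of Class.
expand_term_query(Class, Query) :-
    findall(Sub, subclass_of(Sub, Class), Subs0),
    sort(Subs0, Subs),          % stable order, duplicates removed
    Subs \== [],
    atomic_list_concat(Subs, ' OR ', Query).
```

The resulting atom would then be passed as the query parameter of whichever web service is being searched; that step depends on the provider and is left out here.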
  40. 40. Applications of Blipkit and LP techniques<br />Genomics and DNA sequences<br />Deduction of implicit information<br />Consistency checking of genome datasets<br />Phenotype matching<br />Finding similarities of mutational effects<br />
  41. 41. Genome inference<br />Deluge of genomic data<br />Cost per genome decreasing<br />Soon we will all know our genome sequence<br />But what does it mean?<br />Effective use of genomics data relies on deductive inference<br />Many rules are logical: genome calculus<br />Currently encoded using ad-hoc imperative code<br />Probabilistic inference also useful<br />But must be built on top of the logical inference<br />
  43. 43. DNA<br />human chromosome 1:<br /> 247m base pairs, 4220 genes<br />Entire genome:<br /> 3x10^9 bp, 20k genes<br />T<br />A<br />G<br />C<br />Gene expression:<br />transcription<br />splicing<br />translation<br />
  44. 44. Transcription<br />A subsequence of a DNA sequence is transcribed to an RNA sequence<br />Regulated by sequences called promoters and enhancers<br />
  45. 45. Splicing<br />Zero or more subsequences (introns) of the RNA sequence are spliced out. The remaining sequences (exons) are joined together at splice sites.<br />guided by splice site sequences<br />combinatorial possibilities<br />
  46. 46. Translation<br />5’ UTR (upstream)<br />CDS<br />3’ UTR (downstream)<br />exon 1<br />exon 2<br />exon 3<br />A subsequence of the RNA sequence (the Coding Sequence region, CDS) is translated using a genetic translation table.<br />{A,C,G,U}^3 → amino acid<br />Not all RNAs are coding<br /><br />Formalization of gene expression<br />Genome calculus<br />operations on linear sequences<br />subsequence, join, translate<br />Certain sequence types are entailed by other sequences<br />Calculus is surprisingly conserved across all life<br />but biology is fuzzy and full of exceptions<br />Archaea utilize different translation table<br />Nematodes add trans-splicing<br />Mammalian introns are huge<br />Many genes are co-transcribed<br />Viral genes overlap in different translation frames… <br />
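The three genome calculus operations named above (subsequence, join, translate) can be sketched directly as Prolog predicates over lists of bases. This is an illustration only: sequences are lists of single-character atoms, coordinates are assumed to be 1-based and inclusive, and the codon table contains just four of the 64 standard codons. The predicate names are hypothetical, not blipkit's.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% subsequence(+Seq, +Start, +End, -Sub): 1-based, inclusive coordinates.
subsequence(Seq, Start, End, Sub) :-
    Skip is Start - 1,
    Len is End - Start + 1,
    length(Prefix, Skip),
    append(Prefix, Rest, Seq),
    length(Sub, Len),
    append(Sub, _, Rest).

% splice(+PreRNA, +ExonIntervals, -Spliced): join the exon subsequences.
splice(Seq, Exons, Spliced) :-
    maplist(exon_subsequence(Seq), Exons, Parts),
    append(Parts, Spliced).

exon_subsequence(Seq, Start-End, Sub) :-
    subsequence(Seq, Start, End, Sub).

% codon(+Triplet, -AminoAcid): a small excerpt of the standard table.
codon([a,u,g], met).
codon([u,u,u], phe).
codon([g,g,c], gly).
codon([u,a,a], stop).

% translate(+CDS, -Protein): read codons until a stop codon or the end.
translate([], []).
translate([A,B,C|Rest], Protein) :-
    codon([A,B,C], AA),
    (   AA == stop
    ->  Protein = []
    ;   Protein = [AA|More],
        translate(Rest, More)
    ).
```

For example, after `atom_chars(augguaguuuggcuaa, Pre)`, the goal `splice(Pre, [1-3, 8-16], M), translate(M, P)` removes the intron at positions 4 to 7 and gives `P = [met, phe, gly]`. The exceptions listed on the slide (alternative translation tables, trans-splicing, overlapping frames) are exactly the cases such a naive encoding does not cover.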
  47. 47. Genomics databases<br />Genome databases are important for<br />biomedicine<br />understanding evolution at a molecular level<br />Problem: genome databases are incomplete<br />stating all implicit features leads to redundancy<br />integration and complex queries difficult<br />ad-hoc rules embedded in imperative code<br />Problem: genome databases are inconsistent<br />Different interpretation of gene, exon, UTR etc<br />
  48. 48. Solution: Sequence Ontology + Deductive Database<br />The Sequence Ontology standardizes sequence terms<br />Additional axioms are being added<br />Encoding genome calculus<br />Genome relations based on Allen Interval Algebra<br />Can be used in conjunction with a deductive genome database<br />consistency checking<br />does this genome dataset make sense?<br />inference and querying<br />what entities are present in region X?<br />
  49. 49. Sequence relationship predicates based on Allen Interval Algebra<br />no recursion<br />conjunction of binary terms<br />uses arithmetic (for efficiency)<br />Extensions:<br />strands<br />circular genomes<br />upstream_of(X,Y) :-<br />has_end(X,XE),<br />has_start(Y,YS),<br /> XE &lt; YS. <br />?- upstream_of(exon3,X).<br />X=exon1 ;<br />X=exon2<br />exon3<br />exon1<br />exon2<br />exon4<br />exon5<br />
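A runnable sketch of such interval predicates follows, assuming features are stored as `feature(Id, Start, End)` facts with invented coordinates on a single strand. Only `upstream_of/2` comes from the slide; `contained_by/2` and `overlaps/2` are illustrative additions in the same non-recursive, arithmetic style, and strands and circular genomes are ignored.

```prolog
% feature(Id, Start, End): illustrative coordinates, one strand only.
feature(exon1, 100, 200).
feature(exon2, 300, 400).
feature(exon3, 500, 600).
feature(gene1, 100, 600).

has_start(X, S) :- feature(X, S, _).
has_end(X, E)   :- feature(X, _, E).

% upstream_of(?X, ?Y): X ends before Y starts (Allen "before").
upstream_of(X, Y) :-
    has_end(X, XE),
    has_start(Y, YS),
    XE < YS.

% contained_by(?X, ?Y): X lies within Y (covers Allen during/starts/finishes).
contained_by(X, Y) :-
    feature(X, XS, XE),
    feature(Y, YS, YE),
    X \== Y,
    YS =< XS,
    XE =< YE.

% overlaps(?X, ?Y): X and Y share at least one position.
overlaps(X, Y) :-
    feature(X, XS, XE),
    feature(Y, YS, YE),
    X \== Y,
    XS =< YE,
    YS =< XE.
```

Each predicate is a conjunction of binary lookups plus an arithmetic comparison, so the same rules could in principle be handed to a rule-to-SQL rewriter as range conditions.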
  50. 50. Intron-exon inference<br />intron( i(T,S,E) ) :- <br /> exon(X1),<br /> exon(X2),<br /> has_end(X1,S,T),<br /> has_start(X2,E,T),<br /> \+ ((exon(X3),<br /> contained_by(X3,T),<br /> starts_after_start_of(X3,X1),<br /> ends_before_end_of(X3,X2))).<br />function terms as arguments<br />possibility of recursion through negation<br />exon(exon1). exon(exon2).<br />has_end(exon1,1000,t1).<br />has_start(exon2,2000,t1).<br />?- intron(X).<br />X = i(t1,1000,2000)<br />t1<br />exon1<br />exon2<br />
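The rule above depends on interval predicates that are not shown on the slide. A self-contained variant, as a sketch only, replaces them with direct coordinate comparisons and adds the ordering test `S < E`; the three exons and their coordinates are invented, and the "no exon in between" test is a simplification of the slide's `contained_by`/`starts_after_start_of`/`ends_before_end_of` conjunction.

```prolog
% Illustrative facts: three exons on transcript t1.
exon(exon1). exon(exon2). exon(exon3).

has_start(exon1,  100, t1).  has_end(exon1, 1000, t1).
has_start(exon2, 2000, t1).  has_end(exon2, 2500, t1).
has_start(exon3, 4000, t1).  has_end(exon3, 4500, t1).

% intron(-i(T,S,E)): the gap between two exons of the same transcript
% with no other exon of that transcript lying inside the gap.
intron(i(T, S, E)) :-
    exon(X1),
    exon(X2),
    has_end(X1, S, T),
    has_start(X2, E, T),
    S < E,
    \+ ( exon(X3),
         has_start(X3, S3, T),
         has_end(X3, E3, T),
         S3 > S,
         E3 < E ).
```

The introns are never asserted; they exist only as function terms `i(T,S,E)` built by the rule, which is the inference over unnamed individuals that the OWL slide below identifies as a limitation.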
  52. 52. OWL implementation<br />Many axioms cannot be expressed in OWL<br />Interval relations – no arithmetic in OWL<br />option 1: use SWRL<br />option 2: enumerate all base pairs and use property chain axioms<br />Cannot infer properties of unnamed individuals<br />E.g. introns from exons<br />Cyclic structures cannot be described<br />Requires Description Graph extension<br />Open World Assumption<br />useful for semantic web<br />CWA is more convenient for genomics<br />
  53. 53. Deductive database implementation<br />Methods:<br />Convert sequence ontology OWL-&gt;DLP via Thea2<br />Manually edit<br />Add rules that cannot be expressed in OWL<br />Tested on XSB and Yap<br />requires tabling<br />Results<br />Currently scales to small regions<br />more debugging required<br />difficult to eliminate unstratified negation<br />
  54. 54. Disjunctive datalog implementation<br />Adds:<br />Constraints<br />Disjunctions in rule heads<br />Implementation<br />DLV-Complex : allows functions in arguments<br />Program written from scratch: Rules must be ‘safe’<br />Results<br />Scales over small regions<br />Useful for detecting inconsistencies in data<br />More research needed<br />More efficient programs<br />Use of relational database backend<br />Further exploration of ASP semantics<br />Genomic rules have many exceptions<br />
  55. 55. Prolog implementation<br />Removes:<br />rules that cause cycles with backtracking<br />Implementation<br />Optional use of Nested Containment List library (C + SWI FLI)<br />Results<br />Results can be incomplete due to missing rules<br />E.g. intron :- exon, but not exon :- intron<br />Ruleset can be tailored for dataset<br />Scales over medium sized datasets<br />
  56. 56. Hybrid Prolog-Relational implementation<br />Uses same program as prolog implementation<br />Relational database stores facts (extensional)<br />can be distributed<br />Uses sql_compiler + mappings to genomics databases<br />Ensembl<br />Chado<br />Non-recursive prolog rules dynamically translated to complex SQL<br />Recursive subclass rules translated<br />by query compiler using UNIONs<br />precomputed and stored in relational database<br />Scales to full genomes<br />
  57. 57. LP for genomics: conclusions<br />No one paradigm is perfect<br />Many axioms cannot be expressed in OWL<br />but tools are good<br />Disjunctive Datalog good for consistency checking in small regions<br />More research required on efficiency of tabling solution, ASPs<br />WAM solution most efficient<br />Manually rewriting programs is tedious!<br />Hybrid solutions useful<br />RDBs for asserted facts<br />
  58. 58. Application: for diseases<br />Organisms have phenotypes<br />characteristics under the control of the genes of that organism<br />Related genes can have similar phenotypic effects<br />even when the genes' last common ancestor lived 500 million years ago<br />Finding these genes can help understand<br />disease<br />evolution<br />
  59. 59. Application: for diseases<br />
  60. 60. Semantic Similarity<br />Given a collection of<br />features F = {f1, f2, …}<br />attributes A = {a1, a2, …}<br />feature-attribute mappings:<br />a ⊆ F × A, with a(f) the set of attributes mapped to feature f<br />For any feature pair x,y, calculate:<br />Jaccard coefficient<br />|a(x) ∩ a(y)| / |a(x) ∪ a(y)|<br />maximum IC<br />IC(a) = -log2 p(a)<br />maxIC(x,y) = max{IC(a) : a ∈ a(x) ∩ a(y)}<br />
  61. 61. SWI-Prolog implementation<br />Uses GMP<br />normal prolog programs have unbounded integer arithmetic<br />allows fast bitwise implementations of set intersection/union<br />Encode feature attribute lists as integers<br />m : A → {0, …, |A|-1}<br />ai(f) = ∑ 2^m(a) over a ∈ a(f)<br />Set intersection and union computed using bitwise and/or<br />Fast implementation of Jaccard coefficient<br />J is popcount(A1 /\ A2) / popcount(A1 \/ A2)<br />
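The encoding described above can be sketched as follows, assuming SWI-Prolog (its arithmetic provides `popcount/1`, and integers are unbounded when built with GMP). The attribute names and their bit positions are invented for illustration, and `attributes_bitset/2` and `jaccard/3` are hypothetical helper names, not blipkit predicates.

```prolog
:- use_module(library(apply)).

% attribute_index(Attribute, BitPosition): the mapping m (illustrative).
attribute_index(dystrophic,     0).
attribute_index(atrophied,      1).
attribute_index(decreased_size, 2).
attribute_index(muscle,         3).

% attributes_bitset(+Attributes, -Int): sum of 2^m(a) over the attributes,
% computed by setting one bit per attribute.
attributes_bitset(Attrs, Int) :-
    foldl(set_bit, Attrs, 0, Int).

set_bit(Attr, Acc0, Acc) :-
    attribute_index(Attr, I),
    Acc is Acc0 \/ (1 << I).

% jaccard(+A1, +A2, -J): |A1 ∩ A2| / |A1 ∪ A2| on bitset-encoded sets.
% Fails when both sets are empty, since the ratio is undefined.
jaccard(A1, A2, J) :-
    Union is A1 \/ A2,
    Union > 0,
    J is popcount(A1 /\ A2) / popcount(Union).
```

Two features annotated with `[dystrophic, decreased_size, muscle]` and `[atrophied, decreased_size, muscle]` share two of four attributes, so their coefficient is 0.5. Because each set is a single integer, intersection and union cost one machine-level operation per word regardless of how the attribute lists were originally ordered.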
  62. 62. Similarity metrics + reasoning<br />Attributes are description logic class expressions<br />rarely exact matches across species<br />a(human1) = {dystrophic ∩ ∃quality_of.arm_muscle}<br />a(zebrafish7) = {atrophied ∩ ∃quality_of.pectoral_fin_muscle}<br />dystrophic ∩ ∃quality_of.arm_muscle ≠ atrophied ∩ ∃quality_of.pectoral_fin_muscle<br />a(human1) ∩ a(zebrafish7) = {} <br />
  63. 63. Use reasoning to find subsumer<br />Find Least Common Ancestor expression<br />typically class expression, not named class<br />a(human1) = {dystrophic ∩ ∃quality_of.arm_muscle}<br />a(zebrafish7) = {atrophied ∩ ∃quality_of.pectoral_fin_muscle}<br />Least Common Ancestor: decreased_size ∩ ∃quality_of.muscle_of_upper_limb<br />a*(human1) ∩ a*(zebrafish7) = {decreased_size ∩ ∃quality_of.muscle_of_upper_limb}<br />
  64. 64. Implementation: Uses Thea2<br />Thea2 is a prolog package for OWL2<br /><br />reads/writes<br />RDF/XML<br />OWL-XML<br />Native prolog form<br />Description Logic Programs (DLPs)<br />Reasoning strategies<br />Prolog<br />DL reasoners (via JPL/OWLAPI)<br />SQL DB + forward chaining<br />
  65. 65. Phenotype matching: Results<br />Proof of concept on 10 human disease genes<br />publication forthcoming<br />Currently applying to neurodegenerative diseases<br />Funding to extend to all Mendelian diseases<br />
  66. 66. Web Applications<br /><br />Web interface to Open Bio Ontologies<br />Implemented in perl + SWI-Prolog<br />Prototype for future development<br />SWI-Prolog<br />Production version in perl and/or java<br />
  67. 67. Experiences using LP for bioinformatics: conclusions<br />A little bit of LP goes a long way<br />The theory-application gap is largely untapped<br />A variety of LP paradigms are useful<br />ASP, datalog, DLs, prolog, ILP, …<br />Interoperation can be hard!<br />LP for ‘real world’ applications<br />It is possible!<br />Declarative approach arguably superior<br />Web/database applications are a sweet spot<br />We need to show more success stories<br />..and to dispel myths<br />
  68. 68. Recommendation: make it easier for users<br />Documentation:<br />Unify community knowledge in a single wiki<br />Create a general LP mail list<br />c.f. OWL/SemWeb community<br />Tools:<br />Program analysis<br />Lint-like tool for tabled prologs, ASP<br />Visualization<br />Libraries<br />CPAN for Prolog<br />
  69. 69. Recommendation: make it open-source<br />Why<br />Encourages collaboration<br />Bioinformaticians love open source<br />The people who fund bioinformaticians love open source<br />Open source can still generate revenue<br />How<br />Deposit code in open source code repositories<br />github, sourceforge, googlecode, etc<br />Embrace Web 2.0<br />blog it, put it on a wiki<br />
  70. 70. Recommendation: interoperate with RDBs<br />Why?<br />RDBs and LP should be a natural match<br />Application developers are conservative and familiar with RDBs<br />lightweight in-memory embedded RDBs are becoming more popular<br />How:<br />Hide LP systems behind pseudo-SQL interface<br />SQL queries and DDL translated behind the scenes, cf. sql_compiler<br />Users can use native LP syntax and semantics as they feel comfortable<br />Embed LP systems directly in RDBs<br />E.g. PostgreSQL extensions<br />Improve prolog-&gt;SQL interfaces<br />Common API c.f. JDBC (Java), DBI (Perl)<br />
  71. 71. Recommendation: A unified API to all LP systems<br />Use case:<br />calling LP system from host language (java, perl, ruby, even other prolog)<br />Problem:<br />No standardization amongst APIs<br />Analogous problem:<br />RDB APIs<br />Solved: a 20th century problem<br />Proposal:<br />Common REST interface<br />Single interface per host language<br />
  72. 72. Interoperation between LP systems<br />LP systems (ILP, ASP, Prolog, …) differ in whether they accept:<br />Foo(x).<br />'Foo'(x).<br />'foobar'(x).<br />foo('xy').<br />foo("xy").<br />Non-prolog systems should:<br />Adhere to ISO standard for intersection with pure prolog<br />Or at least provide ISO mode<br />Also:<br />ISO Common Logic<br />W3C RIF<br />
  73. 73. Future directions<br />Scalable LP<br />Probabilistic + logic modeling<br />CLP(Bayes)<br />PRISM<br />
  74. 74. Robot scientist<br />The Automation of Science<br />King et al.<br />Science 3 April 2009: 85-89<br />DOI: 10.1126/science.1165620<br /><br />
  75. 75. Acknowledgments<br />Vangelis Vassiliadis (Thea)<br />Stephen Veitch (intervaldb)<br />Christoph Draxler (sql_compiler)<br />Jan Wielemaker + SWI Mail list<br />Paulo Moura<br />Vítor Santos Costa + Yap developers<br />Terrance Swift + XSB developers<br />