Experiences with logic programming in bioinformatics
Invited talk at ICLP 2009

  • data: curse and a blessing
  • typically download flat files of data and manually integrate. References: L. Stein (1996), How Perl saved the human genome project, The Perl Journal 1(0001); L. Stein (2002), Creating a bioinformatics nation, Nature 417(6885): 119-120; L. D. Stein (2003), Integrating biological databases, Nature Reviews Genetics 4(5): 337-345.
  • Damascene conversion
  • Reference: Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A., Dagdigian, C., Fuellen, G., Gilbert, J. G., Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C. J., Osborne, B. I., Pocock, M. R., Schattner, P., Senger, M., Stein, L. D., Stupka, E., Wilkinson, M. D. and Birney, E. (2002), The Bioperl toolkit: Perl modules for the life sciences, Genome Res 12(10): 1611-8.
  • The empty extension problem
  • extensional is a macro for dynamic + multifile. It also asserts facts in the metamodel module, allowing introspection, saving etc. Still some repetition. pldoc comments: harder to extract metamodel info; more typing would be good. Metamodel directives don’t do much: graceful failure when no data, i/o.
  • Same code for both in-memory and RDB: very powerful. Non-recursive clauses only. Choose when to swap out the prolog store and use the RDB. With recursive clauses, can choose to bind only the fact predicates.
  • The term ‘junk DNA’ is outdated
  • Pax6 is a master regulator. Shared ancestor .5bn years ago. Fly eyes vastly different.
  • Transcript

    • 1. Experiences using logic programming in bioinformatics
      Chris Mungall
      Berkeley Bioinformatics and Ontologies Group
      http://berkeleybop.org
      Lawrence Berkeley National Laboratory
      ICLP 2009
    • 2. Outline
      Biology and biological data integration: a brief introduction
      Obol: First experiences applying LP
      Blipkit: a reusable bioinformatics developer’s toolkit
      Modular structure
      I/O and relational database connectivity
      Some applications of Blipkit and LP
      Genes and genomics
      Phenotype matching
      Web applications
      Conclusions
      Where next? Some recommendations for the LP community
    • 3. The promise and challenges of biological research
      Why study biological systems?
      Because they’re fascinating
      Improve health
      Improve the environment
      BUT: Biology is hard
      Biological systems are extremely diverse
      Biology deals with phenomena at multiple levels of granularity
      There is a deluge of data
      Bioinformatics
      Biology as an information science
      Computational methods vital to understanding
    • 4. Diversity of biological systems
    • 5. Biology in the small: Molecules
      [Figure labels: DNA, RNA, pseudoknot]
    • 6. Cells and organismal biology
      [Figure labels: gastrulation (blastula, gastrula); neuron (axon terminal, dendrite, node of Ranvier, soma, axon, Schwann cell, cell nucleus, myelin sheath)]
    • 7. Ecosystems
    • 8. bio-databases
      1200 Biological Databases published in Nucleic Acids Research
      many more unpublished
      many of these are database federations (e.g. Ensembl)
      Heterogeneous systems
      Storage mechanism:
      Relational
      XML
      Flat files
      Ad-hoc, semi-structured, natural language
      Limited APIs
      lack of standards
      limited query expressivity
      Poorly integrated
      Limited integration beyond identifier cross-references
      Users must manually integrate
      Bioinformatics runs on perl glue
      [Figure labels: metabolic pathways, mutants, genes, fruit flies, tumors]
    • 9. Data interrogation and discovery
      Sample of tasks
      Find mutations in regions upstream of neurotransmitter-producing genes
      Find drug targets or animal models for neurodegenerative diseases
      What biological pathways are enriched in high acidity environments?
      Answering each of these is difficult
      Manual aggregation from lots of databases
      Various kinds of inference required
    • 10. OBO: Open Biological Ontologies
      [Figure labels: small, large]
    • 11. Obol: First experience with LP in bioinformatics
      Problem
      Many existing bio-ontologies were in fact more like terminologies
      Basic axioms, is_a hierarchies
      Deeper logical structure implicit in terms
      Long noun phrases, recursively composed
      “regulation of transcription during G1 phase of mitotic cell cycle”
      Existing solutions (2004)
      Take advantage of semi-controlled syntax of terms
      Parse using ad-hoc regular expressions
      Influence of perl in bioinformatics!
      But context-free grammars (at least) were required
    • 12. A better solution: Definite Clause Grammars
      Obol: A collection of domain specific DCGs
      Significant improvement over Perl regular expressions
      Declarative
      More expressive
      Integration with simple reasoning
      Bi-directional:
      can be used for term generation from logical expressions
    • 13. Example process grammar
      process(P) regulation(P) | specification(P) | transcription(P) | ...
      process(P and during(W)) process(P),[during],process(W).
      process(P andpart_of(W)) process(P),[of],process(W).
      regulation( regulates(P) )  [regulation,of],process(P).
      specification( specifies(C) )  [specification, of], cell(C).
      cell(C and part_of(O)) ogan(O),cell(C).
      “regulation of transcription during G1 phase of mitotic cell cycle”
      
      regulates(transcription) and during(g1_phase and part_of(mitosis))
      “regulation of transcription from RNA polymerase II promoter involved in ventral spinal cord interneuron specification”
      
      regulates(transcription and has_signal(rna_pol_ii))
      and part_of(specifies(interneuron and part_of(ventral_spinal_cord)))
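      The first example can be reproduced with a very small parser. The sketch below is a Python recursive-descent stand-in, not Obol and not a DCG: the lexicon, the connective table and all function names are hypothetical, and "regulation of" is assumed to bind only to the next atomic term so that the result matches the slide.

```python
# Hypothetical lexicon: multi-word phrases mapped to class names.
LEXICON = {
    ("transcription",): "transcription",
    ("g1", "phase"): "g1_phase",
    ("mitotic", "cell", "cycle"): "mitosis",
}
# Connective words and the relation each one introduces.
CONNECTIVES = {"during": "during", "of": "part_of"}


def parse_atom(tokens, i):
    """Match the longest lexicon phrase starting at position i."""
    for n in range(len(tokens) - i, 0, -1):
        key = tuple(tokens[i:i + n])
        if key in LEXICON:
            return LEXICON[key], i + n
    raise ValueError(f"no lexicon entry at position {i}")


def parse_primary(tokens, i):
    """primary := 'regulation' 'of' primary | atom"""
    if tokens[i:i + 2] == ["regulation", "of"]:
        inner, j = parse_primary(tokens, i + 2)
        return f"regulates({inner})", j
    return parse_atom(tokens, i)


def parse_process(tokens, i=0):
    """process := primary [ connective process ]"""
    left, j = parse_primary(tokens, i)
    if j < len(tokens) and tokens[j] in CONNECTIVES:
        right, k = parse_process(tokens, j + 1)
        return f"{left} and {CONNECTIVES[tokens[j]]}({right})", k
    return left, j


def parse(term):
    """Parse a whole term name into a logical expression string."""
    tokens = term.lower().split()
    expr, end = parse_process(tokens)
    if end != len(tokens):
        raise ValueError("trailing tokens")
    return expr


print(parse("regulation of transcription during G1 phase of mitotic cell cycle"))
```

      Unlike a DCG, this sketch runs in one direction only and commits to a single reading of each term.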
    • 14. Implementations
      Obol v1 : 2005
      XSB
      DCGs + tabling (Earley-style chart parsing)
      Basic ontology reasoning (tabling to avoid cycles)
      Integration into java editing environment (XSB interprolog)
      Obol v2 : 2006
      Port to SWI-Prolog
      Web interface
      Earley algorithm implementation
      Backward chaining for simple reasoning
      Forward chaining for full reasoning
      Obol v2.5 : 2007
      Reversion to plain DCGs
      careful construction to avoid cycles
      http://wiki.geneontology.org/index.php/Obol
    • 20. Results
      Obol grammars applied successfully to generate axioms for multiple ontologies
      particularly the Gene Ontology
      Still used frequently
      Lessons learned
      Small amount of basic LP goes a long way
      LP techniques not widely known in bioinformatics
      Different LP systems have different strengths
      Choosing between them is hard – and frustrating
    • 21. Could LP prove as successful in the wider bioinformatics arena?
      Rule-based analysis pipelines
      prolog > make
      Integration of ontology reasoning and database queries
      prolog > datalog > sql
      Pathways
      graphs, ASP
      Genomics
      Linear transformations, CLP
      Phylogenetics
      operations on trees
    • 22. Toolkit Paradigm: BioPerl
      http://www.bioperl.org/
      Established 1990s
      Collaborative
      Open Source, svn repository
      No funding, all voluntary
      Modular
      Namespaces
      Interrelated
      Separation of I/O from models
      Parsers
      Writers
      SQL database bindings
      Publication:
      The BioPerl toolkit, Stajich et al, Genome Research 2002
      1044 citations (google scholar)
    • blipkit: biological programming toolkit
      A general purpose reusable library
      Takes care of ‘plumbing’ – parsing, writing, interface
      Deductive database + application framework
      Highly modular: one package per domain
      ontologies
      genomes
      structures
      phylogeny and evolution
      phenotypes
      systems biology
      SWI-Prolog specific
      http://blipkit.org
    • 33. Anatomy of a blip domain package
      Model(s) of the domain
      dependencies to other domain modules
      extensional and intensional predicates
      I/O
      parsers/writers for small subset of bioinformatics file formats
      DCGs or external perl
      translators for common XML schemas
      Native prolog serialization of model ‘for free’
      Web UI
      Bridges
      Relational
      Other prolog models
      Ontology models
    • 34. Domain model modules
      A model consists of extensional + intensional predicates
      Extensional predicates
      Unit clauses / facts - Asserted and/or compiled from fact files
      Akin to relational tables
      Intensional predicates
      Declarative: No I/O side effects
      Prolog has no built in extensional/intensional distinction
      All clauses treated equally
      Facts conventionally declared dynamic/1 and multifile/1
      Some metamodeling is useful
      Easy to roll own
      A standard metamodel module would be useful
      optional type system + relational DDL style constraints
      Works as documentation
    • 35. Example from systems biology model
      :- module(sb_db,[ reaction_product/2, reaction_reactant/2, reaction_modifier/2, derivation_link/3, …]).
      :- use_module(bio(dbmeta)). % metamodel
      %% reaction_product(?R,?P) is nondet
      % relation between a biochemical reaction and a molecular constituent produced in the reaction
      :- extensional(reaction_product/2).
      %% reaction_reactant(?R,?P) is nondet
      % relation between a biochemical reaction and a molecular constituent that is consumed in the reaction
      :- extensional(reaction_reactant/2).
      %% reaction_modifier(?R,?P) is nondet
      % relation between a biochemical reaction and a molecular constituent that plays a role in the process but is unmodified
      :- extensional(reaction_modifier/2).
      % --- INTENSIONAL PREDICATES ---
      %% derivation_link(?Input,?Output,?Via)
      % two species directly linked via a connecting
      % reaction (excludes modifiers)
      derivation_link(Input,Output,R):-
        reaction_reactant(R,Input),
        reaction_product(R,Output).
      %...[snip]…
    • 36. Integrating with relational databases
      Most biological data stored in relational databases
      Many provide open SQL ports for distributed queries
      RDBs scale well with large quantities of data
      …but RDBs lack necessary deductive capabilities
      Expressivity Hierarchy
      FOL
      Pure prolog
      Datalog
      Relational Model
      Using prolog with RDBs should be easy… right?
    • 37. sql_compiler
      Given a mapping to a relational schema:
      rewrites prolog terms as SQL queries
      Used in conjunction with db connectivity module
      History
      Draxler, 1992
      Source forked, modified versions available with various prologs
      Blip includes extensions to
      Rewrite sub-optimal queries
      Rewrite non-recursive prolog clauses
      Integrate with SWI ODBC
    • 38. Example query rewriting
      program:
      derivation_link(Input,Output,R):-
        reaction_reactant(R,Input),
        reaction_product(R,Output).
      schema metadata:
      relation(reac_in,2).
      attribute(1,reac_in,reac_id,int).
      attribute(2,reac_in,input_id,int).
      relation(reac_out,2).
      attribute(1,reac_out,reac_id,int).
      attribute(2,reac_out,output_id,int).
      mapping:
      reaction_reactant(R,P) <- reac_in(R,P).
      reaction_product(R,P) <- reac_out(R,P).
      program rewriting:
      ?- sqlbind(sb_db:all, mydb).
      call goal:
      ?- derivation_link(X,Y,R).
      query rewriting (SQL sent via odbc.pl):
      SELECT
        reac_in.reac_id,
        reac_in.input_id,
        reac_out.output_id
      FROM reac_in, reac_out
      WHERE reac_in.reac_id=reac_out.reac_id;
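      To illustrate the idea that one rule can run against in-memory facts or against a relational store, here is a minimal Python sketch using the standard sqlite3 module. It is not sql_compiler: the SQL is written by hand rather than generated, the fact values are invented, and output_id is read from reac_out, the table that declares that column in the schema metadata.

```python
import sqlite3

# Hypothetical facts: (reaction id, species id).
REACTANTS = [(1, 10), (1, 11), (2, 12)]
PRODUCTS = [(1, 20), (2, 21), (3, 22)]


def derivation_links_in_memory(reactants, products):
    """derivation_link(Input, Output, R) evaluated over in-memory facts."""
    return sorted(
        (inp, out, r)
        for (r, inp) in reactants
        for (r2, out) in products
        if r == r2
    )


def derivation_links_sql(conn):
    """The same rule expressed as a join over reac_in and reac_out."""
    rows = conn.execute(
        "SELECT reac_in.input_id, reac_out.output_id, reac_in.reac_id "
        "FROM reac_in, reac_out "
        "WHERE reac_in.reac_id = reac_out.reac_id"
    ).fetchall()
    return sorted(rows)


# Load the same facts into an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reac_in (reac_id INTEGER, input_id INTEGER)")
conn.execute("CREATE TABLE reac_out (reac_id INTEGER, output_id INTEGER)")
conn.executemany("INSERT INTO reac_in VALUES (?, ?)", REACTANTS)
conn.executemany("INSERT INTO reac_out VALUES (?, ?)", PRODUCTS)

print(derivation_links_in_memory(REACTANTS, PRODUCTS))
print(derivation_links_sql(conn))
```

      Both calls return the same rows, which is the property the hybrid approach relies on for non-recursive rules.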
    • 39. Obtaining data from web services
      Many large bioinformatics data providers provide RESTful APIs
      NCBI
      caBIG
      SWI libraries used
      http_client
      sgml (for parsing XML payloads)
      XML -> Models
      Direct translation of sgml too low level
      XSLT-inspired prolog template-oriented processing language
      Application:
      ontology enhanced search term expansion
      E.g. “find all genes implicated in neurodegenerative disease”
      expands to: ‘parkinsons’ OR ‘alzheimers’ OR …
    • 40. Applications of Blipkit and LP techniques
      Genomics and DNA sequences
      Deduction of implicit information
      Consistency checking of genome datasets
      Phenotype matching
      Finding similarities of mutational effects
    • 41. Genome inference
      Deluge of genomic data
      Cost per genome decreasing
      Soon we will all know our genome sequence
      But what does it mean?
      Effective use of genomics data relies on deductive inference
      Many rules are logical: genome calculus
      Currently encoded using ad-hoc imperative code
      Probabilistic inference also useful
      But must be built on top of the logical inference
    • 43. DNA
      human chromosome 1:
      247m base pairs, 4220 genes
      Entire genome:
      3x10^9 bp, 20k genes
      [Figure labels: T, A, G, C]
      Gene expression:
      transcription
      splicing
      translation
    • 44. Transcription
      A subsequence of a DNA sequence is transcribed to an RNA sequence
      regulated by sequences called promoters and enhancers
    • 45. Splicing
      Zero or more subsequences (introns) of the RNA
      sequence are spliced out. The remaining sequences
      (exons) are joined together at splice sites.
      • guided by splice site sequences
      • combinatorial possibilities
    • 46. Translation
      [Figure labels: 5’ (upstream) UTR; CDS; 3’ (downstream) UTR; exon 1, exon 2, exon 3]
      A subsequence of the RNA sequence (the Coding Sequence region, CDS) is translated using a genetic translation table.
      • {A,C,G,U}^3 → amino acid
      • Not all RNAs are coding
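      The three steps of gene expression described above (transcription, splicing, translation) can be sketched as operations on strings. This Python example is illustrative only: the DNA sequence and exon coordinates are invented, the DNA is assumed to be the coding strand so transcription reduces to replacing T with U, and the codon table is a small fragment of the standard genetic code.

```python
# Fragment of the standard genetic code; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

# Hypothetical gene: exon 1, an intron, then exon 2.
DNA = "ATGTTT" + "GTCCAG" + "GGCAAATAA"
EXONS = [(0, 6), (12, 21)]  # half-open (start, end) offsets into DNA


def transcribe(dna):
    """Coding-strand DNA to RNA: replace T with U."""
    return dna.replace("T", "U")


def splice(rna, exons):
    """Join the exon subsequences, dropping everything in between."""
    return "".join(rna[start:end] for start, end in exons)


def translate(cds):
    """Read codons in frame 0 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]  # KeyError if codon not in the fragment
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = splice(transcribe(DNA), EXONS)
print(mrna, translate(mrna))
```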
    • Formalization of gene expression
      Genome calculus
      operations on linear sequences
      subsequence, join, translate
      Certain sequence types are entailed by other sequences
      Calculus is surprisingly conserved across all life
      but biology is fuzzy and full of exceptions
      Archaea utilize different translation table
      Nematodes add trans-splicing
      Mammalian introns are huge
      Many genes are co-transcribed
      Viral genes overlap in different translation frames…
    • 47. Genomics databases
      Genome databases are important for
      biomedicine
      understanding evolution at a molecular level
      Problem: genome databases are incomplete
      stating all implicit features leads to redundancy
      integration and complex queries difficult
      ad-hoc rules embedded in imperative code
      Problem: genome databases are inconsistent
      Different interpretation of gene, exon, UTR etc
    • 48. Solution: Sequence Ontology + Deductive Database
      The Sequence Ontology standardizes sequence terms
      Additional axioms are being added
      Encoding genome calculus
      Genome relations based on Allen Interval Algebra
      Can be used in conjunction with a deductive genome database
      consistency checking
      does this genome dataset make sense?
      inference and querying
      what entities are present in region X?
    • 49. Sequence relationship predicates based on Allen Interval Algebra
      no recursion
      conjunction of binary terms
      uses arithmetic (for efficiency)
      Extensions:
      strands
      circular genomes
      upstream_of(X,Y) :-
      has_end(X,XE),
      has_start(Y,YS),
      XE < YS.
      ?- upstream_of(X,exon3).
      X=exon1 ;
      X=exon2
      [Figure labels: exon1, exon2, exon3, exon4, exon5]
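      The upstream_of rule is a plain arithmetic comparison of interval endpoints, which the following Python sketch mirrors. The exon coordinates are invented for illustration, and strands and circular genomes are ignored.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int
    end: int


def upstream_of(x, y):
    """x is upstream of y when x ends before y starts."""
    return x.end < y.start


# Hypothetical coordinates for five exons on one forward-strand sequence.
EXONS = [
    Feature("exon1", 100, 200),
    Feature("exon2", 300, 400),
    Feature("exon3", 500, 600),
    Feature("exon4", 700, 800),
    Feature("exon5", 900, 1000),
]


def upstream_features(target, features):
    """Names of all features that lie upstream of target."""
    return [f.name for f in features if upstream_of(f, target)]


print(upstream_features(EXONS[2], EXONS))
```

      With these coordinates, exon1 and exon2 come out as the features upstream of exon3.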
    • 50. Intron-exon inference
      intron( i(T,S,E) ) :-
      exon(X1),
      exon(X2),
      has_end(X1,S,T),
      has_start(X2,E,T),
      \+ ((exon(X3),
      contained_by(X3,T),
      starts_after_start_of(X3,X1),
      ends_before_end_of(X3,X2))).
      • function terms as arguments
      • possibility of recursion through negation
    • 51. Intron-exon inference: example
      exon(exon1). exon(exon2).
      has_end(exon1,1000,t1).
      has_start(exon2,2000,t1).
      ?- intron(X).
      X = i(t1,1000,2000)
      [Figure labels: t1, exon1, exon2]
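      The intron rule says: an intron spans from the end of one exon to the start of another exon on the same transcript, provided no third exon lies between them. A minimal Python sketch of that rule follows. The exon data are invented, and the check that the first exon ends before the second starts is an added assumption that the slide's rule leaves implicit.

```python
def infer_introns(exons):
    """exons: list of (transcript, start, end) tuples.

    Returns (transcript, intron_start, intron_end) for every pair of
    exons on one transcript with no other exon in between.
    """
    introns = []
    for (t1, s1, e1) in exons:
        for (t2, s2, e2) in exons:
            # Assumption: both exons share a transcript and the first ends before the second starts.
            if t1 != t2 or e1 >= s2:
                continue
            # Negation as failure: look for a third exon that starts after
            # the first exon starts and ends before the second exon ends.
            has_exon_between = any(
                t3 == t1 and s3 > s1 and e3 < e2
                for (t3, s3, e3) in exons
            )
            if not has_exon_between:
                introns.append((t1, e1, s2))
    return introns


# Hypothetical transcript t1 with three exons.
EXONS = [("t1", 500, 1000), ("t1", 2000, 2500), ("t1", 3000, 3500)]
print(infer_introns(EXONS))
```

      The pair formed by the first and third exon is rejected because the second exon sits between them, so only two introns are inferred.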
    • 52. OWL implementation
      Many axioms cannot be expressed in OWL
      Interval relations – no arithmetic in OWL
      option 1: use SWRL
      option 2: enumerate all base pairs and use property chain axioms
      Cannot infer properties of unnamed individuals
      E.g. introns from exons
      Cyclic structures cannot be described
      Requires Description Graph extension
      Open World Assumption
      useful for semantic web
      CWA is more convenient for genomics
    • 53. Deductive database implementation
      Methods:
      Convert sequence ontology OWL->DLP via Thea2
      Manually edit
      Add rules that cannot be expressed in OWL
      Tested on XSB and Yap
      requires tabling
      Results
      Currently scales to small regions
      more debugging required
      difficult to eliminate unstratified negation
    • 54. Disjunctive datalog implementation
      Adds:
      Constraints
      Disjunctions in rule heads
      Implementation
      DLV-Complex : allows functions in arguments
      Program written from scratch: Rules must be ‘safe’
      Results
      Scales over small regions
      Useful for detecting inconsistencies in data
      More research needed
      More efficient programs
      Use of relational database backend
      Further exploration of ASP semantics
      Genomic rules have many exceptions
    • 55. Prolog implementation
      Removes:
      rules that cause cycles with backtracking
      Implementation
      Optional use of Nested Containment List library (C + SWI FLI)
      Results
      Results can be incomplete due to missing rules
      E.g. intron :- exon, but not exon :- intron
      Ruleset can be tailored for dataset
      Scales over medium sized datasets
    • 56. Hybrid Prolog-Relational implementation
      Uses same program as prolog implementation
      Relational database store facts (extensional)
      can be distributed
      Uses sql_compiler + mappings to genomics databases
      Ensembl
      Chado
      Non-recursive prolog rules dynamically translated to complex SQL
      Recursive subclass rules translated
      by query compiler using UNIONs
      precomputed and stored in relational database
      Scales to full genomes
    • 57. LP for genomics: conclusions
      No one paradigm is perfect
      Many axioms cannot be expressed in OWL
      but tools are good
      Disjunctive Datalog good for consistency checking in small regions
      More research required on efficiency of tabling solution, ASPs
      WAM solution most efficient
      Manually rewriting programs is tedious!
      Hybrid solutions useful
      RDBs for asserted facts
    • 58. Application: match.com for diseases
      Organisms have phenotypes
      characteristics under the control of the genes of that organism
      Related genes can have similar phenotypic effects
      even when the least common ancestor of the gene is 500m years ago
      Finding these genes can help understand
      disease
      evolution
    • 59. Application: match.com for diseases
    • 60. Semantic Similarity
      Given a collection of
      features F = {f1, f2, …}
      attributes A = {a1, a2, …}
      feature-attribute mappings:
      a ⊆ F × A, where a(f) is the set of attributes mapped to feature f
      For any feature pair x,y, calculate:
      Jaccard coefficient
      |a(x) ∩ a(y)| / |a(x) ∪ a(y)|
      maximum IC
      IC(a) = -log2 p(a)
      maxIC(x,y) = max{ IC(a) : a ∈ a(x) ∩ a(y) }
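      Both measures are easy to state over plain sets. The Python sketch below uses invented annotations, and it assumes p(a) is the fraction of features annotated with attribute a, since the slide does not define p.

```python
import math

# Hypothetical feature-to-attribute annotations.
ANNOTATIONS = {
    "f1": {"a1", "a2", "a3"},
    "f2": {"a2", "a3", "a4"},
    "f3": {"a3"},
    "f4": {"a4"},
}


def jaccard(x, y):
    """|a(x) ∩ a(y)| / |a(x) ∪ a(y)|"""
    ax, ay = ANNOTATIONS[x], ANNOTATIONS[y]
    union = ax | ay
    return len(ax & ay) / len(union) if union else 0.0


def information_content(a):
    """IC(a) = -log2 p(a), with p(a) assumed to be the annotation frequency of a."""
    p = sum(a in attrs for attrs in ANNOTATIONS.values()) / len(ANNOTATIONS)
    return -math.log2(p)


def max_ic(x, y):
    """Highest IC among the attributes shared by x and y (0.0 if none are shared)."""
    shared = ANNOTATIONS[x] & ANNOTATIONS[y]
    return max((information_content(a) for a in shared), default=0.0)


print(jaccard("f1", "f2"), max_ic("f1", "f2"))
```

      Here f1 and f2 share a2 and a3; a2 is the rarer of the two, so it determines the maximum IC.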
    • 61. SWI-Prolog implementation
      Uses GMP
      normal prolog programs have unbounded integer arithmetic
      allows fast bitwise implementations of set intersection/union
      Encode feature attribute lists as integers
      m : A → {0, .., |A|-1}
      ai(f) = ∑ 2^m(a) over a ∈ a(f)
      Set intersection and union computed using bitwise and/or
      Fast implementation of Jaccard coefficient
      J is popcount(A1 /\ A2) / popcount(A1 \/ A2)
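      The integer encoding can be tried out in Python, whose built-in integers are arbitrary precision and so play the role that GMP plays in SWI-Prolog. This is a sketch with invented attribute names, not the blipkit code; each attribute gets one bit position, and set operations become bitwise operations.

```python
def build_index(attribute_sets):
    """m : A -> {0, .., |A|-1}, assigning one bit position per attribute."""
    attributes = sorted(set().union(*attribute_sets))
    return {a: i for i, a in enumerate(attributes)}


def encode(attrs, index):
    """ai(f): sum of 2^m(a) over the attributes of a feature."""
    value = 0
    for a in attrs:
        value |= 1 << index[a]
    return value


def popcount(n):
    """Number of 1 bits in a non-negative integer."""
    return bin(n).count("1")


def jaccard_bits(x, y):
    """Jaccard coefficient on two bit-vector encoded attribute sets."""
    union = popcount(x | y)
    return popcount(x & y) / union if union else 0.0


# Hypothetical attribute sets for two features.
S1 = {"a1", "a2", "a3"}
S2 = {"a2", "a3", "a4"}
INDEX = build_index([S1, S2])
V1, V2 = encode(S1, INDEX), encode(S2, INDEX)
print(V1, V2, jaccard_bits(V1, V2))
```

      The bitwise result agrees with the set-based definition: two shared attributes out of four in total.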
    • 62. Similarity metrics + reasoning
      Attributes are description logic class expressions
      rarely exact matches across species
      a(human1) = { dystrophic ∩ ∃quality_of.arm_muscle }
      a(zebrafish7) = { atrophied ∩ ∃quality_of.pectoral_fin_muscle }
      a(human1) ∩ a(zebrafish7) = {}
    • 63. Use reasoning to find subsumer
      Find Least Common Ancestor expression
      typically class expression, not named class
      a(human1) = { dystrophic ∩ ∃quality_of.arm_muscle }
      a(zebrafish7) = { atrophied ∩ ∃quality_of.pectoral_fin_muscle }
      Least Common Ancestor: decreased_size ∩ ∃quality_of.muscle_of_upper_limb
      a*(human1) ∩ a*(zebrafish7) = { decreased_size ∩ ∃quality_of.muscle_of_upper_limb }
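      The effect of closing attribute sets under subsumption can be shown with a toy example. This Python sketch is not Thea2 and does no description logic reasoning: each attribute is simplified to a (quality, entity) pair, the is_a links are a hypothetical single-parent fragment chosen to match the slide, and a pair subsumes another when both of its components are ancestors of the other's components.

```python
# Hypothetical single-parent is_a fragment.
IS_A = {
    "dystrophic": "decreased_size",
    "atrophied": "decreased_size",
    "arm_muscle": "muscle_of_upper_limb",
    "pectoral_fin_muscle": "muscle_of_upper_limb",
}


def ancestors(cls):
    """The class itself followed by all of its superclasses."""
    chain = [cls]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def subsumers(attr):
    """All (quality, entity) pairs that subsume the given pair."""
    quality, entity = attr
    return {(q, e) for q in ancestors(quality) for e in ancestors(entity)}


def closure(attrs):
    """a*(f): the attribute set extended with every subsumer."""
    return set().union(*(subsumers(a) for a in attrs))


def least_common_subsumers(x_attrs, y_attrs):
    """Most specific expressions found in both closures."""
    common = closure(x_attrs) & closure(y_attrs)
    return {
        c for c in common
        if not any(d != c and c in subsumers(d) for d in common)
    }


HUMAN1 = {("dystrophic", "arm_muscle")}
ZEBRAFISH7 = {("atrophied", "pectoral_fin_muscle")}
print(HUMAN1 & ZEBRAFISH7)
print(least_common_subsumers(HUMAN1, ZEBRAFISH7))
```

      The raw intersection is empty, while the closures meet at the pair (decreased_size, muscle_of_upper_limb).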
    • 64. Implementation: Uses Thea2
      Thea2 is a prolog package for OWL2
      http://github.com/vangelisv/thea
      reads/writes
      RDF/XML
      OWL-XML
      Native prolog form
      Description Logic Programs (DLPs)
      Reasoning strategies
      Prolog
      DL reasoners (via JPL/OWLAPI)
      SQL DB + forward chaining
    • 65. Phenotype matching: Results
      Proof of concept on 10 human disease genes
      publication forthcoming
      Currently applying to neurodegenerative diseases
      Funding to extend to all Mendelian diseases
    • 66. Web Applications
      http://berkeleybop.org/obo
      Web interface to Open Bio Ontologies
      Implemented in perl + SWI-Prolog
      Prototype for future development
      SWI-Prolog
      Production version in perl and/or java
    • 67. Experiences using LP for bioinformatics: conclusions
      A little bit of LP goes a long way
      The space between theory and application is largely untapped
      A variety of LP paradigms are useful
      ASP, datalog, DLs, prolog, ILP, …
      Interoperation can be hard!
      LP for ‘real world’ applications
      It is possible!
      Declarative approach arguably superior
      Web/database applications are a sweet spot
      We need to show more success stories
      ..and to dispel myths
    • 68. Recommendation: make it easier for users
      Documentation:
      Unify community knowledge in a single wiki
      Create a general LP mail list
      c.f. OWL/SemWeb community
      Tools:
      Program analysis
      Lint-like tool for tabled prologs, ASP
      Visualization
      Libraries
      CPAN for Prolog
    • 69. Recommendation: make it open-source
      Why
      Encourages collaboration
      Bioinformaticians love open source
      The people who fund bioinformaticians love open source
      Open source can still generate revenue
      How
      Deposit code in open source code repositories
      github, sourceforge, googlecode, etc
      Embrace Web 2.0
      blog it, put it on a wiki
    • 70. Recommendation: interoperate with RDBs
      Why?
      RDBs and LP should be a natural match
      Application developers are conservative and familiar with RDBs
      lightweight in-memory embedded RDBs are becoming more popular
      How:
      Hide LP systems behind pseudo-SQL interface
      SQL queries and DDL translated behind the scenes, cf. sql_compiler
      Users can use native LP syntax and semantics as they feel comfortable
      Embed LP systems directly in RDBs
      E.g. PostgreSQL extensions
      Improve prolog->SQL interfaces
      Common API c.f. JDBC (Java), DBI (Perl)
    • 71. Recommendation: A unified API to all LP systems
      Use case:
      calling LP system from host language (java, perl, ruby, even other prolog)
      Problem:
      No standardization amongst APIs
      Analogous problem:
      RDB APIs
      Solved: a 20th century problem
      Proposal:
      Common REST interface
      Single interface per host language
    • 72. Interoperation between LP systems
      LP systems (ILP, ASP, Prolog, …) differ in whether they accept:
      Foo(x).
      'Foo'(x).
      'foobar'(x).
      foo('xy').
      foo("xy").
      Non-prolog systems should:
      Adhere to ISO standard for intersection with pure prolog
      Or at least provide ISO mode
      Also:
      ISO Common Logic
      W3C RIF
    • 73. Future directions
      Scalable LP
      Probabilistic + logic modeling
      CLP(Bayes)
      PRISM
    • 74. Robot scientist
      The Automation of Science
      King et al.
      Science 3 April 2009: 85-89
      DOI: 10.1126/science.1165620
      http://news.bbc.co.uk/2/hi/science/nature/7979113.stm
    • 75. Acknowledgments
      Vangelis Vassiliadis (Thea)
      Stephen Veitch (intervaldb)
      Christoph Draxler (sql_compiler)
      Jan Wielemaker + SWI Mail list
      Paulo Moura
      Vítor Santos Costa + Yap developers
      Terrance Swift + XSB developers