Experiences applying logic programming in bioinformatics

Experiences using logic programming in bioinformatics
Chris Mungall
Berkeley Bioinformatics and Ontologies Group (http://berkeleybop.org)
Lawrence Berkeley National Laboratory
ICLP 2009

Outline
- Biology and biological data integration: a brief introduction
- Obol: first experiences applying LP
- Blipkit: a reusable bioinformatics developer's toolkit
  - Modular structure
  - I/O and relational database connectivity
- Some applications of Blipkit and LP
  - Genes and genomics
  - Phenotype matching
  - Web applications
- Conclusions
  - Where next?
  - Some recommendations for the LP community

The promise and challenges of biological research
- Why study biological systems?
  - Because they're fascinating
  - Improve health
  - Improve the environment
- BUT: biology is hard
  - Biological systems are extremely diverse
  - Biology deals with phenomena at multiple levels of granularity
  - There is a deluge of data
- Bioinformatics
  - Biology as an information science
  - Computational methods are vital to understanding

Diversity of biological systems

Biology in the small: molecules
[Figure labels: DNA; RNA pseudoknot]

Cells and organismal biology
[Figure labels: gastrulation (blastula, gastrula); neuron (dendrite, soma, cell nucleus, axon, myelin sheath, Schwann cell, node of Ranvier, axon terminal)]

Bio-databases
- 1200 biological databases published in Nucleic Acids Research
  - many more unpublished
  - many of these are database federations (e.g. Ensembl)
- Heterogeneous systems
  - Storage mechanisms: relational; XML; flat files; ad hoc, semi-structured, natural language
  - Limited APIs: lack of standards, limited query expressivity
- Poorly integrated
  - Limited integration beyond identifier cross-references
  - Users must integrate manually
  - Bioinformatics runs on perl glue
[Figure labels: metabolic pathways, mutants, genes, fruit flies, tumors]

Data interrogation and discovery
- Sample of tasks
  - Find mutations in regions upstream of neurotransmitter-producing genes
  - Find drug targets or animal models for neurodegenerative diseases
  - What biological pathways are enriched in high-acidity environments?
- Answering each of these is difficult
  - Manual aggregation from many databases
  - Various kinds of inference required

OBO: Open Biological Ontologies
[Figure labels: small, large]

Obol: first experience with LP in bioinformatics
- Problem
  - Many existing bio-ontologies were in fact more like terminologies
  - Basic axioms, is_a hierarchies
  - Deeper logical structure implicit in terms
  - Long noun phrases, recursively composed: "regulation of transcription during G1 phase of mitotic cell cycle"
- Existing solutions (2004)
  - Take advantage of the semi-controlled syntax of terms
  - Parse using ad hoc regular expressions
  - Influence of perl in bioinformatics!
  - But context-free grammars (at least) were required

A better solution: Definite Clause Grammars
- Obol: a collection of domain-specific DCGs
- Significant improvement over Perl regular expressions
  - Declarative
  - More expressive
  - Integration with simple reasoning
  - Bi-directional: can be used for term generation from logical expressions

Example process grammar

    process(P) --> regulation(P) | specification(P) | transcription(P) | ...
    process(P and during(W)) --> process(P), [during], process(W).
    process(P and part_of(W)) --> process(P), [of], process(W).
    regulation(regulates(P)) --> [regulation,of], process(P).
    specification(specifies(C)) --> [specification,of], cell(C).
    cell(C and part_of(O)) --> organ(O), cell(C).

Example parses
- "regulation of transcription during G1 phase of mitotic cell cycle"
    regulates(transcription) and during(g1_phase and part_of(mitosis))
- "regulation of transcription from RNA polymerase II promoter involved in ventral spinal cord interneuron specification"
    regulates(transcription and has_signal(rna_pol_ii)) and part_of(specifies(interneuron and part_of(ventral_spinal_cord)))
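The grammar idea can be tried out with a much smaller DCG. What follows is a minimal sketch for SWI-Prolog, not the Obol grammar itself: the nonterminals `core_process//1`, `phase//1`, `phase_name//1` and `whole_process//1`, the `and` operator priority, and the tiny vocabulary are all assumptions made for illustration. Unlike the slide's grammar it avoids left recursion, so it runs under plain depth-first resolution without tabling or chart parsing.

```prolog
% Assumed operator declaration so that terms like "X and Y" can be written infix.
:- op(400, xfy, and).

% process(?Expr)//: a core process, optionally qualified by a "during" phrase.
process(P and during(W)) --> core_process(P), [during], phase(W).
process(P)               --> core_process(P).

% core_process(?Expr)//: right-recursive, so no left recursion.
core_process(regulates(P))  --> [regulation, of], core_process(P).
core_process(transcription) --> [transcription].

% phase(?Expr)//: a named phase, optionally part of a larger process.
phase(Ph and part_of(W)) --> phase_name(Ph), [of], whole_process(W).
phase(Ph)                --> phase_name(Ph).

% Hypothetical two-entry vocabulary.
phase_name(g1_phase)   --> [g1, phase].
whole_process(mitosis) --> [mitotic, cell, cycle].

% Parsing a token list into a logical expression:
% ?- phrase(process(E), [regulation,of,transcription,during,
%                        g1,phase,of,mitotic,cell,cycle]).
% E = regulates(transcription) and during(g1_phase and part_of(mitosis)).
%
% Generating a token list from a logical expression:
% ?- phrase(process(regulates(transcription) and during(g1_phase)), Ts).
% Ts = [regulation, of, transcription, during, g1, phase].
```

Because every rule decomposes its argument in the clause head, the same clauses terminate in both directions, which is the bi-directionality the slides mention.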

Implementations
- Obol v1: 2005
  - XSB
  - DCGs + tabling
  - Earley / chart parsing
  - Basic ontology reasoning (tabling to avoid cycles)
  - Integration into java editing environment (XSB interprolog)
- Obol v2: 2006
  - Port to SWI-Prolog
  - Web interface
  - Earley algorithm implementation
  - Backward chaining for simple reasoning
  - Forward chaining for full reasoning
- Obol v2.5: 2007
  - Reversion to plain DCGs
  - careful construction to avoid cycles
  - Built on Thea2

http://wiki.geneontology.org/index.php/Obol

Results
- Obol grammars applied successfully to generate axioms for multiple ontologies, particularly the Gene Ontology
- Still used frequently

Lessons learned
- A small amount of basic LP goes a long way
- LP techniques are not widely known in bioinformatics
- Different LP systems have different strengths
- Choosing between them is hard, and frustrating

Could LP prove as successful in the wider bioinformatics arena?
- Rule-based analysis pipelines: prolog > make
- Integration of ontology reasoning and database queries: prolog > datalog > sql
- Pathways: graphs, ASP
- Genomics: linear transformations, CLP
- Phylogenetics: operations on trees

Toolkit paradigm: BioPerl
- http://www.bioperl.org/
- Established 1990s
- Collaborative
  - Open source, svn repository
  - No funding, all voluntary
- Modular
  - Namespaces
  - Interrelated
- Separation of I/O from models
  - Parsers
  - Writers
  - SQL database bindings
- Publication: The BioPerl toolkit, Stajich et al, Genome Research 2002
  - 1044 citations (Google Scholar)

open bioinformatics foundation

Anatomy of a blip domain package
- Model(s) of the domain
  - dependencies on other domain modules
  - extensional and intensional predicates
- I/O
  - parsers/writers for a small subset of bioinformatics file formats
  - DCGs or external perl translators for common XML schemas
  - Native prolog serialization of the model 'for free'
- Web UI
- Bridges
  - Relational
  - Other prolog models
  - Ontology models

Domain model modules
- A model consists of extensional + intensional predicates
- Extensional predicates
  - Unit clauses / facts, asserted and/or compiled from fact files
  - Akin to relational tables
- Intensional predicates
  - Declarative: no I/O side effects
- Prolog has no built-in extensional/intensional distinction
  - All clauses are treated equally
  - Facts are conventionally declared dynamic/1 and multifile/1
- Some metamodeling is useful
  - Easy to roll your own
  - A standard metamodel module would be useful
  - Optional type system + relational DDL-style constraints
  - Works as documentation

Example from systems biology model

    :- module(sb_db,[
        reaction_product/2,
        reaction_reactant/2,
        reaction_modifier/2,
        derivation_link/3,
        …]).
    :- use_module(bio(dbmeta)). % metamodel

    %% reaction_product(?R,?P) is nondet
    % relation between a biochemical reaction and a molecular
    % constituent produced in the reaction
    :- extensional(reaction_product/2).

    %% reaction_reactant(?R,?P) is nondet
    % relation between a biochemical reaction and a molecular
    % constituent that is consumed in the reaction
    :- extensional(reaction_reactant/2).

    %% reaction_modifier(?R,?P) is nondet
    % relation between a biochemical reaction and a molecular
    % constituent that plays a role in the process but is unmodified
    :- extensional(reaction_modifier/2).

    % --- INTENSIONAL PREDICATES ---

    %% derivation_link(?Input,?Output,?Via)
    % two species directly linked via a connecting
    % reaction (excludes modifiers)
    derivation_link(Input,Output,R) :-
        reaction_reactant(R,Input),
        reaction_product(R,Output).

    %...[snip]…
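To see the extensional/intensional split at work without the blip libraries, here is a minimal self-contained sketch for SWI-Prolog. It is an illustration under assumptions, not part of the sb_db module: plain `dynamic/1` declarations stand in for the `extensional/1` directive from `bio(dbmeta)`, the facts are a hand-written toy dataset, and `derives/2` is a hypothetical helper added to show a recursive intensional predicate.

```prolog
% Extensional predicates: plain facts, declared dynamic by convention.
:- dynamic reaction_reactant/2, reaction_product/2, reaction_modifier/2.

% Toy dataset (illustrative identifiers only).
reaction_reactant(r1, glucose).
reaction_reactant(r2, glucose_6_phosphate).

reaction_product(r1, glucose_6_phosphate).
reaction_product(r2, fructose_6_phosphate).

reaction_modifier(r1, hexokinase).

% Intensional predicate, as on the slide: two species directly linked
% via a connecting reaction (modifiers are not considered).
derivation_link(Input, Output, R) :-
    reaction_reactant(R, Input),
    reaction_product(R, Output).

% Hypothetical helper: Output is reachable from Input through one or
% more reactions. Plain recursion like this loops on cyclic pathways;
% real pathway data would need tabling or a visited list.
derives(Input, Output) :-
    derivation_link(Input, Output, _).
derives(Input, Output) :-
    derivation_link(Input, Mid, _),
    derives(Mid, Output).

% ?- derivation_link(glucose, Out, Via).
% Out = glucose_6_phosphate, Via = r1.
%
% ?- derives(glucose, X).
% X = glucose_6_phosphate ;
% X = fructose_6_phosphate.
```

Only the facts would change if the data came from a fact file or a database; the intensional rules stay the same.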

Integrating with relational databases
- Most biological data is stored in relational databases
  - Many provide open SQL ports for distributed queries
  - RDBs scale well with large quantities of data
- …but RDBs lack the necessary deductive capabilities
- Expressivity hierarchy: FOL, pure prolog, Datalog, relational model
- Using prolog with RDBs should be easy… right?

sql_compiler
- Given a mapping to a relational schema, rewrites prolog terms as SQL queries
- Used in conjunction with a db connectivity module
- History
  - Draxler, 1992
  - Source forked; modified versions available with various prologs
- Blip includes extensions to
  - Rewrite sub-optimal queries
  - Rewrite non-recursive prolog clauses
  - Integrate with SWI ODBC

Example query rewriting
- Program

    derivation_link(Input,Output,R) :-
        reaction_reactant(R,Input),
        reaction_product(R,Output).

- Mapping

    reaction_reactant(R,P) <- reac_in(R,P).
    reaction_product(R,P) <- reac_out(R,P).

- Schema metadata

    relation(reac_in,2).
    attribute(1,reac_in,reac_id,int).
    attribute(2,reac_in,input_id,int).
    relation(reac_out,2).
    attribute(1,reac_out,reac_id,int).
    attribute(2,reac_out,output_id,int).

- Rewriting the program

    ?- sqlbind(sb_db:all, mydb).

- Calling the goal

    ?- derivation_link(X,Y,R).

- Rewritten query, sent to the database via odbc.pl

    SELECT reac_in.reac_id, reac_in.input_id, reac_out.output_id
    FROM reac_in, reac_out
    WHERE reac_in.reac_id=reac_out.reac_id;

Obtaining data from web services
- Many large bioinformatics data providers offer RESTful APIs
  - NCBI
  - caBIG
- SWI libraries used
  - http_client
  - sgml (for parsing XML payloads)
- XML -> models
  - Direct translation of sgml is too low-level
  - XSLT-inspired, template-oriented prolog processing language
- Application: ontology-enhanced search term expansion
  - E.g. "find all genes implicated in neurodegenerative disease" -> 'parkinsons' OR 'alzheimers' OR …

Applications of Blipkit and LP techniques
- Genomics and DNA sequences
  - Deduction of implicit information
  - Consistency checking of genome datasets
- Phenotype matching
  - Finding similarities of mutational effects

Genome inference
- Deluge of genomic data
  - Cost per genome decreasing
  - Soon we will all know our genome sequence
  - But what does it mean?
- Effective use of genomics data relies on deductive inference
  - Many rules are logical: a genome calculus
  - Currently encoded using ad hoc imperative code
- Probabilistic inference is also useful
  - But it must be built on top of the logical inference

DNA
- Human chromosome 1: 247m base pairs, 4220 genes
- Entire genome: 3 x 10^9 bps, 20k genes
- Bases: T, A, G, C
- Gene expression: transcription, splicing, translation

Transcription
- A subsequence of a DNA sequence is transcribed to an RNA sequence
- Regulated by sequences called promoters and enhancers

Splicing
- Zero or more subsequences (introns) of the RNA sequence are spliced out
- The remaining sequences (exons) are joined together at splice sites

Combinatorial possibilities

Genomics databases
- Genome databases are important for
  - biomedicine
  - understanding evolution at the molecular level
- Problem: genome databases are incomplete
  - stating all implicit features leads to redundancy
  - integration and complex queries are difficult
  - ad hoc rules are embedded in imperative code
- Problem: genome databases are inconsistent
  - Different interpretations of gene, exon, UTR etc.

Solution: Sequence Ontology + deductive database
- The Sequence Ontology standardizes sequence terms
- Additional axioms are being added
  - Encoding a genome calculus
  - Genome relations based on Allen Interval Algebra
- Can be used in conjunction with a deductive genome database
  - consistency checking: does this genome dataset make sense?
  - inference and querying: what entities are present in region X?

Sequence relationship predicates
- Based on Allen Interval Algebra
  - no recursion
  - conjunction of binary terms
  - uses arithmetic (for efficiency)
- Extensions: strands, circular genomes

    upstream_of(X,Y) :-
        has_end(X,XE),
        has_start(Y,YS),
        XE < YS.

    ?- upstream_of(X,exon3).
    X=exon1 ;
    X=exon2

[Figure labels: exon1, exon2, exon3, exon4, exon5]
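The interval predicate above can be run against a handful of facts. This is a small sketch for SWI-Prolog with made-up coordinates: the five exons and their start and end positions are assumptions for illustration, and strands and circular genomes are ignored.

```prolog
% Hypothetical coordinates for five exons on one forward-strand sequence.
has_start(exon1, 100).
has_start(exon2, 300).
has_start(exon3, 500).
has_start(exon4, 700).
has_start(exon5, 900).

has_end(exon1, 200).
has_end(exon2, 400).
has_end(exon3, 600).
has_end(exon4, 800).
has_end(exon5, 1000).

% X is upstream of Y when X ends before Y starts.
% Non-recursive: a conjunction of binary relations plus one comparison.
upstream_of(X, Y) :-
    has_end(X, XE),
    has_start(Y, YS),
    XE < YS.

% ?- upstream_of(X, exon3).
% X = exon1 ;
% X = exon2.
```

Because the rule is a non-recursive conjunction with an arithmetic comparison, it is also the kind of clause that can be translated directly into a single SQL join.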

Intron-exon inference

    intron( i(T,S,E) ) :-
        exon(X1),
        exon(X2),
        has_end(X1,S,T),
        has_start(X2,E,T),
        \+ ((exon(X3),
             contained_by(X3,T),
             starts_after_start_of(X3,X1),
             ends_before_end_of(X3,X2))).

- Possibility of recursion through negation

    exon(exon1).
    exon(exon2).
    has_end(exon1,1000,t1).
    has_start(exon2,2000,t1).

    ?- intron(X).
    X = i(t1,1000,2000)

[Figure labels: t1, exon1, exon2]
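A runnable version of the intron rule needs definitions for its helper predicates, which the slides do not show. The sketch below, for SWI-Prolog, fills them in under stated assumptions: `contained_by/2`, `starts_after_start_of/2` and `ends_before_end_of/2` are hypothetical definitions over transcript-relative coordinates, the three exons are made-up data, and the guard `S < E` is an addition so that only forward-ordered exon pairs are considered once every exon has both a start and an end.

```prolog
% Hypothetical data: three exons on transcript t1.
exon(exon1).
exon(exon2).
exon(exon3).

has_start(exon1, 100,  t1).
has_start(exon2, 2000, t1).
has_start(exon3, 4000, t1).

has_end(exon1, 1000, t1).
has_end(exon2, 3000, t1).
has_end(exon3, 5000, t1).

% Assumed helper definitions (not from the slides).
contained_by(X, T) :-
    has_start(X, _, T).
starts_after_start_of(X, Y) :-
    has_start(X, XS, T), has_start(Y, YS, T), XS > YS.
ends_before_end_of(X, Y) :-
    has_end(X, XE, T), has_end(Y, YE, T), XE < YE.

% An intron spans the gap between two exons of the same transcript,
% provided no other exon lies between them (negation as failure).
intron(i(T, S, E)) :-
    exon(X1),
    exon(X2),
    has_end(X1, S, T),
    has_start(X2, E, T),
    S < E,                               % added guard, see lead-in
    \+ ( exon(X3),
         contained_by(X3, T),
         starts_after_start_of(X3, X1),
         ends_before_end_of(X3, X2) ).

% ?- intron(I).
% I = i(t1, 1000, 2000) ;
% I = i(t1, 3000, 4000).
```

The gap between exon1 and exon3 is not reported because exon2 lies inside it. Here the negated goal depends only on exon facts, so the program is stratified; recursion through negation arises once exons can in turn be inferred from introns.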

OWL implementation
- Many axioms cannot be expressed in OWL
  - Interval relations: no arithmetic in OWL
    - option 1: use SWRL
    - option 2: enumerate all base pairs and use property chain axioms
  - Cannot infer properties of unnamed individuals, e.g. introns from exons
  - Cyclic structures cannot be described; this requires the Description Graph extension
- Open World Assumption
  - useful for the semantic web
  - CWA is more convenient for genomics

Deductive database implementation
- Methods
  - Convert Sequence Ontology OWL -> DLP via Thea2
  - Manually edit
  - Add rules that cannot be expressed in OWL
- Tested on XSB and Yap
  - requires tabling
- Results
  - Currently scales to small regions
  - More debugging required
  - Difficult to eliminate unstratified negation

Disjunctive datalog implementation
- Adds
  - Constraints
  - Disjunctions in rule heads
- Implementation
  - DLV-Complex: allows functions in arguments
  - Program written from scratch: rules must be 'safe'
- Results
  - Scales over small regions
  - Useful for detecting inconsistencies in data
- More research needed
  - More efficient programs
  - Use of a relational database backend
  - Further exploration of ASP semantics
  - Genomic rules have many exceptions

Prolog implementation
- Removes: rules that cause cycles with backtracking
- Implementation
  - Optional use of Nested Containment List library (C + SWI FLI)
- Results
  - Results can be incomplete due to missing rules, e.g. intron :- exon, but not exon :- intron
  - Ruleset can be tailored to the dataset
  - Scales over medium-sized datasets

Hybrid Prolog-relational implementation
- Uses the same program as the Prolog implementation
- Relational database stores facts (extensional)
  - can be distributed
- Uses sql_compiler + mappings to genomics databases
  - Ensembl
  - Chado
- Non-recursive prolog rules are dynamically translated to complex SQL
- Recursive subclass rules
  - translated by the query compiler using UNIONs
  - precomputed and stored in the relational database
- Scales to full genomes

LP for genomics: conclusions
- No one paradigm is perfect
  - Many axioms cannot be expressed in OWL, but the tools are good
  - Disjunctive Datalog is good for consistency checking in small regions
  - More research required on the efficiency of the tabling solution and ASPs
  - WAM solution is the most efficient
  - Manually rewriting programs is tedious!
- Hybrid solutions are useful
  - RDBs for asserted facts

Application: match.com for diseases
- Organisms have phenotypes: characteristics under the control of the genes of that organism
- Related genes can have similar phenotypic effects, even when the genes' last common ancestor lived 500 million years ago
- Finding these genes can help understand disease evolution

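Semantic similarity
- Given a collection of
  - features $F = \{f_1, f_2, \ldots\}$
  - attributes $A = \{a_1, a_2, \ldots\}$
  - feature-attribute mappings drawn from $F \times A$, with $a(f) \subseteq A$ the set of attributes of feature $f$
- For any feature pair $x, y$, calculate
  - Jaccard coefficient: $|a(x) \cap a(y)| \,/\, |a(x) \cup a(y)|$
  - maximum IC: $IC(a) = -\log_2 p(a)$ and $maxIC(x,y) = \max\{IC(a) : a \in a(x) \cap a(y)\}$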

SWI-Prolog implementation
- Uses GMP
  - ordinary prolog programs get unbounded integer arithmetic
  - allows fast bitwise implementations of set intersection/union
- Encode feature attribute lists as integers
  - $m : A \to \{0, \ldots, |A|-1\}$
  - $a_i(f) = \sum_{a \in a(f)} 2^{m(a)}$
- Set intersection and union are computed using bitwise and/or
- Fast implementation of the Jaccard coefficient

    J is popcount(A1 /\ A2) / popcount(A1 \/ A2)
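The bit-vector encoding can be sketched in a few lines of SWI-Prolog. This is an illustration under assumptions, not the Blipkit code: the predicate names `attribute_index/2`, `feature_attribute/2`, `feature_vector/2` and `jaccard/3` are hypothetical, and the two features with four attributes are toy data. It relies on SWI-Prolog's unbounded integers and its `popcount` arithmetic function.

```prolog
:- use_module(library(apply)).   % foldl/4

% The mapping m : A -> {0, ..., |A|-1}, one bit position per attribute.
attribute_index(a1, 0).
attribute_index(a2, 1).
attribute_index(a3, 2).
attribute_index(a4, 3).

% Toy feature-attribute mappings.
feature_attribute(f1, a1).
feature_attribute(f1, a2).
feature_attribute(f1, a3).
feature_attribute(f2, a2).
feature_attribute(f2, a3).
feature_attribute(f2, a4).

% set_bit(+Index, +V0, -V): V is V0 with bit Index switched on.
set_bit(Index, V0, V) :-
    V is V0 \/ (1 << Index).

% feature_vector(+F, -V): V is the sum of 2^m(a) over the attributes of F.
feature_vector(F, V) :-
    findall(I, ( feature_attribute(F, A), attribute_index(A, I) ), Is),
    foldl(set_bit, Is, 0, V).

% jaccard(+F1, +F2, -J): |intersection| / |union| via bitwise and/or.
% Fails if neither feature has any attribute (empty union).
jaccard(F1, F2, J) :-
    feature_vector(F1, V1),
    feature_vector(F2, V2),
    Union is popcount(V1 \/ V2),
    Union > 0,
    J is popcount(V1 /\ V2) / Union.

% ?- feature_vector(f1, V).
% V = 7.
%
% ?- jaccard(f1, f2, J).
% J = 0.5.      % with SWI-Prolog's default arithmetic flags
```

In practice the vectors would be computed once per feature and cached, so that each pairwise comparison costs only two bitwise operations and two population counts.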
