Exploiting semantic networks of public data for systems chemical biology

Presentation Transcript

  • Indiana University School of Informatics and Computing. Exploiting semantic networks of public data for systems chemical biology. “Information is cheap. Understanding is expensive” (Karl Fast). David Wild, http://djwild.info, Assistant Professor and Director, Cheminformatics & Chemogenomics Research Group (CCRG), Indiana University School of Informatics and Computing, djwild@indiana.edu
  • What do we mean by system? For our purposes, the network of relationships of chemicals, drugs, targets, genes, expression profiles, pathways, (publications), diseases and side-effects in the body
  • What’s a semantic network? A network of nodes and edges represented in RDF format, annotated with node / edge labels using an ontology (OWL). Stored in an RDF triple store. Searchable using the SPARQL query language.
  • What we have created at Indiana
      • A Semantic Linked Dataset called Chem2Bio2RDF that integrates multiple public experimental and literature-derived datasets relating to chemical compounds, drugs, targets, genes, expression profiles, pathways, diseases and drug side effects
      • A set of semantic algorithms and tools for visualizing, analyzing and predicting relationships in the semantic linked data and relating this data to publications
  • Semantic Technologies: an enabler for integration. Allows simple, flexible description of heterogeneous graphs of data relationships (RDF), optionally following the rules of an ontology (OWL)
      Strengths
      • Merging datasets and moving data between repositories is technically straightforward: dataset mappings are themselves described in RDF (and OWL). RDF and OWL are highly standardized and allow precise representation
      • Powerful cross-dataset searching with SPARQL
      • Increasing availability of powerful off-the-shelf searching and visualization tools (TopBraid, etc)
      • Allows application of graph theory algorithms to data
      • Can express data provenance in RDF
      Weaknesses
      • Just emerging from the early-adopter phase; received bad press in pharma because it was hyped too early
      • Triple stores historically less efficient than relational DBMSs (but rapidly changing)
      • Most focus has been on data and integration rather than algorithms to use the data
      • Difficulty weighting edges in a relational graph
      Systems Chemical Biology and Semantic Technologies map quite nicely as they are both about complex networks. http://blog.project-sierra.de/archives/1639
  • Current activity in Semantic Web & Drug Discovery
      • OpenPHACTS (www.openphacts.org): >€10m European project to create an “open pharmacological space” (OPS) using triple stores and Semantic Web technologies. Chem2Bio2RDF has been integrated into the OPS
      • SW HCLSIG (http://www.w3.org/wiki/HCLSIG): W3C special interest group that has created BioRDF (RDF representations of biological data) and LODD (Linking Open Drug Data)
      • CSHALS (http://www.iscb.org/cshals2011): Conference on semantics in healthcare and the life sciences
      • Pistoia Alliance (http://www.pistoiaalliance.org/): Industry alliance for collaboration and integration of drug discovery data
      • JCI RDF in chemistry (http://www.jcheminf.com/series/acsrdf2010): Journal of Cheminformatics thematic series
  • Systems chemical biology + Semantic Web Drug Discovery Today, 2012, in press.
  • Big Data in the public domain. There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, the public domain holds information on:
      • ~30 million compounds and ~500,000 bioassays (PubChem, ChemSpider)
      • ~60 million compound bioactivities (PubChem Bioassay, ChEMBL, Matador, etc)
      • ~5,000 drugs (DrugBank)
      • ~9 million protein sequences (SwissProt) and ~60,000 3D structures (PDB)
      • ~14 million human nucleotide sequences (EMBL)
      • ~20 million life science publications (PubMed)
      • Multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …)
  • Chem2Bio2RDF – www.chem2bio2rdf.org
      • Semantically integrates 42 heterogeneous public datasets related to drug discovery in a fast Virtuoso triple-store with SPARQL endpoint (linked from main site)
      • Datasets cover chemistry, chemogenomics, biology, systems & pathways, pharmacology, phenotypes, toxicology, glycomics and publications, and biological entities of compounds, drugs, targets, genes, pathways, diseases and side-effects
      • Major datasets include PubChem, ChEMBL, DrugBank, PharmGKB, BindingDB, STITCH, CTD, KEGG, SWISSPROT, PDB, SIDER, PubMed. Full set at http://chem2bio2rdf.wikispaces.com/Datasets
      • Holds data on ~31m chemical structures, ~5,000 marketed drugs, ~59m bioactivity data points and ~19m publications
      • Linked into LOD cloud, and may form part of OpenPHACTS repository
      • Permits SPARQL searching using Chem2Bio2OWL ontology. For more information, see BMC Bioinformatics 2010, 11, 255.
  • Representing inter-dataset relationships in RDF
      • RDF describes noun-verb-noun relationships in many formats
          • CETRORELIX is_active_against HCGR
          • <CETRORELIX> <is_active_against> <HCGR>
      • Many types of relationship can be described, including heterogeneous ones. Power is increased using URIs and ontologies (OWL)
          • URI gives unique identifier for a noun or verb clause, e.g. http://chem2bio2rdf.org/drugbank/resource/drugbank_interaction/269
          • Same As relationship can map equivalent items in different datasets
          • Ontology describes valid nouns and verbs, and can describe equivalent classes
      • A set of RDF statements comprises an RDF graph
          • Nodes and edges can be labeled but not cleanly weighted (although a weighting ontology does exist)
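As a rough illustration of the noun-verb-noun idea on this slide, the sketch below stores triples as plain Python tuples and matches them against patterns with wildcards. Only the CETRORELIX / is_active_against / HCGR statement comes from the slide; the `same_as` and `has_label` predicates, the `drugbank:EXAMPLE_ID` identifier and the function names are hypothetical, and a real deployment would use an RDF triple store rather than a Python set.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Toy graph. The first triple is the slide's own example; the other two are
# hypothetical and exist only so the functions below have something to find.
GRAPH: Set[Triple] = {
    ("CETRORELIX", "is_active_against", "HCGR"),
    ("CETRORELIX", "same_as", "drugbank:EXAMPLE_ID"),    # hypothetical identifier
    ("drugbank:EXAMPLE_ID", "has_label", "Cetrorelix"),  # hypothetical statement
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def equivalents(graph: Set[Triple], node: str) -> Set[str]:
    """Collect the nodes linked to `node` by same_as, in either direction."""
    seen = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        linked = {o for _, _, o in match(graph, s=current, p="same_as")}
        linked |= {s for s, _, _ in match(graph, p="same_as", o=current)}
        for other in linked - seen:
            seen.add(other)
            frontier.append(other)
    return seen
```

Calling `match(GRAPH, p="is_active_against")` returns the slide's example statement, and `equivalents` shows how a Same As link lets a query started in one dataset pick up statements made about the equivalent item in another.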
  • Example RDF relationships in Chem2Bio2RDF
  • Chem2Bio2OWL – Semantic annotation
      • Ontology describes meaning independent of dataset. Data dependent relationships are then mapped to classes in the ontology (“annotation”). Example: the drugbank:DrugBankTarget maps to “Binding” class
      • Described in Chen et al., Journal of Cheminformatics, 2012, 4:6 and http://chem2bio2owl.wikispaces.com
      • Fills a gap in current ontologies: covers relationship of chemical compounds and drugs to targets, genes, assays and side-effects
      • Aligned with other ontologies: released on NCBO Bioportal (http://bioportal.bioontology.org/ontologies/1615)
      • Simplifies SPARQL searching by integrating equivalent classes across datasets (no longer need to explicitly specify datasets and fields)
      • Increases power of SPARQL searching allowing inclusion of data and relational classes (e.g. activator vs antagonist)
  • SPARQL – a semantic query language
      PREFIX pubchem: <http://chem2bio2rdf.org/pubchem/resource/>
      PREFIX kegg: <http://chem2bio2rdf.org/kegg/resource/>
      PREFIX uniprot: <http://chem2bio2rdf.org/uniprot/resource/>
      SELECT ?compound_cid (count(?compound_cid) as ?active_assays)
      FROM <http://chem2bio2rdf.org/pubchem>
      FROM <http://chem2bio2rdf.org/kegg>
      FROM <http://chem2bio2rdf.org/uniprot>
      WHERE {
        ?bioassay pubchem:CID ?compound_cid .
        ?bioassay pubchem:outcome ?activity . FILTER (?activity=2) .
        ?bioassay pubchem:Score ?score . FILTER (?score>50) .
        ?bioassay pubchem:gi ?gi .
        ?uniprot uniprot:gi ?gi .
        ?pathway kegg:protein ?uniprot .
        ?pathway kegg:Pathway_name ?pathway_name . FILTER regex(?pathway_name,"MAPK signaling pathway","i") .
      }
      GROUP BY ?compound_cid
      HAVING (count(*)>1)
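To make the logic of the query on this slide easier to follow, here is a small plain-Python sketch of the same join, filter, group and HAVING steps over toy in-memory rows. All data values, table names and the function name are hypothetical; the thresholds (outcome code 2, score above 50, more than one hit) are taken from the query, whose `active_assays` alias suggests that outcome code 2 marks an active result.

```python
import re
from collections import Counter

# Hypothetical toy rows standing in for the three named graphs in the query.
BIOASSAYS = [  # (compound_cid, outcome, score, gi)
    (101, 2, 80, "gi1"),
    (101, 2, 65, "gi2"),
    (102, 2, 90, "gi1"),
    (103, 1, 95, "gi1"),  # outcome is not 2, so the first FILTER drops it
    (101, 2, 40, "gi1"),  # score is not above 50, so the second FILTER drops it
]
UNIPROT_BY_GI = {"gi1": "P_A", "gi2": "P_B", "gi3": "P_C"}
PATHWAY_PROTEINS = [  # (pathway_name, uniprot_id)
    ("MAPK signaling pathway", "P_A"),
    ("MAPK signaling pathway", "P_B"),
    ("Apoptosis", "P_C"),
]


def compounds_active_in_pathway(pattern, active_code=2, min_score=50):
    """Return {compound_cid: hit_count} for compounds with more than one hit."""
    regex = re.compile(pattern, re.IGNORECASE)  # mirrors the "i" regex flag
    hits = Counter()
    for cid, outcome, score, gi in BIOASSAYS:
        if outcome != active_code or score <= min_score:
            continue
        protein = UNIPROT_BY_GI.get(gi)  # join bioassay to protein on gi
        for name, pathway_protein in PATHWAY_PROTEINS:
            # join protein to pathway, then filter on the pathway name
            if pathway_protein == protein and regex.search(name):
                hits[cid] += 1
    # GROUP BY compound, HAVING count(*) > 1
    return {cid: n for cid, n in hits.items() if n > 1}
```

With the toy rows, only compound 101 has two qualifying assay results against proteins in a pathway whose name matches "MAPK signaling pathway", so it is the only compound returned.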
  • Semantic network algorithms & tools from IU
      • Association Search – visualize literature supported associations between any two entities (compound, drug, gene, pathway, disease, side effect). PLoS One, 2011, 6(12), e27506.
      • Semantic Link Association Prediction (SLAP) – find most highly associated entities (compound, drug, gene, pathway, disease, side effect) to any other entity, based on probabilistic weightings of graph edges based on public experimental datasets. PLoS Computational Biology, in review.
      • BioLDA – find most highly associated entities to any other entity based on a complex topic model analysis of the literature (PubMed). PLoS One, 2011, 6 (3), e17243
      • See also: WENDI (J. Cheminf., 2010,2,6); Chemogenomic Explorer (BMC Bio. 2011,12,256), ChemLDA, ChemBioGrid (J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307)
      • All algorithms and tools available on http://djwild.info
  • Association Search: Identifies genes specifically involved in the relationship of drug Rosiglitazone and side-effect Myocardial Infarction. Shown paths are from public datasets but have support in the literature (via BioLDA)
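The slides do not describe how Association Search finds its paths, so the following is only a generic sketch of one way to enumerate short paths between two entities in an undirected network. The intermediate nodes (GENE_X, PATHWAY_Y, GENE_Z), the edges and the function names are hypothetical placeholders, not results from the tool.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy network: intermediate nodes and all edges are placeholders.
EDGES: List[Tuple[str, str]] = [
    ("Rosiglitazone", "GENE_X"),
    ("GENE_X", "Myocardial Infarction"),
    ("Rosiglitazone", "PATHWAY_Y"),
    ("PATHWAY_Y", "GENE_Z"),
    ("GENE_Z", "Myocardial Infarction"),
]


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from an edge list."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def simple_paths(adjacency: Dict[str, Set[str]], source: str, target: str,
                 max_edges: int = 3) -> List[List[str]]:
    """Depth-first enumeration of loop-free paths with at most max_edges edges."""
    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        node = path[-1]
        if node == target:
            paths.append(list(path))
            return
        if len(path) - 1 >= max_edges:
            return
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in path:  # never revisit a node on this path
                path.append(neighbour)
                extend(path)
                path.pop()

    extend([source])
    return paths
```

Bounding the path length keeps the enumeration tractable on large networks; each returned path is a candidate explanation that would then need supporting evidence, such as the literature support mentioned on the slide.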
  • Ibuprofen and Parkinson’s Disease
      • Identified 70 genes associated with Ibuprofen and Parkinson’s disease, 9 of which are related to inflammation (IL1A, IL1B, IL1RN, IL6, LTA, NFKB1, NFKBIA, PTGS2, TNF)
      • Clear direct association between PTGS2 (COX2) and Parkinson’s Disease via CTD (leading to literature)
      • Single gene, AMBP, differentially associated with Ibuprofen and Parkinson’s Disease but not with other NSAIDs (AMBP has shown potential as a Parkinson’s biomarker)
  • Thiazolidinediones and Myocardial Infarction
      Gene/Drug   Rosiglitazone                               Troglitazone   Pioglitazone
      SAA2        Strong (“Discussed”, PharmGKB)              V. weak        V. weak
      APOE        Strong (“Discussed”, PharmGKB + Matador)    V. weak        V. weak
      ADIPOQ      Strong (Positive, PharmGKB)                 V. weak        Strong (Positive, PharmGKB)
      CYP2C8      Strong (Changes metabolism, CTD)            V. weak        Strong (Changes metabolism, CTD)
  • APOE, ADIPOQ, LDL, HDL, Rosi and Pio
      • Pioglitazone AND Rosiglitazone increase ADIPOQ, which results in increased HDL (good) cholesterol
      • Rosiglitazone only interacts with APOE and results in an increase in LDL (bad) cholesterol
  • Semantic Link Association Prediction (SLAP)
      • Predicts a probability of association of a compound and a target based on the network paths between them that involve drugs, targets, pathways, diseases, tissues, GO terms, chemical ontologies, substructure and drug side-effects
      • It can be primarily considered as a “missing link prediction”
      • Data source is a subset of the Chem2Bio2RDF network including 250,000 compounds with known bioactivities and the targets known to be associated with these drugs
      • Raw Score is a measure of the significance of a single path between a compound and target, based on topology and semantics of the path nodes and edges. Raw scores are normally distributed within a path pattern
      • Association Score is a sum of z-scores of raw scores relative to a distribution of random pair scores for different paths and path patterns. Association scores form a normal distribution
      • Association Significance is a significance p-value of an association score based on the normal distribution of association scores.
  • Example: Troglitazone and PPARG. Association score: 2385.9. Association significance: 9.06 × 10⁻⁶ => missing link predicted
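The slide defines the SLAP scores only in words, so the sketch below is one possible reading rather than the published algorithm: each raw path score is converted to a z-score against random-pair scores for the same path pattern, the z-scores are summed, and the sum is compared with a normal distribution of association scores. The function names, the use of an upper-tail p-value and the way the null distributions are supplied are all assumptions.

```python
from statistics import NormalDist, mean, stdev
from typing import Dict, List, Sequence, Tuple


def z_score(raw: float, random_scores: Sequence[float]) -> float:
    """Standardise a raw path score against random-pair scores of one pattern."""
    return (raw - mean(random_scores)) / stdev(random_scores)


def association_score(path_scores: List[Tuple[str, float]],
                      random_by_pattern: Dict[str, Sequence[float]]) -> float:
    """Sum the z-scores of all paths; path_scores holds (pattern, raw_score)."""
    return sum(z_score(raw, random_by_pattern[pattern])
               for pattern, raw in path_scores)


def association_significance(score: float, null_mean: float,
                             null_sd: float) -> float:
    """Upper-tail p-value of a score under Normal(null_mean, null_sd).

    Taking the upper tail is an assumption: it treats unusually high
    association scores as the interesting ones.
    """
    return 1.0 - NormalDist(null_mean, null_sd).cdf(score)
```

Standardising within each path pattern before summing puts paths of different kinds on a comparable scale, which appears to be the point of defining raw scores as normally distributed within a pattern.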
  • SLAP web tool http://chem2bio2rdf.org/slap
  • SLAP – target profile for Ibuprofen vs acetaminophen and aspirin. Figure labels: COX2 – main target; Regulate neurotransmitter release; COX1; Dopamine receptor; Serotonin receptors; Cannabinoid receptors; Muscarinic receptor (motor control)
  • SLAP - Biologically Similar Drugs Dopamine agonist, used in Parkinson’s Dopamine agonist, used in Parkinson’s
  • Troglitazone vs Rosiglitazone
  • Chemogenomic Explorer Uses WENDI (J Cheminf., 2010, 2:6 ) web service to generate XML of related biological data for a compound using Chem2Bio2RDF XML is converted to RDF with a WENDI ontology Applies RDF inference engine and rule set to infer compound-disease relationships based on evidence paths (e.g. similar compound is active in an assay associated with a gene which is associated with a disease). These are represented as new RDF. Facet browser allows clustering, filtering and exploration of evidence paths by disease association For more information, see Zhu, Q. et al., BMC Bioinformatics, 2011, 12:256
  • Chemogenomic Explorer Interface
  • BioLDA – semantic Bioterm literature extraction
  • BioLDA Topic Model of PubMed Literature
      • Latent Dirichlet Allocation (LDA) identifies “latent topics” by word association: a kind of fuzzy clustering. Each word can have associations with multiple topics, and has a varying degree of strength
      • Term-topic edges are labeled with probability (i.e. strength of a relationship to a topic). Term-term edges are labeled with KL-divergence (measure of distance)
      • Considered BioTerms rather than free text, and applied to 336,899 MEDLINE abstracts on 50 topics published in 2009
      • Based on work done by Jie Tang on social networks (see www.arnetminer.com)
      • More information can be found in PLoS One, 2011, 6 (3), e17243
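For readers unfamiliar with KL-divergence, the sketch below computes it between two terms' topic distributions. The slides do not say whether the term-term edges use the one-directional or a symmetrized form, so both are shown; the function names and the choice of natural logarithms are assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) in nats for two distributions over the same topics.

    Requires q[i] > 0 wherever p[i] > 0; topics with p[i] == 0 contribute 0.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetrized variant, so the distance does not depend on argument order."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

A value of zero means the two terms are spread across topics identically, and larger values mean their topic profiles differ more, which is why it can serve as a distance-like edge label.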
  • Example: Topic 10
  • BLASC Calculates KL-Divergence score for any bioterm pairs (drugs, genes, side- effects, pathways, etc) Available from http://djwild.info
  • Applications in drug discovery processes Integrative virtual screening  SLAP / BLASC association with targets and/or known ligands  Comparison with QSAR models, LBVS, Docking and Pharmacophore search  Harmonic data fusion  Applied to PXR antagonists (Univ. Cincinatti) and Mycobacterium Tubercolusis inhibition (OSDD) Polypharmacology  Drug indication network based on SLAP target profiles  Adverse effect network based on SLAP off-target profiles Searching and exploring mechanisms of action  Association search, BLASC, SLAP  Examples tested: Thiazolinediones and Myocardial Infarction; Ibuprofen and Parkinson Disease Other work in progress  Improvement of SLAP algorithms  Mapping of patient and metagenomics data
  • Try the tools out – djwild.info
  • Cheminformatics Education at Indiana University
      • Residential Ph.D. program in Informatics with a Cheminformatics specialty
      • Distance Graduate Certificate program in Chemical Informatics: http://djwild.info/ed
      • LuLu eBook - $29: http://slg.djwild.info
      • Free cheminformatics learning resources: http://icep.wikispaces.com