KnetMiner Overview Oct 2017


KnetMiner, with a silent "K" and standing for Knowledge Network Miner, is a suite of open-source software tools developed at Rothamsted Research for integrating and visualising large biological datasets in order to accelerate gene discovery. The software mines the myriad databases that describe an organism’s biology to present links between relevant pieces of information, such as genes, biological pathways, phenotypes or publications. The aim is to provide leads for scientists who are investigating the molecular basis of a particular trait or ways of improving the organism’s performance.

Published in: Science

  1. KnetMiner – Knowledge Network Miner
     Keywan Hassani-Pak, @KnetMiner
  2. About Rothamsted Research
     • Rothamsted is the longest-running agricultural research station in the world (est. 1843)
     • Strategic research to address global food security demands
     • Improving crops to be tolerant to drought, heat and pests while still providing optimum nutrition
     • Interdisciplinary research from gene to field
  3. Outline
     1. Routes to candidate gene discovery
     2. Building genome-scale knowledge networks
     3. Overview and demo of KnetMiner
     4. Extending networks with text-mining
     5. Candidate gene prioritization
     6. Discussion
  4. Routes to candidate gene discovery
  5. Routes to candidate gene discovery
     Many gene discovery routes can identify candidate genes for complex traits.
     [Figure: flow diagram with the labels Phenotype; (1) Genetic Methods; (2) Gene Expression; (3) Research Literature; (4) Published Data; Candidate Genes; Prioritization; Validation; Markers for Diagnosis, Breeding, GM]
  6. Quantitative Trait Locus (QTL) Mapping (route 1)
     1. Development of an experimental population
     2. Collection of phenotypic and genotypic data
     3. Construction of a linkage map
     4. Correlation of markers with the trait
     5. Identification of QTL
     A QTL region can encompass tens to hundreds of genes. How should they be prioritized?
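The marker/trait correlation step can be illustrated with a deliberately simplified single-marker scan. This is only a sketch: real QTL mapping relies on a linkage map and interval-based statistics, and the marker names, genotype calls and phenotype values below are invented toy data, not taken from the slides.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic marker carries no information
    return cov / math.sqrt(var_x * var_y)


def rank_markers(genotypes, phenotype):
    """Order markers by the absolute correlation of their calls with the trait."""
    scores = {m: abs(pearson(calls, phenotype)) for m, calls in genotypes.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): one phenotype value and one 0/1 call per individual.
phenotype = [10, 12, 20, 22, 11, 21]
genotypes = {
    "m1": [0, 0, 1, 1, 0, 1],  # tracks the trait closely
    "m2": [0, 1, 0, 1, 0, 1],  # weaker association
    "m3": [1, 1, 1, 1, 1, 1],  # monomorphic
}
ranking = rank_markers(genotypes, phenotype)
```

Markers at the top of `ranking` point to a region of interest, which is where the prioritization question on the slide begins.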
  7. Genome-Wide Association Studies (GWAS)
     [Figure: GWAS results for the phenotypes AvrRpm1, FLC gene expression (FLC) and Leaf Number (LN22); Atwell et al., Nature 2010]
     • GWAS results can be simple or complex to interpret
     • Peaks can be diffuse, covering several hundred kb without a clear centre
     • Causal polymorphisms do not always have the strongest association
  8. Gene expression analysis (route 2)
     [Figure: expression of genes compared across different tissues or genotypes and across time points after infection or treatment]
     • Gene expression studies can be complex to interpret
     • Hundreds to thousands of differentially expressed genes are somehow related to the phenotype
     • What are the key pathways leading to the observed phenotypes?
  9. Text Mining - Trait and Gene Functions (route 3)
     • Publications (free text) are the most up-to-date resource for information
     • Finding sentences that link phenotypes (flowering time) and gene function (circadian clock) to genes (CONSTANS)
     • Term variability and ambiguity can produce missing or false associations
  10. Life Sciences Databases (route 4)
      • Plethora of public life sciences databases in various formats
      • Databases are constantly growing in size and content
      • Challenging to keep up to date with the growing body of knowledge
      • 1500+ databases published in NAR
  11. Which associations (genes) are worth following up?
      • Often a highly subjective decision
      • Evaluation of all available information is expensive
      How is genotype translated to phenotype?
      • Often involves direct and indirect interactions
      • Data integration and knowledge discovery are technically challenging
  12. Building genome-scale knowledge networks
  13. Biological knowledge network/graph
      • Genotype: QTL, GWAS
      • Omics: transcriptomics, proteomics, metabolomics
      • Phenotype: disease, development, stress tolerance
      These are integrated into a Biological Knowledge Network:
      • Prior knowledge
      • Structured, unstructured data
      • Cross-species data
  14. The approach is generic and works similarly for other species
  15. Ondex – Data Integration Platform
      • Free and open source
      • Data warehousing using a graph database
      • Platform to integrate public and private datasets in various formats
      • Provides a GUI, CLI, APIs and workflows for reproducible data integration
  16. Let’s start with some GWAS data…
      Example: Arabidopsis, with 66,816 SNPs, 27,502 genes and 107 phenotypes
  17. … transform into a network in which each (SNP) node is linked to a (Phenotype) node by an "associated" edge
  18. Biological interaction datasets
  19. … add biological interactions to the (SNP) associated (Phenotype) network
  20. … add differential gene expression data (early vs late flowering)
  21. … add other linked data
      • Gene-GO
      • Gene-Phenotype
      • Gene knock-out or overexpression
      • Text mining publications
      • Gene-Publication
      • Gene-Pathway
      • Gene-Expression
      • Protein-Small Molecule
      • Homology to other species
      Result: a Genome-Scale Knowledge Network (GSKN) with >800k nodes and >3 million edges
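The layering described on slides 16 to 21 can be pictured as a typed graph that grows one dataset at a time. The sketch below is not the Ondex API; the `KnowledgeGraph` class, the node identifiers and the relation names other than "associated" are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny typed graph: nodes carry a concept type, edges a relation type."""

    def __init__(self):
        self.node_type = {}            # node id -> concept type
        self.edges = defaultdict(set)  # node id -> {(relation, neighbour id)}

    def add_node(self, node_id, concept_type):
        self.node_type[node_id] = concept_type

    def add_edge(self, source, relation, target):
        for node in (source, target):
            if node not in self.node_type:
                raise KeyError(f"unknown node: {node}")
        # Store both directions so the graph can be walked from either end.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def neighbours(self, node_id, concept_type=None):
        """Sorted neighbour ids, optionally restricted to one concept type."""
        return sorted({
            other for _, other in self.edges[node_id]
            if concept_type is None or self.node_type[other] == concept_type
        })


kg = KnowledgeGraph()
for node_id, concept_type in [
    ("SNP_1", "SNP"), ("GENE_A", "Gene"), ("GENE_B", "Gene"),
    ("flowering time", "Phenotype"), ("PMID_X", "Publication"),
]:
    kg.add_node(node_id, concept_type)

kg.add_edge("SNP_1", "associated", "flowering time")  # GWAS layer
kg.add_edge("GENE_A", "has_variation", "SNP_1")       # hypothetical SNP-to-gene link
kg.add_edge("GENE_A", "interacts_with", "GENE_B")     # interaction layer
kg.add_edge("GENE_B", "published_in", "PMID_X")       # literature layer
```

Each `add_edge` call corresponds to one of the "… add" steps: the graph stays the same structure while new evidence types are attached to it.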
  22. Same principles for other species
      [Figure: knowledge graph of the human gene LRRK2]
  23. How to search and interpret too much information?
      • Methods are needed to evaluate millions of relationships in the knowledge network, prioritize genes and extract relevant subnetworks
      • Interactive and exploratory tools are needed to enable knowledge discovery and decision making
      • Interpretation should be the task of domain experts, i.e. biologists!
  24. Overview and Demo of KnetMiner
  25. KnetMiner System Overview
      [Figure: architecture diagram with the components Web Browser (JavaScript views), Apache Tomcat (servlets and JSP page), Multithreaded Java Server, Ondex API and Knowledge Graph DB; HTML, JSON, XML and images travel over HTTP via Ajax, and the diagram labels two Java Socket endpoints]
      Client
      • Compatibility with all major web browsers
      • Based on D3.js, Cytoscape.js, Node.js
      • Interactive and touch-enabled
      Server
      • Fast and scalable multithreaded Java server
      • Pre-indexing of the knowledge graph
      • Scoring and information extraction
  26. KnetMiner UI Overview: Search, Select, Explore
  27. Google-like search interface
      • Search the knowledge graph using trait-based keywords
      • Real-time user feedback and query suggestions
      [Figure labels: trait-related keywords; query term suggestions]
  28. KnetMiner Map View (GenoMaps)
      • New touch-friendly web app for the Map View in KnetMiner
      • Visualize genes, SNPs, QTL and GWAS data
      • Select genes that lie within QTL regions and overlap with SNPs, and explore their network
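Selecting genes inside a QTL region that also overlap SNPs comes down to interval overlap on genomic coordinates. The following is a minimal sketch of that idea, not GenoMaps code; the `Feature` type, the function names and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A genomic feature with inclusive start and end coordinates."""
    name: str
    chromosome: str
    start: int
    end: int


def overlaps(a, b):
    """True if two features share at least one position on the same chromosome."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def genes_in_qtl(genes, qtl, snps=()):
    """Names of genes overlapping the QTL; if SNPs are given, a gene must also overlap one."""
    hits = [g for g in genes if overlaps(g, qtl)]
    if snps:
        hits = [g for g in hits if any(overlaps(g, s) for s in snps)]
    return [g.name for g in hits]


# Toy coordinates (hypothetical).
qtl = Feature("qtl1", "chr1", 1000, 5000)
genes = [
    Feature("geneA", "chr1", 1200, 1800),
    Feature("geneB", "chr1", 4900, 5600),
    Feature("geneC", "chr1", 6000, 7000),
    Feature("geneD", "chr2", 1200, 1800),
]
snps = [Feature("snp1", "chr1", 1500, 1500), Feature("snp2", "chr1", 6500, 6500)]

in_region = genes_in_qtl(genes, qtl)
in_region_with_snp = genes_in_qtl(genes, qtl, snps)
```

A linear scan like this is fine for a handful of regions; a genome-scale tool would more likely use an interval index, which is outside the scope of this sketch.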
  29. KnetMiner Network View (KnetMaps)
      • Touch-friendly web app for the Network View in KnetMiner
      • Explore networks linking genes to proteins, SNPs, phenotypes, publications, etc.
  30. Extending knowledge networks with text-mining
  31. Text-mining workflow in Ondex
      • Ondex plugins to extract structured information from unstructured free text
      • Developed workflows to enrich knowledge networks with novel links using the scientific literature
      Workflow stages:
      • Import: Ondex Graph, PubMed, Ontology, Tabular
      • Mapping: NER-method, Concept Class
      • Transformer: Publication, Abstract, Sentence
      • Filter: Relation Type, Attribute Value, Unconnected
      • Export: OXL, RDF, JSON
      (Hassani-Pak et al., JIB 2010)
  32. Ondex text-mining method
      Input data
      • 27,416 Arabidopsis gene names from Phytozome
      • 52,561 abstracts from PubMed that contain "Arabidopsis"
      • 22,201 curated citations from TAIR
      • 1,349 Trait Ontology (TO) terms from Planteome
      [Figure: text mining relates Gene and TO concepts (x, y) to Publication concepts (A, B) through occurs_in and published_in relations and derives a weighted association network between x and y; diagram annotations: IP=1.7; M=1.2; N=2]
      (Hassani-Pak et al., JIB 2010)
  33. Text-mining output
      These steps connect 5,553 Arabidopsis genes to 409 TO terms based on 18,341 co-citations (12,190 at sentence level)
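The distinction between co-citations at abstract level and at sentence level can be sketched with plain string matching. This is an illustration only and not the Ondex NER plugin: the function names are hypothetical, matching is naive whole-word lookup, and the two abstracts are invented toy sentences.

```python
import re
from collections import Counter


def split_sentences(abstract):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def mentions(terms, text):
    """Terms that occur in the text as whole words, ignoring case."""
    lowered = text.lower()
    return {t for t in terms if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)}


def co_citations(abstracts, genes, traits):
    """Count gene-trait pairs that co-occur per abstract and per sentence."""
    abstract_level, sentence_level = Counter(), Counter()
    for abstract in abstracts:
        for g in mentions(genes, abstract):
            for t in mentions(traits, abstract):
                abstract_level[(g, t)] += 1
        for sentence in split_sentences(abstract):
            for g in mentions(genes, sentence):
                for t in mentions(traits, sentence):
                    sentence_level[(g, t)] += 1
    return abstract_level, sentence_level


# Toy input (invented for illustration).
abstracts = [
    "CONSTANS promotes flowering time responses under long days. "
    "The circadian clock gates its expression.",
    "FLC is a floral repressor. "
    "Natural variation in flowering time was mapped in several accessions.",
]
genes = ["CONSTANS", "FLC"]
traits = ["flowering time"]
abstract_level, sentence_level = co_citations(abstracts, genes, traits)
```

Here FLC and "flowering time" share an abstract but not a sentence, so the pair is counted only at abstract level, which is why sentence-level counts are the smaller and stricter set.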
  34. Text-mining discussion
      • The TM method is flexible and can easily enhance data integration workflows and knowledge networks
      • TM is one of many evidence types in a knowledge network
      • TM provides access to brand-new information that is not yet available in structured databases
      • Modest post-TM filtering is required to retain high-quality relations
      • TM for gene-phenotype adds 12k high-quality relations that were previously absent from the knowledge network
  35. Candidate Gene Prioritization
  36. Definition of gene-evidence network
      1. Gene-evidence network: biologically plausible paths (semantic motifs) that start with a Gene node and end with Evidence nodes; e.g. 57 semantic motifs were defined in the wheat network
      2. Gene-evidence networks are extracted using the Metadata-based Graph Query Engine (Hindle 2012)
      3. Evidence nodes can be part of one (high specificity) or many (low specificity) gene-evidence networks
      Example:
      • Gene-evidence network of Gene X contains 5 nodes
      • Neighbourhood network (n=3) of Gene X contains 9 nodes
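A gene-evidence network differs from a plain neighbourhood in that only paths matching an allowed sequence of concept types are followed. The sketch below illustrates that idea with a small path search; it is not the Metadata-based Graph Query Engine, and the graph, node names and the two motifs are hypothetical.

```python
def matching_paths(graph, node_type, start, motif):
    """Yield simple paths from `start` whose node types spell out `motif`."""
    if node_type[start] != motif[0]:
        return
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) == len(motif):
            yield path
            continue
        wanted = motif[len(path)]
        for nxt in sorted(graph.get(path[-1], ())):
            if node_type[nxt] == wanted and nxt not in path:
                stack.append(path + [nxt])


def gene_evidence_network(graph, node_type, gene, motifs):
    """All nodes lying on a complete motif path that starts at `gene`."""
    nodes = {gene}
    for motif in motifs:
        for path in matching_paths(graph, node_type, gene, motif):
            nodes.update(path)
    return nodes


# Toy undirected graph as adjacency sets (hypothetical).
graph = {
    "geneX": {"protX", "pub1", "geneY"},
    "protX": {"geneX", "pathway1", "protZ"},
    "pathway1": {"protX"},
    "pub1": {"geneX"},
    "geneY": {"geneX", "trait1"},
    "trait1": {"geneY"},
    "protZ": {"protX"},
}
node_type = {
    "geneX": "Gene", "geneY": "Gene", "protX": "Protein", "protZ": "Protein",
    "pathway1": "Pathway", "pub1": "Publication", "trait1": "Trait",
}
motifs = [("Gene", "Publication"), ("Gene", "Protein", "Pathway")]
network_x = gene_evidence_network(graph, node_type, "geneX", motifs)
```

In this toy graph `geneY`, `trait1` and `protZ` sit within a few hops of `geneX` but are left out because no motif leads to them, mirroring the slide's contrast between the gene-evidence network and the larger neighbourhood network.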
  37. Searching gene-evidence networks for keywords
      1. The knowledge graph is indexed and searched for user search terms using Lucene
      2. A proportion of nodes in the gene-evidence network can contain the search term
      [Figure: user search terms (auxin, cytokinin, strigolactone, CCD, MAX, subapical shoots, axillary branching, shoot branching pathway) matched against the gene-evidence network of Gene X]
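The two steps above (index once, then intersect keyword hits with a gene's evidence nodes) can be sketched with a plain inverted index. KnetMiner uses Lucene for this; the code below is a simplified stand-in written from scratch, and the function names, node ids and texts are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Inverted index: token -> ids of the nodes whose text contains it."""
    index = defaultdict(set)
    for node_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(node_id)
    return index


def search(index, query):
    """Node ids containing any of the query keywords (OR semantics)."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits


def matching_evidence(index, query, gene_network):
    """Evidence nodes of one gene that mention the search terms."""
    return search(index, query) & gene_network


# Toy node texts (hypothetical).
documents = {
    "pub1": "MAX genes control shoot branching",
    "pathway1": "strigolactone biosynthesis",
    "trait1": "grain colour",
}
index = build_index(documents)
network_x = {"geneX", "pub1", "pathway1"}
hits_x = matching_evidence(index, "strigolactone branching", network_x)
```

Only the subset of evidence nodes in `hits_x` feeds into scoring, which matches the slide's remark that a proportion of the gene-evidence network contains the search term.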
  38. Gene scoring function (KNETScore)
      1. Uses TF*IDF (Spärck Jones, 1972) to rank documents in the gene-evidence network by their relevance to a search term
      2. Uses the specificity of documents to a gene (IGF: Inverse Gene Frequency)
      3. Uses the frequency of evidence concepts, normalised by the size of the gene-evidence network (EDF: Evidence Document Frequency)
      4. Calculates KNETScore (TFIDF*EDF*IGF) for every gene
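The slide names the three factors but not their exact formulas, so the sketch below is one possible reading and not KnetMiner's implementation. It assumes TF is the term's share of a document's tokens, IDF and IGF are smoothed logarithms, EDF is the fraction of a gene's evidence documents that match, and the per-document TF*IDF*IGF values are summed before multiplying by EDF. All names and data are hypothetical.

```python
import math


def knet_score_sketch(gene, networks, doc_tokens, term):
    """Assumed form: EDF(gene) * sum over matching docs of TF * IDF * IGF."""
    n_docs = len(doc_tokens)
    n_genes = len(networks)
    df = sum(1 for tokens in doc_tokens.values() if term in tokens)
    if df == 0:
        return 0.0
    idf = math.log(1 + n_docs / df)  # rarer terms weigh more (assumed smoothing)

    evidence = networks[gene]
    matching = [d for d in sorted(evidence) if term in doc_tokens[d]]
    if not matching:
        return 0.0
    edf = len(matching) / len(evidence)  # normalised by network size

    total = 0.0
    for d in matching:
        tf = doc_tokens[d].count(term) / len(doc_tokens[d])
        genes_with_d = sum(1 for docs in networks.values() if d in docs)
        igf = math.log(1 + n_genes / genes_with_d)  # documents specific to few genes weigh more
        total += tf * idf * igf
    return edf * total


# Toy data (hypothetical): evidence documents per gene and tokens per document.
networks = {"geneA": {"d1", "d2"}, "geneB": {"d1", "d3", "d4", "d5"}}
doc_tokens = {
    "d1": ["auxin", "transport"],
    "d2": ["auxin", "response"],
    "d3": ["cell", "wall"],
    "d4": ["lipid", "metabolism"],
    "d5": ["auxin", "signalling"],
}
score_a = knet_score_sketch("geneA", networks, doc_tokens, "auxin")
score_b = knet_score_sketch("geneB", networks, doc_tokens, "auxin")
```

Both toy genes have two evidence documents mentioning "auxin", but `geneA` has the smaller gene-evidence network and therefore the higher score under these assumptions.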
  39. Gene ranking – Example
      Two genes have a similar number of evidence documents containing the search terms, but the left gene (score: 5.72) scores higher than the right gene (score: 2.71) because it has a smaller gene-evidence network and more specific evidence documents.
  40. Discussion – Candidate Gene Prioritization
      • In a use-case study, KNETScore ranked the causal gene 3rd out of 75 genes within a petal size QTL
      • High overlap between the KnetMiner top 100 genes for the "gibberellin" and "lipid" search terms and curated gene lists
      • Smart pre-indexing of the knowledge network has reduced the computation of the score from O(2n(|V|+|E|)) to O(1)
      • Many ways to improve the scoring function, e.g. using weights for different evidence types, the distance of evidence to gene and edge-attribute information
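The pre-indexing point can be illustrated by moving graph traversal to start-up time and answering queries from lookup tables. This sketch uses a depth-limited breadth-first search as a stand-in for the motif-based extraction, and the function names, graph and depth limit are hypothetical; it does not reproduce KnetMiner's actual index.

```python
from collections import deque


def reachable(graph, start, max_depth):
    """Nodes within `max_depth` hops of `start`, found by breadth-first search."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def build_indexes(graph, genes, max_depth=2):
    """Traverse once per gene up front and store both lookup directions."""
    gene_to_evidence, evidence_to_genes = {}, {}
    for gene in genes:
        evidence = reachable(graph, gene, max_depth) - {gene}
        gene_to_evidence[gene] = evidence
        for node in evidence:
            evidence_to_genes.setdefault(node, set()).add(gene)
    return gene_to_evidence, evidence_to_genes


# Toy undirected graph as adjacency sets (hypothetical).
graph = {
    "g1": {"p1"}, "p1": {"g1", "pw1"},
    "g2": {"p2", "pub1"}, "p2": {"g2", "pw1"},
    "pw1": {"p1", "p2"}, "pub1": {"g2"},
}
gene_to_evidence, evidence_to_genes = build_indexes(graph, ["g1", "g2"])

# At query time no traversal is needed: both questions are dictionary lookups.
genes_for_pathway = evidence_to_genes["pw1"]
evidence_for_g1 = gene_to_evidence["g1"]
```

The traversal cost is paid once when the indexes are built; each later lookup is an average-case constant-time dictionary access, at the price of extra memory and a rebuild whenever the graph changes.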
  41. Summary
      • Web application for very fast search of large genome-scale knowledge graphs
      • Ranking of candidate genes based on knowledge mining
      • Interactive visualisation of genome and knowledge maps
      • Facilitates knowledge discovery and hypothesis generation
  42. KnetMiner – Makes Gene Discovery Faster & Fun
      • International academic collaborations
      • Interest from industry and start-ups
  43. KnetMiner 2.0 – BBSRC BBR (GCRF) Proposal
      A pangenomic and network-based approach to search for novel genes and clues to design better rice varieties.
      [Figure: diagram with the labels SNP-Seek (genetic diversity, novel traits, phenotype data); Ensembl Plants (reference genomes, model species, homology data); KnetMiner 2.0 (interactions, pathways, literature); Scientist/Breeder (novel genes, better crops, faster discoveries); Data, Information, Knowledge, Insight]
  44. Acknowledgements
      John Doonan, Sergio Feingold, Martin Castellote, Uwe Scholz, Matthias Lange, Keywan Hassani-Pak, Ajit Singh, Marco Brandizi, Monika Mistry, Lisa Lill, Chris Rawlings, Dave Edwards, Philipp Bayer, Misha Kapushesky, Kevin Dialdestoro, Jan Taubert, Artem Lysenko, Matthew Hindle, Catherine Canevet, Ramil Mauleon, Kenneth McNally, Nickolai Alexandrov, Andy Law
      @KnetMiner