KnetMiner, with a silent "K" and standing for Knowledge Network Miner, is a suite of open-source software tools developed at Rothamsted Research for integrating and visualising large biological datasets in order to accelerate gene discovery. The software mines the myriad databases that describe an organism’s biology to present links between relevant pieces of information, such as genes, biological pathways, phenotypes or publications. The aim is to provide leads for scientists who are investigating the molecular basis for a particular trait or ways of improving the organism’s performance in some way.
2. About Rothamsted Research
• Rothamsted is the longest running agricultural
research station in the world (est. 1843)
• Strategic research to address global food
security demands
• Improving crops to be tolerant to drought, heat
and pests while still providing optimum
nutrition
• Interdisciplinary research from gene to field
3. Outline
1. Routes to candidate gene discovery
2. Building genome-scale knowledge networks
3. Overview and demo of KnetMiner
4. Extending networks with text-mining
5. Candidate gene prioritization
6. Discussion
5. Routes to candidate gene discovery
Many gene discovery routes can identify candidate
genes for complex traits
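[Diagram: four routes lead to candidate genes for a phenotype: (1) genetic methods, (2) gene expression, (3) research literature and (4) published data. Candidate genes then pass through prioritization and validation to yield markers for diagnosis, breeding and GM.]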
6. Quantitative Trait Locus (QTL) Mapping
1. Development of an experimental population
2. Collection of phenotypic and genotypic data
3. Construction of linkage map
4. Correlation of marker/trait
5. Identification of QTL
1
QTL region can encompass 10s to
100s of genes. How to prioritize
them?
7. Genome Wide Association Studies (GWAS)
[Figure: GWAS results for AvrRpm1, FLC gene expression (FLC) and leaf number (LN22); Atwell et al., Nature 2010]
• GWAS results can be simple or complex to interpret
• Peaks can be diffuse, covering several hundred kb without a clear centre
• Causal polymorphisms do not always show the strongest association
8. Gene expression analysis 2
[Figure: expression of genes across different tissues or genotypes, and across time points after infection or treatment]
• Gene expression studies can
be complex to interpret
• 100s to 1000s of differentially
expressed genes that are
somehow related to
phenotype
• What are the key pathways
leading to observed
phenotypes?
9. Text Mining - Trait and Gene Functions
• Publications (free text) are the
most up-to-date resource for
information
• Finding sentences that link
phenotypes (flowering time)
and gene function (circadian
clock) to genes (CONSTANS)
• Term variability and ambiguity
can produce missing or false
associations
3
10. Life Sciences Databases
• Plethora of public Life
Sciences databases in various
formats
• Databases constantly growing
in size and content
• Challenging to keep up-to-
date with growing body of
knowledge
1500+ databases published in NAR
4
11. Which associations (genes) are worth following up?
Often a highly subjective decision
Evaluation of all available information is expensive
How is genotype translated to phenotype?
Often involves direct and indirect interactions
Data integration and knowledge discovery are technically challenging
14. The approach is generic and works similarly for other species
15. Ondex – Data Integration Platform
• Free and open source
• Data warehousing using a graph-database
• Platform to integrate public and private
datasets in various formats
• Provides a GUI, CLI, APIs and workflows for
reproducible data integration
Ondex
www.ondex.org
16. Let’s start with some GWAS data…
http://plants.ensembl.org/biomart
Example Arabidopsis
#SNP=66,816 | #Gene=27,502 | #Phenotype=107
23. How to search and interpret too much information?
• Methods needed to evaluate millions of
relationships in knowledge network, prioritize
genes and extract relevant subnetworks
• Interactive and exploratory tools needed to
enable knowledge discovery and decision
making
• Interpretation should be the task of domain
experts i.e. biologists!
25. KnetMiner System Overview
[Architecture diagram: the client (web browser, JavaScript views) exchanges HTML, JSON, XML and images over HTTP via Ajax with the server (Apache Tomcat, servlets and JSP pages). The server connects through a Java socket to a multithreaded Java server, which accesses the knowledge graph DB through the Ondex API.]
Client
• Compatibility with all major
web browsers
• Based on D3.js, Cytoscape.js,
Node.js
• Interactive and touch-enabled
Server
• Fast and scalable Java multi-
threaded server
• Pre-indexing of knowledge
graph
• Scoring and information
extraction
27. Google-like search interface
Search knowledge graph using trait-
based keywords
Real-time user feedback and query
suggestions
[Screenshot labels: trait-related keywords; query term suggestions]
28. KnetMiner Map View (GenoMaps)
New touch-friendly web
App for Map View in
KnetMiner
Visualize gene, SNP, QTL and GWAS data.
Select genes that lie within QTL regions and overlap with SNPs, and explore their network
29. KnetMiner Network View (KnetMaps)
Touch-friendly web App for
Network View in KnetMiner
Explore networks linking
genes to proteins, SNPs,
phenotypes, publications,
etc.
33. Text-mining workflow in Ondex
• Ondex plugins to extract structured information from unstructured free
text
• Developed workflows to enrich knowledge networks with novel links
using the scientific literature
Import: Ondex Graph, PubMed, Ontology, Tabular
Mapping: NER-method, Concept Class
Transformer: Publication, Abstract, Sentence
Filter: Relation Type, Attribute Value, Unconnected
Export: OXL, RDF, JSON
Hassani-Pak et al., JIB 2010
34. Ondex text-mining method
Input data
• 27,416 Arabidopsis gene names from Phytozome
• 52,561 Abstracts from PubMed that contain Arabidopsis
• 22,201 curated citations from TAIR
• 1,349 Trait Ontology terms from Planteome
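[Diagram: text-mining maps concepts A and B (a Gene and a TO term) to the publications x and y in which they occur (relation types occurrs_in, published_in), producing a weighted association network between A and B; example edge attributes: IP=1.7; M=1.2; N=2]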
Hassani-Pak et al., JIB 2010
35. Text-mining output
These steps connect 5,553 Arabidopsis genes to 409 TO terms
based on 18,341 co-citations (12,190 at sentence level)
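The co-citation idea behind these numbers can be sketched in a few lines. This is a minimal illustration only: it assumes plain substring matching in place of the NER-based mapping in Ondex, a naive sentence splitter, and invented abstracts, gene names and trait terms. The function name `cocitations` is hypothetical.

```python
import re
from collections import Counter
from itertools import product


def mentions(text, terms):
    """Return the terms that appear in text (case-insensitive substring match)."""
    lowered = text.lower()
    return {term for term in terms if term.lower() in lowered}


def cocitations(abstracts, gene_names, trait_terms):
    """Count gene/trait co-occurrences per abstract and per sentence."""
    abstract_level, sentence_level = Counter(), Counter()
    for abstract in abstracts:
        # Abstract level: gene and trait anywhere in the same abstract.
        for pair in product(mentions(abstract, gene_names),
                            mentions(abstract, trait_terms)):
            abstract_level[pair] += 1
        # Sentence level: gene and trait in the same sentence (stronger evidence).
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            for pair in product(mentions(sentence, gene_names),
                                mentions(sentence, trait_terms)):
                sentence_level[pair] += 1
    return abstract_level, sentence_level


# Invented toy input, for illustration only.
abstracts = [
    "CONSTANS regulates flowering time in long days. The circadian clock is involved.",
    "FLC is a repressor. Mutants show altered flowering time.",
]
abstract_level, sentence_level = cocitations(
    abstracts, ["CONSTANS", "FLC"], ["flowering time"]
)
```

In this toy input both genes are co-cited with the trait at abstract level, but only CONSTANS shares a sentence with it, which mirrors the split between all co-citations and sentence-level co-citations reported on the slide.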
36. Text-mining discussion
• TM method is flexible and can easily enhance data integration workflows and
knowledge networks
• TM is one of many evidence types in a knowledge network
• TM provides access to brand-new information that is not yet available in
structured databases
• Modest post-TM-filtering is required to retain high-quality relations
• TM for gene-phenotype adds 12k high-quality relations that were previously
absent in the knowledge network
38. Definition of gene-evidence network
1. Gene-evidence network: Biologically plausible paths (semantic motifs) starting with a Gene node and
ending with Evidence nodes, e.g. 57 semantic motifs were defined in the wheat network
2. Gene-evidence networks are extracted using the Metadata-based Graph Query Engine (Hindle 2012)
3. Evidence nodes can be part of one (high specificity) or many (low specificity) gene-evidence networks
• Gene-evidence network of
Gene X contains 5 nodes
• Neighbourhood network (n=3)
of Gene X contains 9 nodes
39. Searching gene-evidence networks for keywords
1. Knowledge graph indexed and searched for user search terms using Lucene
2. A proportion of nodes in the gene-evidence network can contain the search term
[Diagram: the gene-evidence network of Gene X matched against user search terms: auxin, cytokinin, strigolactone, CCD, MAX, subapical shoots, axillary branching, shoot branching, pathway]
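A rough picture of the keyword search step, using a tiny in-memory inverted index as a stand-in for Lucene. The node identifiers, node texts and function names are invented for illustration, and matching any one query term counts as a hit, which is an assumption.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(node_texts):
    """Map each token to the set of node ids whose text contains it."""
    index = defaultdict(set)
    for node_id, text in node_texts.items():
        for token in tokenize(text):
            index[token].add(node_id)
    return index


def matching_fraction(index, network_nodes, query):
    """Fraction of nodes in a gene-evidence network matching any query term."""
    if not network_nodes:
        return 0.0
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return len(hits & set(network_nodes)) / len(network_nodes)


# Invented toy evidence nodes for a hypothetical Gene X.
node_texts = {
    "pub1": "Auxin controls shoot branching",
    "go1": "response to cytokinin",
    "path1": "strigolactone biosynthesis pathway",
    "pub2": "seed dormancy in wheat",
}
index = build_index(node_texts)
fraction = matching_fraction(index, ["pub1", "go1", "path1", "pub2"], "auxin cytokinin")
```

The index is built once, so each query only needs set lookups and intersections over the nodes of a gene's evidence network.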
40. Gene scoring function (KNETScore)
1. Uses TF*IDF (Spärck Jones, 1972) to rank documents in the gene-evidence network by their relevance to a search term
2. Uses the specificity of documents to a gene (IGF: Inverse Gene
Frequency)
3. Uses the frequency of evidence concepts, normalised by size of gene-
evidence network (EDF: Evidence Document Frequency)
4. Calculates KNETScore (TFIDF*EDF*IGF) for every gene
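The four steps above can be sketched as follows. This is one illustrative reading of the slide, not the KnetMiner implementation: the function name `knet_score`, the logarithmic form of IGF, the use of the mean TF*IDF and all the numbers are assumptions.

```python
import math


def knet_score(tfidf_hits, genes_per_document, network_size, total_genes):
    """Toy KNETScore = TFIDF * EDF * IGF for a single gene.

    tfidf_hits         -- TF*IDF relevance of each evidence document that
                          matches the search term
    genes_per_document -- for each matching document (same order and length),
                          the number of gene-evidence networks it belongs to
    network_size       -- number of evidence nodes in this gene's network
    total_genes        -- number of genes in the knowledge graph
    """
    if not tfidf_hits or network_size == 0:
        return 0.0
    # Mean relevance of the matching documents to the search term.
    tfidf = sum(tfidf_hits) / len(tfidf_hits)
    # EDF: share of the gene-evidence network that matches the query.
    edf = len(tfidf_hits) / network_size
    # IGF: documents linked to few genes are more specific, so score higher.
    igf = sum(math.log(total_genes / n) for n in genes_per_document) / len(genes_per_document)
    return tfidf * edf * igf


# Two hypothetical genes with the same three matching documents' relevance:
# one has a small network with specific evidence, the other a large network
# with evidence shared by hundreds of genes.
score_small = knet_score([0.8, 0.7, 0.9], [2, 3, 2], network_size=10, total_genes=1000)
score_large = knet_score([0.8, 0.7, 0.9], [200, 300, 250], network_size=50, total_genes=1000)
```

Under these assumptions the gene with the smaller network and more specific evidence gets the higher score, which is the behaviour the ranking is designed to show.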
41. Gene ranking – Example
Two genes have a similar number of evidence documents containing the search terms…
… the left gene (score 5.72) scores higher than the right gene (score 2.71) because it has a smaller gene-evidence network and more specific evidence documents
42. Discussion – Candidate Gene Prioritization
• In a use-case study, KNETScore ranked the causal gene in 3rd place out of 75 genes
within a petal size QTL
• High overlaps between KnetMiner top 100 genes for “gibberellin” and “lipid”
search terms with curated gene lists
• Smart pre-indexing of the knowledge network has reduced the computation of
the score from O(2n(|V|+|E|)) to O(1)
• Many ways to improve the scoring function, e.g. using weights for different
evidence types, distance of evidence to gene and edge-attribute information
43. Summary
• Web application for very fast search of
large genome-scale knowledge graphs
• Ranking of candidate genes based on
knowledge mining
• Interactive visualisation of genome and
knowledge maps
• Facilitates knowledge discovery and
hypothesis generation
44. KnetMiner – Makes Gene Discovery Faster & Fun
International academic collaborations
Interest from industry and start-ups
http://knetminer.rothamsted.ac.uk/
45. KnetMiner 2.0 – BBSRC BBR (GCRF) Proposal
[Diagram: SNP-Seek (genetic diversity, novel traits, phenotype data) and Ensembl Plants (reference genomes, model species, homology data) feed into KnetMiner 2.0 (interactions, pathways, literature), which helps the scientist/breeder reach novel genes, better crops and faster discoveries. Progression: Data, Information, Knowledge, Insight.]
A pangenomic and network based approach to search for novel genes and clues to design better rice varieties.
46. Acknowledgements
John Doonan
Sergio Feingold
Martin Castellote
Uwe Scholz
Matthias Lange
Keywan Hassani-Pak
Ajit Singh
Marco Brandizi
Monika Mistry
Lisa Lill
Chris Rawlings
Dave Edwards
Philipp Bayer
Misha Kapushesky
Kevin Dialdestoro
@KnetMiner
Jan Taubert
Artem Lysenko
Matthew Hindle
Catherine Canevet
Ramil Mauleon
Kenneth McNally
Nickolai Alexandrov
Andy Law
http://www.nature.com/nrg/journal/v2/n5/fig_tab/nrg0501_370a_F1.html
The basic strategy behind mapping quantitative trait loci (QTL) is illustrated here for the density of hairs (trichomes) that occur on a plant leaf. a | Inbred parents that differ in the density of trichomes are crossed to form an F1 population with intermediate trichome density. b | An F1 individual is selfed to form a population of F2 individuals. c | Each F2 is selfed for six additional generations, ultimately forming several recombinant inbred lines (RILs). Each RIL is homozygous for a section of a parental chromosome. The RILs are scored for several genetic markers, as well as for the trichome density phenotype. In c, the arrow marks a section of chromosome that derives from the parent with low trichome density. The leaves of all individuals that have inherited that section of chromosome from the parent with low trichome density also have low trichome density, indicating that this chromosomal region probably contains a QTL for this trait.
GWAS results can also be complex to interpret
LD in Arabidopsis decays within 10 kb on average
https://www.ncbi.nlm.nih.gov/pubmed/17676040
http://www.oxfordjournals.org/nar/database/c
Worth = Have a positive impact on the biological outcome in the whole organism without producing negative side effects.
Significant SNPs are rarely located within the causal gene sequence…
Consider LD, closest gene is not always the correct candidate…
Consider confounding: the strongest association is not always the main causal effect…
Many phenotypes are complex, polygenic and the result of complex interactions at the cellular level
Linking genotype and phenotype is one of the greatest challenges in biology
SNP-Phenotype relations (122,919 relations) of significant SNPs (as defined by Ensembl, p-value<0.05?) linked to 107 phenotypes; on average 1,150 SNPs per phenotype.
SNP-Gene relations are based on genes in close proximity to SNPs (<1,000 bp; 96,047 relations)
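The proximity rule in this note could be implemented along the following lines. A minimal sketch with invented coordinates and identifiers: it assumes a SNP is linked to a gene when it lies inside the gene or less than 1,000 bp from either end, which is one possible reading of "close proximity".

```python
from collections import namedtuple

Gene = namedtuple("Gene", "gene_id chrom start end")
Snp = namedtuple("Snp", "snp_id chrom pos")


def snp_gene_relations(snps, genes, max_dist=1000):
    """Link each SNP to every gene on the same chromosome closer than max_dist bp."""
    relations = []
    for snp in snps:
        for gene in genes:
            if snp.chrom != gene.chrom:
                continue
            if snp.pos < gene.start:
                dist = gene.start - snp.pos   # SNP upstream of the gene
            elif snp.pos > gene.end:
                dist = snp.pos - gene.end     # SNP downstream of the gene
            else:
                dist = 0                      # SNP inside the gene
            if dist < max_dist:
                relations.append((snp.snp_id, gene.gene_id, dist))
    return relations


# Invented toy data, for illustration only.
genes = [Gene("geneA", "1", 5000, 7000), Gene("geneB", "1", 9000, 9500)]
snps = [Snp("snp1", "1", 4500), Snp("snp2", "1", 8100), Snp("snp3", "2", 5000)]
relations = snp_gene_relations(snps, genes)
```

The nested loop is fine for a toy example; at the scale quoted earlier (66,816 SNPs and 27,502 genes) a sorted or interval-based lookup per chromosome would be the more practical choice.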
How to integrate GWAS and biological interaction data
Using Ondex
Recent work:
Brought in differential gene expression data for Arabidopsis in KnetMiner.
Added capability in KnetMiner Network View (KnetMaps) to visualize this data as a new concept type: DGES and “differentially_expressed” relation.
The slide shows example of DGES: early vs late flowering in Arabidopsis.
Highlight text-mining
Add gene expression
Mention xrefs and that they can be collapsed
Scale: Because these networks combine a wide variety of entities derived by parsing large volumes of data, they are quite large. A typical plant knowledge network created using Ondex has around 100,000 genes linked amongst 500,000 concepts (which can be genes, proteins, SNPs or publications) by approximately 1 to 1.5 million interactions/relations.
KnetMiner was developed to enable users to interactively explore such large, genome-scale knowledge networks and extract relevant plausible pathways linking candidate genes to agronomic traits.
KnetMiner works by querying a species-encompassing knowledge network to mine its data and retrieve “relations” (i.e., links or connections) between the various entities (called “concepts”) within the network, such as genes, proteins, phenotypes, SNPs and publications.
This helps create a pathway that can be visually traversed to understand how various biological entities are inter-linked and how a gene might influence a specific phenotype/ trait.
KnetMiner is a web-based app to search, select and explore a vast amount of information. We try to make the software intuitive and fun to use and to embed the user in the discovery process.
Achim: “KnetMiner is a great example for how an easy-to-use software app should look like. Even I can figure it out without being a techie or having to read a thick manual.”
Real-time search results, as you type.
Example queries.
Useful dynamic Query suggester to help refine user’s search.
Add specific QTL regions to especially focus on.
Provide your own Gene List to search against.
Once we had all this new data in place and integrated in the knowledge networks in KnetMiner, a new challenge was to work on software tools/ components that enable interactive visualization and exploration of this new, highly dense data in a user-friendly manner.
We collaborated with a UK-based company to develop a new tool called Genomaps.js, a lightweight, touch-friendly tool that enables the interactive visualization of high-density SNP, QTL, GWAS and gene data.
After performing a search in KnetMiner and getting a list of ranked genes as output in Gene View, users can switch to Map View and explore the results in a chromosome-centric view using Genomaps.js.
Users can now select genes that lie within specific QTL regions and overlap with SNPs, and explore their network in Network View to look for plausible pathways and interactions.
In order to accommodate the new data and provide a more user-friendly way to explore these networks, the existing Network View was revamped and replaced with the new KnetMaps.js.
KnetMaps.js is a lightweight, touch-friendly tool that allows users to visualize and explore heterogeneous networks linking genes to proteins, publications, phenotypes, biological processes, etc.
Allows users to identify plausible pathways linking candidate genes to phenotypes within inter-linked networks.
Use case 1 (screencast): An example use case in Arabidopsis, using our newly integrated GWAS data, where we will be looking to discover candidate genes controlling flowering.
Showcasing Genomaps (Map View) in KnetMiner Arabidopsis.
We will specifically look at FLC gene expression, a complex trait controlled by multiple genes. We will search the KnetMiner Arabidopsis knowledge network for the query: ‘flowering FLC’ and identify and explore the network of genes controlling this underlying trait.
Search using keywords: flowering FLC or FT.
Filter Genomaps: Change p-value, show top 100 genes.
Select: FLC, SPA4, FRI, LD & launch KnetMaps.
Use case 2 (screencast): Pre-harvest sprouting:
Next use case will show how to use KnetMiner for mining genes related to grain color and PHS.
This example is based on an RNA-seq experiment by Andy Phillips (Rothamsted) to identify differences between white and red grain wheat and links to PHS.
Wheat white grain – more prone to sprouting.
Showcasing new Wheat instance with TGACv1 release & new KnetMaps.
Note:
PHS is the result of premature germination of grain in the ear and results in loss of bread-making quality.
Red grain colour is associated with increased dormancy and resistance to PHS.
Grain colour is due to proanthocyanidins (condensed tannins) in the testa.
Boosting queries found in the title
TO by Laurel Cooper
Illustration of a gene-evidence network as derived through “biologically plausible” semantic motifs. Blue nodes represent Gene concepts; red nodes are annotations such as GO, TO, EC, Pathway, Publication concepts. A path that goes via bold edges is valid (biologically meaningful path that allows annotations to be transferred to the seed gene). A path that goes via dashed edges is invalid. A gene-evidence network contains only filled nodes, whereas a gene-neighbourhood network would also contain the unfilled nodes.
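The difference between the two kinds of network can be pictured with a small graph traversal. This is a sketch only: representing semantic motifs as allowed sequences of edge types, the edge-type names, the node names and the dictionary-based graph are all assumptions for illustration, not the Metadata-based Graph Query Engine.

```python
def gene_evidence_network(graph, seed, motifs):
    """Nodes reachable from seed along paths whose edge types follow a motif.

    graph  -- {node: [(edge_type, neighbour), ...]}
    motifs -- set of edge-type tuples, each one a valid path from the seed gene
    """
    # Every prefix of a motif is a valid partial path.
    prefixes = {motif[:i] for motif in motifs for i in range(1, len(motif) + 1)}
    nodes = {seed}
    stack = [(seed, ())]
    while stack:
        node, path = stack.pop()
        for edge_type, neighbour in graph.get(node, []):
            new_path = path + (edge_type,)
            if new_path in prefixes:
                nodes.add(neighbour)
                stack.append((neighbour, new_path))
    return nodes


def neighbourhood(graph, seed, max_hops):
    """All nodes within max_hops of seed, regardless of edge type."""
    seen = {seed}
    frontier = {seed}
    for _hop in range(max_hops):
        frontier = {nb for node in frontier for _edge, nb in graph.get(node, [])} - seen
        seen |= frontier
    return seen


# Invented toy graph around a hypothetical GeneX.
graph = {
    "GeneX": [("encodes", "ProteinX"), ("published_in", "Paper1")],
    "ProteinX": [("participates_in", "PathwayP"), ("interacts_with", "ProteinY")],
    "ProteinY": [("participates_in", "PathwayQ")],
}
motifs = {("encodes", "participates_in"), ("published_in",)}
evidence = gene_evidence_network(graph, "GeneX", motifs)
nearby = neighbourhood(graph, "GeneX", 3)
```

Here the motif-based traversal stops at ProteinX's own pathway and the paper, while the 3-hop neighbourhood also pulls in the interacting protein and its pathway, annotations that would not be transferred to the seed gene.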
Our team also has strong expertise in the development of reusable research software.
Our flagship software output is KnetMiner.
The development of KnetMiner (fka QTLNetMiner) began in 2008 as part of a collaborative project with Steve H to prioritise candidate genes in willow QTL…
Since then it has slowly grown to become a unique resource for gene discovery in many different species.
50% of users from the UK and rest from other countries including…
Ajit has made significant contributions to the project that have helped to improve the usability and the user experience.
Marco who joined us last year is responsible for improving the interoperability of KnetMiner using more standardised technologies.
Monika was a sandwich student for one year and was a great help in maintaining our knowledge resources.
KnetMiner still has room to grow.
As part of a collaborative project with Sigrid, Gancho and scientists from IRRI, EBI and Uni Malaysia, we have submitted a grant application to the Bioinformatics and Biological Resource Fund to extend KnetMiner to rice research and breeding.
BBR is hugely competitive: over 300 expressions of interest have been submitted under the GCRF highlight. We might need to find alternative ways to continue the bioinformatics collaboration with IRRI.
Acknowledgements: Various collaborators who have worked with us on KnetMiner, including contributing partners from science, academia and industry.
Code available on GitHub: https://github.com/KeywanHP/KnetMiner