KnetMiner Overview Oct 2017

KnetMiner – Knowledge Network Miner
Keywan Hassani-Pak
http://knetminer.rothamsted.ac.uk/
@KnetMiner

About Rothamsted Research
• Rothamsted is the longest running agricultural
research station in the world (est. 1843)
• Strategic research to address global food
security demands
• Improving crops to be tolerant to drought, heat
and pests while still providing optimum
nutrition
• Interdisciplinary research from gene to field

Outline
1. Routes to candidate gene discovery
2. Building genome-scale knowledge networks
3. Overview and demo of KnetMiner
4. Extending networks with text-mining
5. Candidate gene prioritization
6. Discussion

Routes to candidate gene
discovery

Routes to candidate gene discovery
Many gene discovery routes can identify candidate genes for complex traits
[Diagram: four routes lead from a phenotype to candidate genes: (1) genetic methods, (2) gene expression, (3) research literature, (4) published data. Candidate genes then go through prioritization and validation to yield markers for diagnosis, breeding and GM.]

Quantitative Trait Locus (QTL) Mapping (route 1)
1. Development of an experimental population
2. Collection of phenotypic and genotypic data
3. Construction of a linkage map
4. Correlation of markers with the trait
5. Identification of QTL
A QTL region can encompass tens to hundreds of genes. How should they be prioritized?

Genome Wide Association Studies (GWAS)
[Figure: GWAS plots for AvrRpm1, FLC gene expression (FLC) and leaf number (LN22); Atwell et al., Nature 2010]
• GWAS results can be simple or complex to interpret
• Peaks can be diffuse, covering several hundred kb without a clear centre
• Causal polymorphisms do not always show the strongest association

Gene expression analysis (route 2)
[Figure: expression of genes across different tissues or genotypes and across time points after infection or treatment]
• Gene expression studies can be complex to interpret
• They yield 100s to 1000s of differentially expressed genes that are somehow related to the phenotype
• What are the key pathways leading to the observed phenotypes?

Text Mining - Trait and Gene Functions (route 3)
• Publications (free text) are the most up-to-date resource for information
• Finding sentences that link phenotypes (flowering time) and gene function (circadian clock) to genes (CONSTANS)
• Term variability and ambiguity can produce missing or false associations

Life Sciences Databases (route 4)
• Plethora of public life sciences databases in various formats
• Databases are constantly growing in size and content
• Challenging to keep up to date with the growing body of knowledge
• 1500+ databases published in NAR

Which associations (genes) are worth following up?
Often a highly subjective decision
Evaluation of all available information is expensive
How is genotype translated to phenotype?
Often involves direct and indirect interactions
Data integration and knowledge discovery is technically challenging

Building genome-scale
knowledge networks

Biological knowledge network/graph
Genotype
• QTL
• GWAS
Omics
• Transcriptomics
• Proteomics
• Metabolomics
Phenotype
• Disease
• Development
• Stress tolerance
Biological Knowledge Network
• Prior knowledge
• Structured, unstructured data
• Cross-species data
Integration

The approach is generic and works similarly for other species

Ondex – Data Integration Platform
• Free and open source
• Data warehousing using a graph-database
• Platform to integrate public and private
datasets in various formats
• Provides a GUI, CLI, APIs and workflows for
reproducible data integration
Ondex
www.ondex.org

Let’s start with some GWAS data…
http://plants.ensembl.org/biomart
Example Arabidopsis
#SNP=66,816 | #Gene=27,502 | #Phenotype=107

… transform into a network
(SNP) - associated - (Phenotype)
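A minimal sketch of this transformation, assuming the GWAS export has already been reduced to (SNP, phenotype) pairs. The function name, the `Edge` tuple and the row identifiers are hypothetical and are not part of Ondex or KnetMiner.

```python
from collections import namedtuple

# Hypothetical edge record: source node, relation type, target node
Edge = namedtuple("Edge", ["source", "relation", "target"])

def gwas_rows_to_network(rows):
    """Turn (snp_id, phenotype) rows into typed nodes and 'associated' edges."""
    nodes = {}     # node id -> node type
    edges = set()  # a set, so repeated rows collapse into a single edge
    for snp_id, phenotype in rows:
        nodes[snp_id] = "SNP"
        nodes[phenotype] = "Phenotype"
        edges.add(Edge(snp_id, "associated", phenotype))
    return nodes, edges

# Made-up rows for illustration only
rows = [
    ("snp_001", "flowering time"),
    ("snp_002", "flowering time"),
    ("snp_002", "leaf number"),
    ("snp_001", "flowering time"),  # duplicate row
]
nodes, edges = gwas_rows_to_network(rows)
```

Keeping node types and relation types explicit is what later allows other datasets (interactions, expression, literature) to be attached to the same nodes.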

Biological interaction datasets
http://thebiogrid.org

… add biological interactions
(SNP) - associated - (Phenotype)

… add differential gene expression data
early vs late flowering

… add other linked data
• Gene-GO
• Gene-Phenotype
• Gene knock-out or overexpression
• Text mining publications
• Gene-Publication
• Gene-Pathway
• Gene-Expression
• Protein-Small Molecule
• Homology to other species
>800k nodes
>3 million edges
Genome-Scale Knowledge Network (GSKN)

Same principles for other species
Knowledge graph of LRRK2 human gene

How to search and interpret too much information?
• Methods needed to evaluate millions of
relationships in knowledge network, prioritize
genes and extract relevant subnetworks
• Interactive and exploratory tools needed to
enable knowledge discovery and decision
making
• Interpretation should be the task of domain
experts i.e. biologists!

Overview and Demo of
KnetMiner

KnetMiner System Overview
[Architecture diagram: the client (web browser with JavaScript views) exchanges HTML, JSON, XML and images over HTTP via Ajax with servlets and JSP pages on Apache Tomcat; these connect through a Java socket to a multithreaded Java server, which accesses the knowledge graph DB through the Ondex API]
Client
• Compatibility with all major web browsers
• Based on D3.js, Cytoscape.js, Node.js
• Interactive and touch-enabled
Server
• Fast and scalable multithreaded Java server
• Pre-indexing of knowledge graph
• Scoring and information extraction

KnetMiner UI Overview
Search Select Explore

Google-like search interface
Search knowledge graph using trait-
based keywords
Real-time user feedback and query
suggestions
Trait related
keywords
Query term
suggestions

KnetMiner Map View (GenoMaps)
New touch-friendly web
App for Map View in
KnetMiner
Visualize genes, SNPs, QTL and GWAS data.
Select genes that lie within QTL regions and overlap with SNPs,
and explore their network
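The selection step can be sketched as a simple interval test. This is an illustration under assumed inputs (tuples of chromosome and coordinates), not the GenoMaps implementation; all names and coordinates are hypothetical.

```python
def genes_in_qtl_with_snps(genes, qtl, snps):
    """Return names of genes inside the QTL interval that overlap at least one SNP.

    genes: list of (name, chrom, start, end)
    qtl:   (chrom, start, end)
    snps:  list of (chrom, position)
    """
    q_chrom, q_start, q_end = qtl
    selected = []
    for name, chrom, start, end in genes:
        # Gene must lie entirely within the QTL interval
        inside = chrom == q_chrom and start >= q_start and end <= q_end
        if not inside:
            continue
        # Gene must contain at least one SNP position
        if any(s_chrom == chrom and start <= pos <= end for s_chrom, pos in snps):
            selected.append(name)
    return selected

# Made-up coordinates for illustration only
genes = [
    ("geneA", "1", 100, 200),
    ("geneB", "1", 300, 400),
    ("geneC", "2", 100, 200),
    ("geneD", "1", 900, 1000),
]
qtl = ("1", 50, 500)
snps = [("1", 150), ("2", 150), ("1", 950)]
selected = genes_in_qtl_with_snps(genes, qtl, snps)
```

Here only geneA is selected: geneB has no SNP, geneC is on another chromosome and geneD lies outside the QTL.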

KnetMiner Network View (KnetMaps)
Touch-friendly web App for
Network View in KnetMiner
Explore networks linking
genes to proteins, SNPs,
phenotypes, publications,
etc.

Extending knowledge
networks with text-mining

Text-mining workflow in Ondex
• Ondex plugins to extract structured information from unstructured free
text
• Developed workflows to enrich knowledge networks with novel links
using the scientific literature
Workflow steps:
• Import: Ondex Graph, PubMed, Ontology, Tabular
• Mapping: NER method, Concept Class
• Transformer: Publication, Abstract, Sentence
• Filter: Relation Type, Attribute Value, Unconnected
• Export: OXL, RDF, JSON
Hassani-Pak et al., JIB 2010

Ondex text-mining method
Input data
• 27,416 Arabidopsis gene names from Phytozome
• 52,561 Abstracts from PubMed that contain Arabidopsis
• 22,201 curated citations from TAIR
• 1,349 Trait Ontology terms from Planteome
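[Diagram: text mining links concepts A (Gene) and B (TO term) to the publications x and y in which they occur (occurs_in), alongside curated published_in links, and condenses these into a weighted association network between A and B; example edge attributes: IP=1.7; M=1.2; N=2]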
Hassani-Pak et al., JIB 2010
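The co-citation idea can be illustrated with a small sketch that counts gene/trait pairs mentioned in the same abstract and in the same sentence. It uses plain case-insensitive string matching and a naive sentence splitter as stand-ins; the actual Ondex NER method and the edge weights shown on the slide (IP, M, N) are not reproduced here, and the example sentences are made up.

```python
import re
from collections import Counter

def cooccurrences(abstracts, gene_names, trait_terms):
    """Count gene/trait pairs co-mentioned per abstract and per sentence."""
    def mentions(text, vocabulary):
        # Whole-word, case-insensitive match of each vocabulary entry
        return {term for term in vocabulary
                if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)}

    abstract_level, sentence_level = Counter(), Counter()
    for abstract in abstracts:
        for gene in mentions(abstract, gene_names):
            for trait in mentions(abstract, trait_terms):
                abstract_level[(gene, trait)] += 1
        # Naive sentence split on terminal punctuation followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            for gene in mentions(sentence, gene_names):
                for trait in mentions(sentence, trait_terms):
                    sentence_level[(gene, trait)] += 1
    return abstract_level, sentence_level

abstracts = [
    "CONSTANS regulates flowering time. The circadian clock is involved.",
    "FLC is a repressor. Flowering time varies among accessions.",
]
abstract_level, sentence_level = cooccurrences(
    abstracts, ["CONSTANS", "FLC"], ["flowering time"])
```

In this toy input both genes co-occur with the trait at abstract level, but only CONSTANS does so within a single sentence, which is the stricter evidence level.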

Text-mining output
These steps connect 5,553 Arabidopsis genes to 409 TO terms
based on 18,341 co-citations (12,190 at sentence level)

Text-mining discussion
• TM method is flexible and can easily enhance data integration workflows and
knowledge networks
• TM is one of many evidence types in a knowledge network
• TM provides access to brand-new information that is not yet available in
structured databases
• Modest post-TM-filtering is required to retain high-quality relations
• TM for gene-phenotype adds 12k high-quality relations that were previously
absent in the knowledge network

Definition of gene-evidence network
1. Gene-evidence network: Biologically plausible paths (semantic motifs) starting with a Gene node and
ending with Evidence nodes, e.g. 57 semantic motifs were defined in the wheat network
2. Gene-evidence networks are extracted using the Metadata-based Graph Query Engine (Hindle 2012)
3. Evidence nodes can be part of one (high specificity) or many (low specificity) gene-evidence networks
• Gene-evidence network of
Gene X contains 5 nodes
• Neighbourhood network (n=3)
of Gene X contains 9 nodes
X
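A rough sketch of motif-based extraction, assuming a semantic motif can be simplified to a sequence of node types starting at a Gene. This is not the Metadata-based Graph Query Engine; the function, the graph and the motifs are hypothetical, and relation types are ignored for brevity.

```python
def gene_evidence_network(gene, node_types, neighbours, motifs):
    """Collect all nodes on paths from `gene` whose node types match a motif.

    node_types: node id -> type
    neighbours: node id -> list of adjacent node ids
    motifs:     list of node-type sequences, each starting with "Gene"
    """
    network = set()

    def walk(path, remaining_types):
        if not remaining_types:
            network.update(path)  # complete motif: keep every node on the path
            return
        for nbr in neighbours.get(path[-1], ()):
            if nbr not in path and node_types[nbr] == remaining_types[0]:
                walk(path + [nbr], remaining_types[1:])

    for motif in motifs:
        if node_types[gene] == motif[0]:
            walk([gene], list(motif[1:]))
    return network

# Made-up toy graph for illustration only
node_types = {"X": "Gene", "P1": "Protein", "P2": "Protein",
              "PW1": "Pathway", "PUB1": "Publication", "G2": "Gene"}
neighbours = {"X": ["P1", "PUB1", "G2"], "P1": ["PW1", "P2"]}
motifs = [("Gene", "Publication"), ("Gene", "Protein", "Pathway")]
network = gene_evidence_network("X", node_types, neighbours, motifs)
```

P2 and G2 are direct or near neighbours of X but are excluded because no motif covers them, which is the difference between a gene-evidence network and a plain neighbourhood network.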

Searching gene-evidence networks for keywords
1. Knowledge graph indexed and searched for user search terms using Lucene
2. A proportion of nodes in the gene-evidence network can contain the search term
[Diagram: gene-evidence network of Gene X shown next to the user search terms; labels in the figure: auxin, cytokinin, strigolactone, CCD, MAX, subapical shoots, axillary branching, shoot branching, pathway]
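The two points above can be sketched with a tiny inverted index. KnetMiner uses Lucene for this; the Python below is only a stand-in that handles single-token terms, and the node texts are made up.

```python
from collections import defaultdict

def build_index(node_text):
    """Map each lowercase token to the set of node ids whose text contains it."""
    index = defaultdict(set)
    for node_id, text in node_text.items():
        for token in text.lower().split():
            index[token].add(node_id)
    return index

def matching_fraction(index, gene_network, term):
    """Fraction of nodes in a (non-empty) gene-evidence network that mention `term`."""
    hits = index.get(term.lower(), set()) & gene_network
    return len(hits) / len(gene_network)

# Made-up node texts for illustration only
node_text = {
    "pub1": "auxin controls shoot branching",
    "pw1": "strigolactone pathway",
    "go1": "response to auxin",
    "pub2": "cytokinin signalling",
}
index = build_index(node_text)
network = {"pub1", "pw1", "go1", "pub2"}
auxin_fraction = matching_fraction(index, network, "Auxin")
```

Because the index is built once, each query reduces to a dictionary lookup and a set intersection rather than a scan over the graph.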

Gene scoring function (KNETScore)
1. Uses TF*IDF (Spärck Jones, 1972) to rank documents in the gene-evidence
network by their relevance to a search term
2. Uses the specificity of documents to a gene (IGF: Inverse Gene
Frequency)
3. Uses the frequency of evidence concepts, normalised by size of gene-
evidence network (EDF: Evidence Document Frequency)
4. Calculates KNETScore (TFIDF*EDF*IGF) for every gene
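The slide gives only the product TFIDF*EDF*IGF, so the sketch below fills in the details with explicit assumptions: IGF is taken as log(total genes / genes linked to the document), EDF as the number of matching evidence documents divided by the size of the gene-evidence network, and per-document values are summed. These choices and all names are hypothetical, not the published KNETScore definition.

```python
import math

def knet_score(matching_docs, network_size, total_genes):
    """Illustrative gene score under the assumptions stated above.

    matching_docs: list of (tfidf, genes_linked) for evidence documents
                   that contain the search term
    network_size:  number of nodes in the gene-evidence network
    total_genes:   number of genes in the knowledge network
    """
    if not matching_docs or network_size == 0:
        return 0.0
    # Assumed EDF: matching evidence normalised by network size
    edf = len(matching_docs) / network_size
    # Assumed IGF: documents linked to few genes are more specific
    weighted = sum(tfidf * math.log(total_genes / genes_linked)
                   for tfidf, genes_linked in matching_docs)
    return weighted * edf

# Two hypothetical genes with the same number of matching documents
score_a = knet_score([(1.0, 2), (1.0, 2)], network_size=10, total_genes=1000)
score_b = knet_score([(1.0, 50), (1.0, 50)], network_size=40, total_genes=1000)
```

Under these assumptions the gene with the smaller network and the more specific documents (gene A) receives the higher score.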

Gene ranking – Example
Two genes have a similar number of evidence documents containing the search terms, but the left gene scores higher (5.72 vs 2.71) because it has a smaller gene-evidence network and more specific evidence documents.

Discussion – Candidate Gene Prioritization
• In a use-case study, KNETScore ranked the causal gene 3rd out of 75 genes
within a petal size QTL
• High overlap between the KnetMiner top 100 genes for the “gibberellin” and “lipid”
search terms and curated gene lists
• Smart pre-indexing of the knowledge network has reduced the computation of
the score from O(2n(|V|+|E|)) to O(1)
• Many ways to improve the scoring function, e.g. using weights for different
evidence types, distance of evidence to gene and edge-attribute information

Summary
• Web application for very fast search of
large genome-scale knowledge graphs
• Ranking of candidate genes based on
knowledge mining
• Interactive visualisation of genome and
knowledge maps
• Facilitates knowledge discovery and
hypothesis generation

KnetMiner – Makes Gene Discovery Faster & Fun
International academic collaborations
Interest from industry and start-ups
http://knetminer.rothamsted.ac.uk/

KnetMiner 2.0 – BBSRC BBR (GCRF) Proposal
[Diagram: Data → Information → Knowledge → Insight]
• SNP-Seek: genetic diversity, novel traits, phenotype data
• KnetMiner 2.0: interactions, pathways, literature
• Scientist/Breeder: novel genes, better crops, faster discoveries
• Ensembl Plants: reference genomes, model species, homology data
A pangenomic and network-based approach to search for novel genes and clues to design better rice varieties.

Acknowledgements
John Doonan
Sergio Feingold
Martin Castellote
Uwe Scholz
Matthias Lange
Keywan Hassani-Pak
Ajit Singh
Marco Brandizi
Monika Mistry
Lisa Lill
Chris Rawlings
Dave Edwards
Philipp Bayer
Misha Kapushesky
Kevin Dialdestoro
Jan Taubert
Artem Lysenko
Matthew Hindle
Catherine Canevet
Ramil Mauleon
Kenneth McNally
Nickolai Alexandrov
Andy Law
