Three TSRI Tools for capturing, sharing, and applying community knowledge
Benjamin Good
The Scripps Research Institute
@bgood
Outline
• Gene wiki, quick recap, update
• Introducing:
– http://knowledge.bio
– http://biobranch.org
Gene Wiki (on Wikipedia)
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Wikidata
is a
regulates
Interacts with
Protein
Glycoprotein
Neural development
VLDL receptor
Amyloid precursor protein
Property:P31
Property:P128
Property:P129
Q8054
Q187126
Q1345738
Q1979313
Q423510
Q414043
Reelin
http://www.wikidata.org/wiki/Q414043
A computable Gene (& Disease & Drug) Wiki
Structured data
Here now
Soon
Downstream
(but exciting potential..)
?
? ?
Wikipedia(s)
Status Update
• Genes, diseases (and, any minute now, drugs) are in Wikidata
• Demonstrations of incorporating this content into Wikipedia are functional
• We’ve been slowed a little by Wikidata governance policies (they temporarily blocked our bot)
Wikidata activities
• YOU can help!
• https://www.wikidata.org/wiki/User:ProteinBoxBot
Join one of these discussions and voice your support
Outline
• Gene wiki
• knowledge.bio
• biobranch.org
Knowledge.bio
• Provides a concept-centric view of the scientific literature.
– You search and interact with concepts rather than documents.
• Main purpose is hypothesis generation
• 2 data sources mined from PubMed
– 70 million explicit semantic relations (‘triples’)
– 200 million implicit gene-disease associations
http://knowledge.bio
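A concept-centric lookup over triples can be sketched roughly as follows. This is a minimal illustration, not the knowledge.bio implementation: the triples, the predicate names, and the helpers `build_index` and `related_concepts` are all hypothetical.

```python
from collections import defaultdict

# Toy triples invented for illustration; not taken from the real database.
TRIPLES = [
    ("CYP2R1", "AFFECTS", "vitamin D3"),
    ("7-dehydrocholesterol", "PRECURSOR_OF", "vitamin D3"),
    ("DHCR7", "ASSOCIATED_WITH", "Smith-Lemli-Opitz syndrome"),
]


def build_index(triples):
    """Map each concept to every triple in which it is subject or object."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((subject, predicate, obj))
        index[obj].append((subject, predicate, obj))
    return index


def related_concepts(index, concept, predicate=None):
    """Concepts sharing a triple with `concept`, optionally filtered by predicate."""
    related = set()
    for subject, pred, obj in index.get(concept, []):
        if predicate is not None and pred != predicate:
            continue
        related.add(obj if subject == concept else subject)
    return related


index = build_index(TRIPLES)
print(sorted(related_concepts(index, "vitamin D3")))
```

Indexing by concept rather than by document is what lets a search start from a gene or disease name and fan out to its neighbours.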
Explicit relations view
Search for concept
View related concepts
(67 results)
Filter results
View text where triple was extracted
Diseases implicitly related to queried concept: CYP2R1
Concepts linking CYP2R1 to Smith-Lemli-Opitz Syndrome
Table views complemented by a Network view for taking notes..
Network (“Map”) view
Cytoscape.js canvas
Auto and manual layout
Save Map as local text file
Load saved map
Step 1: find candidate relation
What new diseases might be related to CYP2R1?
Implicit prediction
Step 2: find linking concepts
How is CYP2R1 related to SLO syndrome?
Step 3: Start building a hypothesis to explain the predicted relation
Do CYP2R1 and DHCR7 participate in a process related to SLO syndrome?
Explicit relations view
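The three steps amount to finding a chain of explicit relations that connects two concepts. One way to sketch that, assuming triples are treated as undirected edges, is a breadth-first search. The toy triples and the function `shortest_concept_path` are hypothetical and only illustrate the idea.

```python
from collections import deque

# Toy triples for illustration only; predicate names are made up.
TOY_TRIPLES = [
    ("CYP2R1", "AFFECTS", "vitamin D3"),
    ("7-dehydrocholesterol", "PRECURSOR_OF", "vitamin D3"),
    ("Smith-Lemli-Opitz syndrome", "ASSOCIATED_WITH", "7-dehydrocholesterol"),
]


def shortest_concept_path(triples, start, goal):
    """Breadth-first search over triples treated as undirected edges."""
    graph = {}
    for subject, _predicate, obj in triples:
        graph.setdefault(subject, set()).add(obj)
        graph.setdefault(obj, set()).add(subject)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of relations connects the two concepts


path = shortest_concept_path(TOY_TRIPLES, "CYP2R1", "Smith-Lemli-Opitz syndrome")
print(" -> ".join(path))
```

The intermediate concepts on the returned path are the candidates a user would inspect when building a hypothesis.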
Warning, may prove addictive..
Next steps for knowledge.bio
• Enhanced community sharing
• Integration with http://ndexbio.org from the Cytoscape consortium
• Allow user actions to feed back into underlying NLP systems
• Include access to other structured knowledge sources, e.g. Gene Ontology
Outline
• Gene wiki
• knowledge.bio
• biobranch.org
Breast cancer prognosis:
10 year survival? (Yes / No)
Inferring class predictors
find patterns
make predictions on new samples
van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.
inferring survival predictors
find patterns
make predictions
1) select genes
2) infer predictor from data (e.g. decision tree, SVM, etc.)
Out of the 25,000+ genes, which small set works together the best?
10 year survival? (Yes / No)
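Step 1 can be illustrated with the simplest possible filter: rank each gene by how far its mean expression differs between the two outcome classes. This is only a sketch with made-up gene names and values; the function `rank_genes` is hypothetical, and a univariate filter like this ignores how genes work together, which is the harder question the slide asks.

```python
from statistics import mean


def rank_genes(samples, labels):
    """Rank genes by absolute difference in mean expression between classes."""
    scores = {}
    for gene in samples[0]:
        survived = [s[gene] for s, y in zip(samples, labels) if y]
        died = [s[gene] for s, y in zip(samples, labels) if not y]
        scores[gene] = abs(mean(survived) - mean(died))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression values; labels mean "survived 10 years".
samples = [
    {"GENE_A": 9.0, "GENE_B": 5.0, "GENE_C": 2.0},
    {"GENE_A": 8.0, "GENE_B": 5.5, "GENE_C": 2.5},
    {"GENE_A": 3.0, "GENE_B": 5.2, "GENE_C": 6.0},
    {"GENE_A": 2.0, "GENE_B": 4.8, "GENE_C": 7.0},
]
labels = [True, True, False, False]

top_genes = rank_genes(samples, labels)[:2]
print(top_genes)
```

The selected genes would then be handed to step 2, for example as the candidate features of a decision tree.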
Problem: gene selection instability
instability: different methods and different datasets produce different gene sets for the same phenotype [1]
[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
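One common way to put a number on this instability is the Jaccard overlap between the gene sets chosen in two runs. The sketch below assumes that measure; the gene names are placeholders and `jaccard` is a hypothetical helper, not something taken from the cited paper.

```python
def jaccard(genes_a, genes_b):
    """Size of the intersection divided by size of the union (1.0 = identical sets)."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder signatures selected from two different datasets.
signature_1 = ["GENE_A", "GENE_B", "GENE_C"]
signature_2 = ["GENE_C", "GENE_D", "GENE_E"]

print(jaccard(signature_1, signature_2))
```

A value near 0 means the two signatures share almost no genes, which is the situation the slide describes.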
Problem: the validation gap
training data, test data, validation
validation: predictive signatures often perform worse on independent data created for validation.
Photograph by Richard Hallman, National Geographic Adventure Blog
find patterns
make predictions
Adding prior knowledge to the discovery algorithm
<10 yr
survival
>10 yr
survival
Ex.) Network guided forests
Use protein interaction network to find good gene combinations
Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
But most knowledge is not structured
[Chart: number of articles added to PubMed; axis ticks from 500,000 to 1,000,000]
>100 publications/hour
>194715 publications linked to “breast cancer” since 2000
http://tinyurl.com/brsince2000
How can we use unstructured knowledge to improve predictors?
Need a distributed network of intelligent systems that are good at reading and hypothesizing
Like you and your friends
A game with a purpose: The Cure
• http://genegames.org/cure
• http://games.jmir.org/2014/2/e7/
• The Cure: Design and Evaluation of a Crowdsourcing Game for Gene Selection for Breast Cancer Survival Prediction. JMIR Serious Games. PMID: 25654473
People wanted to control the trees
http://biobranch.org
Branch Goals
• Provide an easy, visual way for non-programmers to use large datasets to answer questions
• Construct libraries of manually crafted predictive models
• Use the collected models to generate ensemble predictors that incorporate the knowledge of the users
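The third goal could be realised in many ways; the simplest is a majority vote over the collected models. The sketch below assumes that approach and is not a description of how Branch builds its ensembles. The three toy models, their thresholds, and `majority_vote` are hypothetical.

```python
from collections import Counter


def majority_vote(models, sample):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


# Three hypothetical user-built models, each a function from sample to label.
models = [
    lambda s: "relapse" if s["age"] < 34.5 else "no relapse",
    lambda s: "relapse" if s["GENE_A"] > 5.0 else "no relapse",
    lambda s: "relapse" if s["GENE_B"] > 5.0 else "no relapse",
]

sample = {"age": 30, "GENE_A": 7.0, "GENE_B": 2.0}
print(majority_vote(models, sample))
```

Using an odd number of models avoids ties in the two-class case.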
Branch walkthrough:
Choose a dataset
Select evaluation option
Tree Builder
Split node builder
Each button is a different way to compose a split node in your decision tree
Split node
Predictions at leaf nodes
100% correct
56% accurate
View data, adjust split point
If age less than 34.5, predict relapse
If greater, predict no relapse
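A single-feature split and the accuracy reported at each leaf can be sketched as below. The patient rows are invented, so the numbers do not reproduce the percentages on the slide, and `split_accuracy` is a hypothetical helper rather than Branch code.

```python
def split_accuracy(rows, feature, threshold, below_label, above_label):
    """Return (accuracy below threshold, accuracy at or above threshold)."""
    below = [r for r in rows if r[feature] < threshold]
    above = [r for r in rows if r[feature] >= threshold]

    def accuracy(group, label):
        if not group:
            return None  # empty leaf: accuracy is undefined
        return sum(r["outcome"] == label for r in group) / len(group)

    return accuracy(below, below_label), accuracy(above, above_label)


# Invented rows for illustration.
rows = [
    {"age": 28, "outcome": "relapse"},
    {"age": 31, "outcome": "relapse"},
    {"age": 40, "outcome": "no relapse"},
    {"age": 45, "outcome": "no relapse"},
    {"age": 52, "outcome": "relapse"},
    {"age": 60, "outcome": "no relapse"},
]

print(split_accuracy(rows, "age", 34.5, "relapse", "no relapse"))
```

Moving the threshold and re-running the function mirrors the "adjust split point" control.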
Single feature splits
Pick from genes or clinical features
Type-ahead search
Statistical ranker
Custom feature combination
BRCA2
TOP2B
BRCA2 + TOP2B
Allows user to use a manually composed linear combination of other features
Eg: 21 Gene Signature from OncoType Dx
Proliferation
Ki67
STK15
Survivin
CCNB1 (cyclin B1)
MYBL2
Invasion
MMP11
CTSL2
HER2
GRB7
HER2
Estrogen
ER
PGR
BCL2
SCUBE2
GSTM1
Reference
ACTB(b-actin)
GAPDH
RPLPO
GUS
TFRC
Recurrence Score Algorithm
1. HER2 group score = 0.9 × GRB7 + 0.1 × HER2 (if the result is less than 8, then the HER2 group score is considered 8)
2. ER group score = (0.8 × ER + 1.2 × PGR + BCL2 + SCUBE2) ÷ 4
3. Proliferation group score = (Survivin + KI67 + MYBL2 + CCNB1 [the gene encoding cyclin B1] + STK15) ÷ 5 (if the result is less than 6.5, then the proliferation group score is considered 6.5)
4. Invasion group score = (CTSL2 [the gene encoding cathepsin L2] + MMP11 [the gene encoding stromelysin 3]) ÷ 2
RSU = 0.47 × HER2 group score − 0.34 × ER group score + 1.04 × Proliferation group score + 0.10 × Invasion group score + 0.05 × CD68 − 0.08 × GSTM1 − 0.07 × BAG1
*A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer
CD68
BAG1
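The recurrence score arithmetic above is itself a hand-composed linear combination of features, so it can be written out directly. The sketch below follows the formulas on this slide only: it assumes each input is already a reference-normalised expression value, it stops at the unscaled score (RSU), and the function name `recurrence_score_unscaled` and the dictionary layout are hypothetical.

```python
GENES = [
    "GRB7", "HER2", "ER", "PGR", "BCL2", "SCUBE2",
    "Survivin", "KI67", "MYBL2", "CCNB1", "STK15",
    "CTSL2", "MMP11", "CD68", "GSTM1", "BAG1",
]


def recurrence_score_unscaled(expr):
    """Compute RSU from a dict of gene name -> normalised expression value."""
    # Group scores, with the floors stated on the slide (8 and 6.5).
    her2_group = max(0.9 * expr["GRB7"] + 0.1 * expr["HER2"], 8.0)
    er_group = (0.8 * expr["ER"] + 1.2 * expr["PGR"]
                + expr["BCL2"] + expr["SCUBE2"]) / 4
    proliferation_group = max(
        (expr["Survivin"] + expr["KI67"] + expr["MYBL2"]
         + expr["CCNB1"] + expr["STK15"]) / 5,
        6.5,
    )
    invasion_group = (expr["CTSL2"] + expr["MMP11"]) / 2

    return (0.47 * her2_group
            - 0.34 * er_group
            + 1.04 * proliferation_group
            + 0.10 * invasion_group
            + 0.05 * expr["CD68"]
            - 0.08 * expr["GSTM1"]
            - 0.07 * expr["BAG1"])


# Made-up input: every gene at the same value, purely to exercise the formula.
example = dict.fromkeys(GENES, 10.0)
print(round(recurrence_score_unscaled(example), 2))
```

In Branch terms, the returned value would be the custom feature that a split node then thresholds.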
Classifier nodes
[Diagram: a Classifier Node splitting samples into Class A and Class B]
Use a trained predictive model, such as a Support Vector Machine, as a node in your tree
Use
Build
biobranch tree nodes
[Diagram: a Branch decision tree used as a node, splitting samples into Class A and Class B]
Use a previously constructed tree as node
Visually set decision boundary nodes
[Diagram: a Visual split node splitting samples into Class A and Class B]
Creating a visual split
Draw polygon
Add to tree
Select feature
Select feature
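Once two features are chosen and a polygon is drawn, the split reduces to asking whether each sample's (x, y) point falls inside the polygon. A standard way to answer that is the ray-casting test, sketched below. This is an assumption about how such a split could be evaluated, not a description of the Branch code; `point_in_polygon` and the square polygon are illustrative.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# A square drawn over two selected features (vertices in drawing order).
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]

print(point_in_polygon(2.0, 2.0, square))
print(point_in_polygon(5.0, 2.0, square))
```

Samples inside the polygon go down one branch of the tree and the rest go down the other.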
Teach students about overfitting..
Tree Builder
Evaluation panel
View training and testing sets
Performance metrics
Confusion matrix
ROC curve
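The numbers behind such a panel can be derived from a confusion matrix. The sketch below uses invented labels and hypothetical helper names; it computes the four counts, the accuracy, and the single (false positive rate, true positive rate) point that one fixed tree contributes to an ROC plot.

```python
def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def summary(counts):
    """Accuracy plus the (FPR, TPR) pair plotted on an ROC curve."""
    total = sum(counts.values())
    accuracy = (counts["TP"] + counts["TN"]) / total
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    fpr = counts["FP"] / (counts["FP"] + counts["TN"])
    return accuracy, fpr, tpr


# Invented test-set labels for illustration.
actual = ["relapse", "relapse", "no relapse", "no relapse", "no relapse"]
predicted = ["relapse", "no relapse", "no relapse", "relapse", "no relapse"]

counts = confusion_matrix(actual, predicted, positive="relapse")
print(counts, summary(counts))
```

Computing the same summary on the training and the testing set side by side is what makes overfitting visible.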
Navigation
Save your tree
New tree
Tree Collection
Open and edit shared tree
Search trees you create and trees shared with the community
Editing shared tree
Tracks which user created each node
Next steps
• More user testing
• More datasets
• Lots of users?
• Better models?
training data,
test data
validation
Even more information!
• Screencasts
• http://tinyurl.com/branch-cast
• Open source code
– https://bitbucket.org/sulab/biobranch
– https://bitbucket.org/starinformatics/gbk
– https://bitbucket.org/sulab/wikidatabots
Thanks
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833
Andra Waagmeester
Sebastian Burgstaller
Elvira Mitraka
Lynn Schriml
Gang Fu
Evan Bolton
Paul Pavlidis
Peter Robinson
Many WikiDatans
Richard Bruskiewich
http://starinformatics.com
Karthik Gangavarapu
Vyshakh Babji
Andrew Su
The Prince
of Crowdsourcing
Implicitome
Kristina Hettne, Leiden University
Contact:
bgood@scripps.edu
@bgood

2015 6 bd2k_biobranch_knowbio


Editor's Notes

  • #4 Active MCB community at WP had already developed ~650 gene articles. Can we accelerate this process through stub creation? In total, created 9000 new articles and edited 650 previously existing articles.
  • #24 a, Two-dimensional presentation of transcript ratios for 98 breast tumours. There were 4,968 significant genes across the group. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters. b, Selected clinical data for the 98 patients in a: BRCA1 germline mutation carrier (or sporadic patient), ER expression, tumour grade 3 (versus grade 1 and 2), lymphocytic infiltrate, angioinvasion, and metastasis status. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. The cluster below the yellow line consists of 36 tumours, of which 34 are ER negative (total 39 ER-negative) and 16 are carriers of the BRCA1 mutation (total 18). c, Enlarged portion from a containing a group of genes that co-regulate with the ER-α gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST. d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
  • #26 Main reason statistically is inadequate sample size and correlated data structure. (Xu 2010). Makes it difficult to trust the predictors when different genes appear every time.
  • #27 though progress is being made on this issue, e.g. Margolin showed very good agreement between cross-validation, test set, and validation performance for models submitted to Sage challenge.
  • #50 A pathological hallmark of SLO is increased levels of 7-dehydrocholesterol [PMID: 8259166]. 7-dehydrocholesterol is a precursor of vitamin D3, and defects in CYP2R1 are known to affect vitamin D3 levels [PMID: 15128933]. Thus, LWAS implicates CYP2R1 in SLO syndrome, since defects may potentially lead to 7-dehydrocholesterol accumulation.