The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Survival Prediction

Keynote presentation for the Rocky Bioinformatics conference 2013. It's about http://genegames.org/cure/

Published in: Technology, Health & Medicine

  1. THE CURE: A GAME WITH THE PURPOSE OF GENE SELECTION FOR BREAST CANCER SURVIVAL PREDICTION. Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su. The Scripps Research Institute. http://genegames.org/cure/ Rocky 2013
  2. A QUESTION. How would you get 150 PhD-level scientists to work together on the same problem? Without any money?
  3. TRAIL MAP. Games; Survival Prediction; The Cure
  4. WHY GAMES? It is estimated that 9 billion hours are spent playing Solitaire every year. Luis von Ahn, Google Tech Talk: Human Computation, 2006. (Shortly after receiving $500,000 "Genius Grant" for this work)
  5. Empire State Building: seven million hours of human labor. ONE YEAR OF SOLITAIRE = 1,285 EMPIRE STATE BUILDINGS
  6. 150 billion hours gaming each year. What if we could use a tiny fraction of that human effort to achieve another purpose? [Chart: Empire State Building, 7M hours; one year of Solitaire, 9B hours; one year of games, 150B hours] McGonigal J. Reality is broken: why games make us better and how they can change the world. New York: Penguin Press; 2011.
  7. PURPOSES (computer science and biology). Find objects inside images; Tag songs; Label all images on the Web; Rate image quality; Figure out how proteins fold; Teach computers English; Design RNA molecules; Build ontologies; Map connections between neurons; Link genes with diseases; Assemble genomes; Align DNA and protein sequences; Tag malaria parasites in blood smears; Develop better treatments for breast cancer
  8. GAMES WITH A PURPOSE. MOLT; The Cure
  9. TRAIL MAP. Games; Survival Prediction; The Cure
  10. INFERRING SURVIVAL PREDICTORS. Find patterns (10 year survival? No / Yes); make predictions on new samples (10 year survival? No / Yes). van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.
  11. INFERRING SURVIVAL PREDICTORS. Find patterns; make predictions (10 year survival? No / Yes). 1) Select genes: out of the 25,000+ genes, which small set works together the best? 2) Infer predictor from data (e.g. decision tree, SVM, etc.)
  12. PROBLEM: GENE SELECTION INSTABILITY. Instability: different methods, different datasets produce different gene sets for the same phenotype [1]. [1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
  13. PROBLEM: THE VALIDATION GAP. [Figure labels: training data, test data; validation] Validation: predictive signatures often perform worse on independent data created for validation. Photograph by Richard Hallman, National Geographic Adventure Blog
  14. ADDING PRIOR KNOWLEDGE TO THE DISCOVERY ALGORITHM. Find patterns; make predictions (<10 yr survival, >10 yr survival)
  15. EX.) NETWORK GUIDED FORESTS. Use network to find good gene combinations. Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
  16. BUT MOST KNOWLEDGE IS NOT STRUCTURED. [Chart: number of articles added to PubMed, y-axis 500,000 to 1,000,000] 112 publications/hour (37 more by the end of this talk). >160,000 publications linked to "breast cancer" since 2000. http://tinyurl.com/brsince2000
  17. HOW CAN WE USE UNSTRUCTURED KNOWLEDGE FOR GENE SELECTION? Need an intelligent system that is good at reading and hypothesizing. Like you.
  18. TRAIL MAP. Games; Survival Prediction; The Cure
  19. THE CURE. http://genegames.org/cure/
  20. Education level? Cancer knowledge? Biologist?
  21. PLAY = GENE SELECTION. Alternate turns picking a gene from a "board" of 25. [Screenshot labels: opponent's hand; your hand]
  22. SCORING. Score reflects accuracy of a decision tree created with just the selected genes on real training data. [Diagram label: Cure Server]
  23. PLAY WITH KNOWLEDGE: GENE ONTOLOGY
  24. PLAY WITH KNOWLEDGE: GENE RIFS
  25. YOU WIN!
  26. COMMUNITY BOARD VIEW, CHOOSE OPEN BOARD. [Screenshot labels: You beat this one; The community finished this board (e.g. 11 different players completed it); This board is still open]
  27. BOARDS
     • 25 genes each
     • Randomly selected from 1,250 genes that passed an unsupervised filter for minimum expression level and variance for a particular dataset [1],[2]
     • 4 different 100-board rounds completed, each with some overlap
     • 3,731 distinct genes used in the game
     [1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012)
     [2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)
  28. PLAYERS. 1,077 players registered (one year). http://io9.com/these-cool-games-let-you-do-real-life-science-486173006 [Chart: new player registrations by month and %PhD; degree categories: PhD, MD, MSc, BA, none, Other, Did not state; annotation: Sage DREAM7 challenge, game announcement]
  29. PLAYER DEMOGRAPHICS. [Charts: most recent degree (none, undergraduate, graduate_degree); Cancer knowledge? (no, ns, yes); Are you a biologist? (no, ns, yes)]
  30. GAMES PLAYED. • 9,904 games (non-training). [Charts: total games played per player; games played, top 20 players (bars labelled by degree: PhD, MD, MS)]
  31. GENE RANKINGS FROM GAMES. Find patterns; make predictions (<10 yr survival, >10 yr survival)
  32. GENE RANKINGS FROM GAMES
     • For each gene:
       1. O = number of times it appeared in a game (some genes occur on multiple boards, all boards are played multiple times, all occurrences are counted)
       2. S = number of times it was selected by a player
       3. F = S/O
     • Games can be filtered based on player data
     • We can estimate an empirical P value for each pair of values O, S
     • P reflects the probability of getting S or more selections by chance given O
     Examples (all games):
     • B-cell lymphoma 2 gene: O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001
     • Alanine and arginine rich domain containing protein: O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
  33. GENES SELECTED BY ALL PLAYERS. 9,904 GAMES. P < 0.001, 60 GENES. [Panels: top 10 enriched disease annotations (n genes; adj. P < 2.43e-06; background = 3,731 genes used in any game); top 10 genes] Wang, Jing, et al. "WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013." Nucleic Acids Research (2013).
  34. GENES SELECTED BY PEOPLE WITH PHDS AND WITH KNOWLEDGE OF CANCER. 2,373 GAMES. P < 0.001, 82 GENES ("Expert Gene Set"). [Panels: top 10 enriched disease annotations (n genes; adj. P < 5.76e-08); top 10 genes]
  35. GENES SELECTED BY PEOPLE WITHOUT PHDS, WITH NO KNOWLEDGE OF CANCER, THAT ARE NOT BIOLOGISTS. 3,607 GAMES. P < 0.001, 10 GENES. [Panel: top 10 genes] • Gene set not significantly enriched with any disease annotations
  36. SELF REPORTING SEEMED TO WORK...
  37. EVEN WITHOUT FILTERING, THE DATA CONTAINS THE KNOWLEDGE. • "All Players" still contained significant cancer signal.
  38. PROBLEM: GENE SELECTION INSTABILITY. Instability: different methods, different datasets produce different gene sets for the same phenotype [1]. [1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
  39. GENE SET OVERLAPS, SOME BUT NOT MUCH. [Venn diagram including the "Expert Gene Set"] http://bioinformatics.psb.ugent.be/webtools/Venn/
  40. PROBLEM: THE VALIDATION GAP. [Figure labels: training data, test data; validation] Validation: predictive signatures often perform worse on independent data created for validation. Photograph by Richard Hallman, National Geographic Adventure Blog
  41. CLASSIFIER PERFORMANCE WITH DIFFERENT GENE GROUPS, DIFFERENT DATASETS. [Scatter plot: x-axis, test set performance on Griffith 2013 data; y-axis, test set performance with Metabric training and Oslo test; "Expert Gene Set" marked; outcome: 10 year survival (Yes / No)] The only difference between points is the genes used to build the SVM classifier.
  42. SUMMARY
     Plusses
     • 1 year
     • 1,000 players, 150 PhDs
     • 10,000 games
     • "Expert knowledge" captured through an open game
     • New gene ranking method with results competitive with established approaches
     • Game is now in use in an undergraduate class
     Minuses
     • Did not make a significantly better breast cancer survival predictor
     • Game could have been better in many ways: no beginning, middle or end; random guessing can win; easy to cheat
  43. NEXT STEPS. • More fun • More learning for novices • More control for experts • More data
  44. THE END. Thanks to: Players!!!! Andrew Su, Salvatore Loguercio, Max Nanis, Karthik Gangavarapu, Funding. More information at: http://genegames.org/cure/ bgood@scripps.edu @bgood. We are hiring! Looking for postdocs and programmers interested in crowdsourcing and bioinformatics. Contact: asu@scripps.edu
  45. GAMES WITH A PURPOSE of collecting expert level knowledge. Khatib, Firas, et al. "Algorithm discovery by protein folding game players." Proceedings of the National Academy of Sciences (2011). Loguercio, Salvatore, et al. "Dizeez: an online game for human gene-disease annotation." PLoS One (2013). MOLT; The Cure
  46. HUMAN GUIDED FOREST (HGF). Let CURE players build decision modules. http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html
  47. WHY DID YOU SIGN UP? (83 RESPONSES) Why did you sign up for The Cure? (select all that apply) [Bar chart, 0% to 90%: To help breast cancer research; To learn something; To have fun playing a game]
  48. WAS THE GAME FUN? [Bar chart, y-axis "percent" (0 to 0.8): Yes, it was very fun; A little bit entertaining; No, not at all]
  49. DO YOU KNOW ANYONE THAT HAS OR HAD BREAST CANCER? Have you known or do you currently know anyone that has or has had breast cancer? Yes; No
  50. DID YOU LEARN ANYTHING FROM PLAYING? [Bar chart, 0 to 60: Yes, I felt like I learned a lot; Yes, I learned a little bit; No, I did not learn anything]
  51. MY KNOWLEDGE OF BREAST CANCER IS: [Bar chart, 0 to 0.6: I am an expert in breast cancer; I have helped conduct cancer research as part of my job; I know some biology and have some understanding of what cancer is; I know a little biology, but nothing specific to cancer; Nothing, I do not know a thing about it]
  52. AGE? Which category below includes your age? 17 or younger; 18-20; 21-29; 30-39; 40-49; 50-59; 60 and above
  53. GENDER? What is your gender? Female; Male
  54. TRAINING LEVELS
  55. The decision tree created using the feature "makes milk" is 100% correct on training data, you win!
  56. TRAINING INTERFACE. Choose the feature that best distinguishes mammals from other creatures
  57. TRAINING INTERFACE. The decision tree created using the feature "has hair" is 94% correct on training data, you win!
  58. OVERLAP OF SIGNIFICANT GENE SETS FROM DIFFERENT CURE GAME FILTERS. [Venn diagram sets: PhD or MD (3,070 games); Cancer Knowledge (4,660 games); Biologist (4,913 games); PhD & Cancer Knowledge (2,373 games); No Expertise (3,607 games)]
  59. MOST RANDOM GENE EXPRESSION SIGNATURES ARE SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER OUTCOME. Still need to pick gene sets. Feature selection challenge still relevant. Very useful grain of salt in interpreting these results. Venet et al. (2011). PLoS Comp. Bio.
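
The scoring rule on slide 22 (score = training accuracy of a decision tree built from only the selected genes) can be illustrated with a rough sketch. This is not the Cure server's code: it swaps the full decision tree for a one-level decision stump to stay dependency-free, and the names `stump_accuracy`, `score_hand`, `GENE_A` and `GENE_B`, as well as the toy data, are hypothetical.

```python
def stump_accuracy(values, labels):
    """Best training accuracy of a one-threshold split on a single gene.

    Assumption: a one-level stump stands in for the decision tree
    mentioned on the slide; the real tree learner is not described there.
    """
    n = len(labels)
    best = 0.0
    for threshold in sorted(set(values)):
        # Predict "survives" when expression is above the threshold.
        correct = sum((v > threshold) == y for v, y in zip(values, labels))
        # Also allow the opposite orientation of the split.
        best = max(best, correct / n, (n - correct) / n)
    return best


def score_hand(expression, labels, hand):
    """Score a hand as the best stump accuracy over its selected genes."""
    return max(stump_accuracy(expression[gene], labels) for gene in hand)


if __name__ == "__main__":
    # Toy training data: four samples, two hypothetical genes.
    expression = {
        "GENE_A": [1.0, 2.0, 8.0, 9.0],
        "GENE_B": [5.0, 1.0, 5.0, 1.0],
    }
    survived_10_years = [False, False, True, True]
    print(score_hand(expression, survived_10_years, ["GENE_A", "GENE_B"]))
    print(score_hand(expression, survived_10_years, ["GENE_B"]))
```

In this toy setup a hand containing an informative gene scores higher than a hand of uninformative genes, which is the incentive the game relies on.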
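
The gene ranking statistics on slide 32 (O, S, F = S/O and an empirical P value) can be sketched as follows. The slides do not state the null model, so this sketch assumes each appearance of a gene is an independent chance pick with a fixed probability `chance_rate`; that parameter, its value of 0.2 and the function names are hypothetical, and the resulting P values are not expected to reproduce the figures on the slide.

```python
import random


def selection_frequency(occurrences, selections):
    """F = S / O: the fraction of appearances in which the gene was picked."""
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return selections / occurrences


def empirical_p(occurrences, selections, chance_rate, trials=20000, seed=0):
    """Monte Carlo estimate of P(S or more selections by chance | O).

    Assumed null model (not from the slides): every appearance is an
    independent pick with probability `chance_rate`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        simulated = sum(
            1 for _ in range(occurrences) if rng.random() < chance_rate
        )
        if simulated >= selections:
            hits += 1
    # The +1 correction keeps the estimate away from exactly zero.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # O and S values taken from the two examples on the slide.
    for name, o, s in [("B-cell lymphoma 2", 13, 10), ("second example", 33, 3)]:
        f = selection_frequency(o, s)
        p = empirical_p(o, s, chance_rate=0.2)
        print(f"{name}: F = {f:.2f}, empirical P = {p:.4f}")
```

A gene picked far more often than chance allows gets a small P, while a gene picked about as often as (or less often than) chance gets a large one; filtering games by player data before counting O and S changes both.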