The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Survival Prediction

THE CURE: A GAME WITH THE PURPOSE OF
GENE SELECTION FOR BREAST CANCER
SURVIVAL PREDICTION
Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su
The Scripps Research Institute
http://genegames.org/cure/
Rocky 2013

A QUESTION

How would you get 150 PhD level scientists
to work together on the same problem?

Without any money?

TRAIL MAP

Games
Survival Prediction
The Cure

WHY GAMES?

It is estimated that 9 billion
hours are spent playing
Solitaire every year

Luis von Ahn. Google Tech Talk: Human Computation, 2006.
(Shortly after receiving $500,000 “Genius Grant” for this work)

Empire State Building: seven million hours of human labor

ONE YEAR SOLITAIRE =
1,285 EMPIRE STATE
BUILDINGS

150 billion hours gaming each year

What if we could use a tiny fraction of that
human effort to achieve another purpose?

[Chart: hours of human effort. Empire State Building: 7M; one year of Solitaire: 9B; one year of games: 150B]

McGonigal J. Reality is broken : why games make us better and how they can
change the world. New York: Penguin Press; 2011.

PURPOSES
Computer science
• Find objects inside images
• Tag songs
• Label all images on the Web
• Rate image quality
• Teach computers English

Biology
• Figure out how proteins fold
• Design RNA molecules
• Build ontologies
• Map connections between neurons
• Link genes with diseases
• Assemble genomes
• Align DNA and protein sequences
• Tag Malaria parasites in blood smears
• Develop better treatments for breast cancer

GAMES WITH A PURPOSE

MOLT
The Cure

INFERRING SURVIVAL PREDICTORS
[Diagram: find patterns in samples labeled by 10 year survival (Yes/No), then make predictions on new samples (10 year survival? Yes/No)]

van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.

INFERRING SURVIVAL PREDICTORS
[Diagram: find patterns, then make predictions (10 year survival? Yes/No)]

1) select genes

Out of the 25,000+ genes, which
small set works together the best?

2) infer predictor from data (e.g. decision tree, SVM, etc.)
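
The two steps above can be sketched on toy data. This is only an illustration under stated assumptions: the gene names and expression values are invented, the selection rule (difference in class means) is a simple stand-in for whatever selection method is used, and a nearest-centroid rule is swapped in for the decision tree or SVM mentioned on the slide.

```python
from statistics import mean

def select_genes(expr, labels, k):
    """Step 1: keep the k genes whose mean expression differs most between classes."""
    def gap(gene):
        pos = [row[gene] for row, y in zip(expr, labels) if y]
        neg = [row[gene] for row, y in zip(expr, labels) if not y]
        return abs(mean(pos) - mean(neg))
    return sorted(expr[0], key=gap, reverse=True)[:k]

def fit_centroids(expr, labels, genes):
    """Step 2: infer a predictor (here, one centroid per class over the selected genes)."""
    centroids = {}
    for cls in (True, False):
        rows = [row for row, y in zip(expr, labels) if y == cls]
        centroids[cls] = {g: mean(r[g] for r in rows) for g in genes}
    return centroids

def predict(centroids, sample):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def dist(cls):
        return sum((sample[g] - c) ** 2 for g, c in centroids[cls].items())
    return min(centroids, key=dist)

# Hypothetical toy data: one dict of gene -> expression per tumour sample.
expr = [
    {"GENE_A": 5.0, "GENE_B": 1.0, "GENE_C": 3.0},
    {"GENE_A": 4.8, "GENE_B": 1.2, "GENE_C": 2.9},
    {"GENE_A": 1.0, "GENE_B": 1.1, "GENE_C": 3.1},
    {"GENE_A": 1.2, "GENE_B": 0.9, "GENE_C": 3.0},
]
labels = [True, True, False, False]  # True = survived 10 years (invented labels)

selected = select_genes(expr, labels, k=1)
model = fit_centroids(expr, labels, selected)
new_sample = {"GENE_A": 4.5, "GENE_B": 1.0, "GENE_C": 3.0}
prediction = predict(model, new_sample)
```

With 25,000+ genes the selection step dominates: the predictor only ever sees the genes that survive step 1.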

PROBLEM: GENE SELECTION INSTABILITY

instability: different methods and different datasets
produce different gene sets for the same phenotype [1]

[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
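
One common way to quantify this kind of instability is the Jaccard index between the gene sets produced by different methods or datasets. The sketch below is an illustration only; the signature names and gene identifiers are hypothetical, and the slides do not say which overlap measure, if any, was used.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical signatures for the same phenotype from three methods or datasets.
signatures = {
    "method_1": {"G1", "G2", "G3", "G4"},
    "method_2": {"G3", "G4", "G5", "G6"},
    "method_3": {"G7", "G8", "G9", "G1"},
}

# Pairwise overlap: values near 0 mean the signatures barely agree.
pairwise = {
    (x, y): jaccard(signatures[x], signatures[y])
    for x, y in combinations(sorted(signatures), 2)
}
```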

PROBLEM: THE VALIDATION GAP

training data, test data
validation

validation: predictive signatures often perform
worse on independent data created for validation.

Photograph by Richard Hallman, National Geographic Adventure Blog

ADDING PRIOR KNOWLEDGE TO THE DISCOVERY
ALGORITHM
make predictions
find patterns

<10 yr
survival
>10 yr
survival

EX.) NETWORK GUIDED FORESTS

Use network to find good gene combinations

Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology

BUT MOST KNOWLEDGE IS NOT STRUCTURED
[Chart: number of articles added to PubMed; y-axis from 500,000 to 1,000,000]

112 publications/hour
(37 more by the end of this talk)

>160,000 publications linked to “breast cancer” since 2000
http://tinyurl.com/brsince2000

HOW CAN WE USE UNSTRUCTURED
KNOWLEDGE FOR GENE SELECTION?

Need an intelligent system that is good at reading and hypothesizing

Like you

THE CURE

HTTP://GENEGAMES.ORG/CURE/

education level?
cancer knowledge?

biologist?

PLAY = GENE SELECTION
Opponent's hand

Alternate turns picking a gene from a “board” of 25

Your hand

SCORING
Score reflects accuracy of decision tree created with
just the selected genes on real training data
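
A rough sketch of this kind of scoring follows, assuming the simplest possible tree: a single split on one gene (a decision stump). The actual tree learner on the Cure server is not described in the slides, so the functions, gene names, and data here are hypothetical.

```python
def stump_accuracy(values, labels):
    """Best training accuracy of a one-split rule 'value > t', in either polarity."""
    n = len(labels)
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v > t) == y for v, y in zip(values, labels))
        # Flipping the rule's polarity turns every error into a hit.
        best = max(best, correct / n, (n - correct) / n)
    return best

def score_hand(hand, expr, labels):
    """Score a hand as the best stump accuracy over the genes it contains."""
    return max(stump_accuracy([row[g] for row in expr], labels) for g in hand)

# Hypothetical training data: one dict of gene -> expression per sample.
expr = [
    {"GENE_A": 5.0, "GENE_B": 1.0},
    {"GENE_A": 4.8, "GENE_B": 1.2},
    {"GENE_A": 1.0, "GENE_B": 1.1},
    {"GENE_A": 1.2, "GENE_B": 0.9},
]
labels = [True, True, False, False]  # True = survived 10 years (invented labels)

informative_hand = score_hand(["GENE_A"], expr, labels)
weak_hand = score_hand(["GENE_B"], expr, labels)
```

Because the score is computed on training data, a hand can look good by chance, which is one reason the deck later checks gene sets on independent test sets.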

Cure Server

PLAY WITH KNOWLEDGE: GENE ONTOLOGY

PLAY WITH KNOWLEDGE: GENE RIFS

COMMUNITY BOARD VIEW,
CHOOSE OPEN BOARD
You beat this one

The community finished this board (e.g. 11 different players completed it)

This board is still open

BOARDS
• 25 genes each
• randomly selected from 1,250 genes that passed an unsupervised filter for minimum expression level and variance for a particular dataset [1],[2]
• 4 different 100 board rounds completed, each with some overlap
• 3,731 distinct genes used in the game
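
As an illustration of the board construction described above, one round could be built as follows. This is a minimal sketch: it assumes each board is drawn independently (so a gene may appear on several boards), the gene identifiers are placeholders, and the expression and variance filter is treated as already applied.

```python
import random

def make_round(filtered_genes, n_boards=100, board_size=25, seed=0):
    """Draw n_boards boards, each a random sample of board_size distinct genes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.sample(filtered_genes, board_size) for _ in range(n_boards)]

# Placeholder identifiers standing in for the 1,250 genes that passed the filter.
filtered_genes = [f"GENE_{i}" for i in range(1250)]

boards = make_round(filtered_genes)
distinct_genes = {gene for board in boards for gene in board}
```

Independent draws mean some genes land on several boards while others never appear in a given round.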

[1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012)
[2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)

1,077 Players registered (one year)
http://io9.com/these-cool-games-let-you-do-real-life-science-486173006

PLAYERS
[Chart: new player registrations per month (Sep to Aug), broken down by most recent degree (PhD, MD, MSc, BA, none, other, did not state), with %PhD on a second axis. Annotation: Sage DREAM7 challenge, game announcement]

PLAYER DEMOGRAPHICS
[Charts: player counts by most recent degree (graduate_degree, undergraduate, none); by “Cancer knowledge?” (no, ns, yes); and by “Are you a Biologist?” (no, ns, yes)]

GAMES PLAYED

• 9,904 games (non training)

[Charts: total games played per player (log scale); games played by the top 20 players, labeled by degree (PhD, MD, MS)]

GENE RANKINGS FROM GAMES
make predictions
find patterns

<10 yr
survival
>10 yr
survival

GENE RANKINGS FROM GAMES
• For each gene:
  1. O = number of times it appeared in a game (some genes occur on multiple boards, and all boards are played multiple times; all occurrences are counted)
  2. S = number of times it was selected by a player
  3. F = S/O
• Games can be filtered based on player data
• We can estimate an empirical P value for each pair of values (O, S)
• P reflects the probability of getting S or more selections by chance, given O

Examples (all games):
• B-cell lymphoma 2 gene: O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001
• Alanine and arginine rich domain containing protein: O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
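
The ranking procedure above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the game record layout is hypothetical, and the null model (each occurrence picked independently with a fixed chance-level rate `p_pick`) is an assumed stand-in, since the slides do not state how the empirical P values were simulated. The value `p_pick = 0.2` is likewise an assumption, so the resulting P values will not match the slide's figures exactly.

```python
import random
from collections import Counter

def tally(games, keep=lambda player: True):
    """Count occurrences (O) and selections (S) per gene over games whose player passes `keep`."""
    occurred, selected = Counter(), Counter()
    for game in games:
        if keep(game["player"]):
            occurred.update(game["board"])
            selected.update(game["picked"])
    return occurred, selected

def empirical_p(o, s, p_pick, n_sim=10000, seed=0):
    """Fraction of simulated chance outcomes with at least s selections in o occurrences."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        simulated = sum(rng.random() < p_pick for _ in range(o))
        hits += simulated >= s
    return hits / n_sim

# Hypothetical game records; real boards hold 25 genes.
games = [
    {"player": {"phd": True}, "board": ["G1", "G2", "G3"], "picked": ["G1"]},
    {"player": {"phd": False}, "board": ["G1", "G2", "G4"], "picked": ["G2"]},
    {"player": {"phd": True}, "board": ["G1", "G3", "G4"], "picked": ["G1"]},
]
o_all, s_all = tally(games)
o_phd, s_phd = tally(games, keep=lambda player: player["phd"])
f_g1_all = s_all["G1"] / o_all["G1"]  # F = S/O

# The two (O, S) pairs from the slide, under the assumed chance-level rate.
p_often_picked = empirical_p(13, 10, p_pick=0.2)
p_rarely_picked = empirical_p(33, 3, p_pick=0.2)
```

Filtering before tallying is what lets the same game log yield separate rankings for, say, players with PhDs and players reporting no expertise.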

GENES SELECTED BY ALL PLAYERS
9904 GAMES
P<0.001, 60 GENES
Top 10 enriched disease annotations

n genes

adj. P < 2.43e-06
background = 3731 genes
used in any game

Top 10 genes

Wang, Jing, et al. "WEB-based GEne SeT
AnaLysis Toolkit (WebGestalt): update 2013."
Nucleic acids research (2013).

GENES SELECTED BY PEOPLE:
WITH PHDS,
WITH KNOWLEDGE OF CANCER
2373 GAMES
P<0.001, 82 GENES
Top 10 enriched disease annotations

“Expert Gene Set”
n genes

adj. P < 5.76e-08
Top 10 genes

GENES SELECTED BY PEOPLE:
WITHOUT PHDS,
WITH NO KNOWLEDGE OF CANCER,
WHO ARE NOT BIOLOGISTS
3607 GAMES
P<0.001, 10 GENES
Top 10 genes

• Gene set not significantly enriched with any disease annotations

SELF REPORTING SEEMED TO WORK...

EVEN WITHOUT FILTERING, THE DATA CONTAINS
THE KNOWLEDGE
• “All Players” still contained significant cancer signal.

GENE SET OVERLAPS, SOME BUT NOT MUCH

http://bioinformatics.psb.ugent.be/webtools/Venn/

CLASSIFIER PERFORMANCE WITH DIFFERENT
GENE GROUPS, DIFFERENT DATASETS
10 year survival: Yes / No

X-axis: test set performance, Griffith 2013 data

Y-axis: test set performance, Metabric training, Oslo test

The only difference between points is the genes used to
build the SVM classifier

SUMMARY
Plusses
• 1 year
• 1,000 players, 150 PhDs
• 10,000 games
• “expert knowledge” captured through an open game
• New gene ranking method with results competitive with established approaches
• Game is now in use in an undergraduate class

Minuses
• Did not make a significantly better breast cancer survival predictor
• Game could have been better in many ways
  • no beginning, middle or end
  • random guessing can win
  • easy to cheat

NEXT STEPS
•

More fun

•

More learning for novices

•

More control for experts

•

More data

THE END
Thanks to:
Players!!!!
Andrew Su
Salvatore Loguercio
Max Nanis
Karthik Gangavarapu
Funding

More information at:
http://genegames.org/cure/
bgood@scripps.edu
@bgood
We are hiring! Looking for
postdocs, programmers
interested in crowdsourcing
and bioinformatics.
Contact: asu@scripps.edu

GAMES WITH A PURPOSE

of collecting expert level knowledge

Khatib, Firas, et al. "Algorithm discovery by
protein folding game players." Proceedings of
the National Academy of Sciences (2011)

Loguercio, Salvatore, et al.
"Dizeez: an online game for
human gene-disease
annotation." PloS One (2013)

MOLT
The Cure

HUMAN GUIDED FOREST (HGF)

Let CURE players build
decision modules

http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html

WHY DID YOU SIGN UP? (83 RESPONSES)
Why did you sign up for The Cure? (select all that apply)
[Bar chart: percent of respondents choosing each answer]
• To help breast cancer research
• To learn something
• To have fun playing a game

WAS THE GAME FUN?
[Bar chart: share of respondents choosing each answer]
• Yes, it was very fun
• A little bit entertaining
• No, not at all

DO YOU KNOW ANYONE THAT HAS OR HAD
BREAST CANCER?
Have you known or do you currently know anyone that has or has had breast cancer?

Yes
No

DID YOU LEARN ANYTHING FROM PLAYING?
[Bar chart of responses]
• Yes, I felt like I learned a lot
• Yes, I learned a little bit
• No, I did not learn anything

MY KNOWLEDGE OF BREAST CANCER IS:
[Bar chart: share of respondents choosing each answer]
• I am an expert in breast cancer
• I have helped conduct cancer research as part of my job
• I know some biology and have some understanding of what cancer is
• I know a little biology, but nothing specific to cancer
• Nothing, I do not know a thing about it

AGE?
Which category below includes your age?

17 or younger
18-20
21-29
30-39
40-49
50-59
60 and above

GENDER?
What is your gender?

Female
Male

the decision tree created using the
feature “makes milk” is 100%
correct on training data, you win!

TRAINING INTERFACE

Choose the feature that best
distinguishes mammals from other
creatures

TRAINING INTERFACE

the decision tree created using the
feature “has hair” is 94% correct
on training data, you win!

OVERLAP OF SIGNIFICANT GENE SETS FROM
DIFFERENT CURE GAME FILTERS
PhD or MD (3,070 games)
Cancer Knowledge (4,660 games)
Biologist (4,913 games)

PhD & Cancer Knowledge (2,373 games)

No Expertise (3,607 games)

MOST RANDOM GENE EXPRESSION SIGNATURES ARE
SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER
OUTCOME

Still need to pick gene sets
Feature selection challenge still relevant
Very useful grain of salt in interpreting these results.

Venet et al. (2011). PLoS Comp. Bio.
