An online game for improving human phenotype prediction

Benjamin M Good, Salvatore Loguercio, Andrew I Su
The Scripps Research Institute, La Jolla, California, USA

ABSTRACT

An important goal for biomedical research is to produce genetic and genomic predictors for human phenotypes such as disease prognosis or drug response. To this end, we can now quantify an extremely large number of potential biomarkers for any biological sample. In fact, a single sample could reasonably be described by millions of molecular variations in DNA, RNA, proteins, and metabolites. However, the actual number of samples processed typically remains small in comparison. As a result, attempts to use this data to build predictors often face problems of overfitting. (While a predictive pattern may describe training data very well, it may not reproduce well on other datasets.)

It has recently been shown that biological knowledge in the form of gene annotations and pathway databases can be used to guide the process of inferring phenotype predictors [1-3]. While promising, such methods are limited by the amount, quality and problem-specific applicability of the structured knowledge that is available.

Following in the line of games that have recently demonstrated success as a means of ‘crowdsourcing’ difficult biological problems [4,5], we are developing games with the purpose of improving human phenotype predictions. Our games work on two levels: (1) games such as Dizeez and GenESP collect novel gene annotations and (2) games like Combo engage players directly in the process of predictor inference.

Play game prototypes at: http://www.genegames.org
(Also see Poster I03)

Motivation

• Using prior biological knowledge, it is possible to identify stronger, more consistent predictive patterns.
• Prior knowledge encoded in protein-protein interaction databases [1,2] and pathway databases has been used to improve phenotype prediction.
• What about knowledge that is not recorded in structured databases?

[Figure: Network Guided Forest from Dutkowski et al (2011)]

Challenge

• With tens of thousands of measurements but only hundreds of samples, many possible patterns are found.
• But which ones are real?

[Figure: find patterns in cancer and normal samples (Phenotype 1, Phenotype 2); make predictions on new samples]

Opportunity

• Online games are successfully tapping into the knowledge and reasoning abilities of thousands of people.
  Examples: Label all images on the Web; Devise protein folding algorithms; Design RNA molecules; Fix multiple sequence alignments
• COMBO is designed to motivate and enable people to help improve phenotype predictors

[Figure label: select predictive gene sets]

Combo: feature selection with community intelligence

• Goal: pick the best set of genes
• Best: the gene set that produces the best decision tree classifier
• Classifier: created using training data and selected genes, used to predict phenotype (e.g. breast cancer prognosis)

[Figure: A game board; A hand; Inferred decision tree; Score: 78 (percent correct)]

Game Score: determined by estimating performance of trees constructed using the selected features on training data.

Human Guided Forest: Ensemble classifier where components are decision trees constructed using manually selected subsets of features. Adaptation of Network Guided and Random Forests.

Feature sets from many individual games are used to create a Decision Tree Forest classifier. (Each tree votes once.)

REFERENCES

1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
2. Winter et al (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology
3. Liu et al (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics
4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology
5. Kawrykow et al (2012) Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS One

CONTACT

Benjamin Good: email@example.com
Salvatore Loguercio: firstname.lastname@example.org
Andrew Su: email@example.com

FUNDING

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).
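
ILLUSTRATIVE SKETCH

The Combo scoring and the Human Guided Forest vote described on this poster can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the game's actual implementation: depth-one trees (decision stumps) stand in for full decision trees, plain training accuracy stands in for whatever performance estimate the game uses, and the gene names, expression values, and function names are all hypothetical.

```python
from collections import Counter


def train_stump(samples, labels, genes):
    """Fit a depth-one decision tree (a stump) using only the selected genes.

    Assumption: a stump is a simplified stand-in for a full decision tree.
    Returns (gene, threshold, label_if_high, label_if_low).
    """
    classes = sorted(set(labels))
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback when no gene offers a split: always predict the majority class.
    best = (-1, genes[0], float("inf"), majority, majority)
    for gene in genes:
        values = sorted({s[gene] for s in samples})
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for high_label in classes:
                for low_label in classes:
                    n_correct = sum(
                        (high_label if s[gene] > threshold else low_label) == y
                        for s, y in zip(samples, labels)
                    )
                    if n_correct > best[0]:
                        best = (n_correct, gene, threshold, high_label, low_label)
    return best[1:]


def predict_stump(stump, sample):
    gene, threshold, high_label, low_label = stump
    return high_label if sample[gene] > threshold else low_label


def score_hand(samples, labels, genes):
    """Game score for one hand of genes: percent correct on the training data."""
    stump = train_stump(samples, labels, genes)
    n_correct = sum(predict_stump(stump, s) == y for s, y in zip(samples, labels))
    return 100.0 * n_correct / len(labels)


def forest_predict(stumps, sample):
    """Ensemble prediction: each tree votes once, the most common label wins."""
    votes = Counter(predict_stump(stump, sample) for stump in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical toy training data: expression values for three made-up genes.
samples = [
    {"GENE_A": 5.0, "GENE_B": 1.0, "GENE_C": 3.0},
    {"GENE_A": 4.5, "GENE_B": 2.0, "GENE_C": 1.0},
    {"GENE_A": 1.0, "GENE_B": 1.5, "GENE_C": 2.5},
    {"GENE_A": 0.5, "GENE_B": 2.5, "GENE_C": 0.5},
]
labels = ["cancer", "cancer", "normal", "normal"]

# Hands (gene sets) selected by three hypothetical players in separate games.
hands = [["GENE_A", "GENE_B"], ["GENE_B"], ["GENE_A", "GENE_C"]]
forest = [train_stump(samples, labels, hand) for hand in hands]

new_sample = {"GENE_A": 4.0, "GENE_B": 1.8, "GENE_C": 2.0}
for hand in hands:
    print(hand, "score:", score_hand(samples, labels, hand))
print("forest prediction:", forest_predict(forest, new_sample))
```

Scoring on the training data alone mirrors the poster's description of the game score, but it is also the setting in which overfitting is hardest to detect; a held-out or cross-validated estimate would be a natural variation.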