Games for improving human phenotype prediction

Benjamin M Good, Salvatore Loguercio, Andrew I Su
The Scripps Research Institute, La Jolla, California, USA

ABSTRACT

An important goal for biomedical research is to produce genetic and genomic predictors for human phenotypes such as disease prognosis or drug response. To this end, we can now quantify an extremely large number of potential biomarkers for any biological sample. In fact, a single sample could reasonably be described by millions of molecular variations in DNA, RNA, proteins, and metabolites. However, the actual number of samples processed typically remains small in comparison. As a result, attempts to use this data to build predictors often face problems of overfitting. (While a predictive pattern may describe training data very well, it may not reproduce well on other datasets.)

It has recently been shown that biological knowledge in the form of gene annotations and pathway databases can be used to guide the process of inferring phenotype predictors [1-3]. While promising, such methods are limited by the amount, quality and problem-specific applicability of the structured knowledge that is available.

Following in the line of games that have recently demonstrated success as a means of ‘crowdsourcing’ difficult biological problems [4,5], we are developing games with the purpose of improving human phenotype predictions. Our games work on two levels: (1) games such as Dizeez and GenESP collect novel gene annotations and (2) games like Combo engage players directly in the process of predictor inference.

Play game prototypes at: http://www.genegames.org

Game Objectives

• Capture general community knowledge in a useful structure
[Figure labels: Community, gene, pathway, Phenotype]
• Concentrate community knowledge and reasoning around predicting a particular phenotype
[Figure labels: Phenotype 1, Phenotype 2]

Dizeez: gene – disease annotation quiz

Select the disease related to the clue gene. Guess as many as you can in one minute.
Every guess adds weight to a link between a gene and a disease.

Preliminary Results

• 713 games, 180 players
• Overall: 4,585 unique gene-disease assertions
• 224 assertions provided more than once and not found in OMIM/PharmGKB
[Table: Top associations provided four or more times and not found in OMIM/PharmGKB.]

Even after limited game playing, the Dizeez game resulted in the identification of several novel gene-disease annotations.

GeneESP: gene – concept association with a partner

Guess what genes your partner is thinking about when they see ‘neuroblastoma’.

Improvements compared to Dizeez:
• Reward new, useful annotations with points
• Add social interaction
• Enable gene-gene, gene-disease, gene-function games on the same platform
• Increase scalability of annotation collection (does not depend on a database of ‘right’ answers)

Combo: feature selection with community intelligence

• Goal: pick the best set of genes
• Best: the gene set that produces the best decision tree classifier
• Classifier: created using training data and selected genes, used to predict phenotype (e.g. breast cancer prognosis)
[Figure panels: A game board; A hand; Inferred decision tree. Score: 78 (percent correct)]

Game Score: determined by estimating performance of trees constructed using the selected features on training data.

Feature sets from many individual games are used to create a Decision Tree Forest classifier. (Each tree votes once.)

Human Guided Forest

Ensemble classifier where components are decision trees constructed using manually selected subsets of features. Adaptation of Network Guided and Random Forests [1,2].

REFERENCES

1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
2. Winter et al (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology
3. Liu et al (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics
4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology
5. Kawrykow et al (2012) Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS One

CONTACT

Benjamin Good: email@example.com
Salvatore Loguercio: firstname.lastname@example.org
Andrew Su: email@example.com

FUNDING

We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924) and the NIH through the FaceBase Consortium for a particular emphasis on craniofacial genes (DE-20057).
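
The Dizeez idea that every guess adds weight to a gene-disease link, and that repeated assertions absent from OMIM/PharmGKB are candidate novel annotations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the case normalization, and the placeholder gene and disease names are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_guesses(guesses):
    """Count how often each (gene, disease) pair was asserted by players.

    Gene symbols are upper-cased and disease names lower-cased so that
    differently capitalized guesses add weight to the same link
    (an assumed normalization, not taken from the poster).
    """
    return Counter((gene.upper(), disease.lower()) for gene, disease in guesses)


def candidate_novel_links(weights, known_links, min_count=2):
    """Return links asserted at least `min_count` times and not in `known_links`.

    `known_links` stands in for a reference set such as OMIM/PharmGKB.
    Results are sorted by descending weight, then alphabetically.
    """
    known = {(g.upper(), d.lower()) for g, d in known_links}
    hits = [
        (pair, n)
        for pair, n in weights.items()
        if n >= min_count and pair not in known
    ]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical guesses with placeholder names (not real data).
guesses = [
    ("GENE_A", "disease x"),
    ("gene_a", "Disease X"),
    ("GENE_B", "disease y"),
    ("GENE_C", "disease z"),
    ("GENE_C", "disease z"),
]
known = [("GENE_C", "disease z")]  # already in the reference set

weights = aggregate_guesses(guesses)
novel = candidate_novel_links(weights, known, min_count=2)
```

Setting `min_count=2` corresponds to "provided more than once"; `min_count=4` would correspond to the "four or more times" filter mentioned for the top associations.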
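
The Human Guided Forest idea (one tree per manually selected feature set, each tree voting once) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' classifier: depth-1 decision stumps stand in for full decision trees, labels are binary (0/1), and the gene names, expression values, and feature sets are invented.

```python
from collections import Counter


def fit_stump(rows, labels, features):
    """Fit a depth-1 tree restricted to `features`.

    Searches every (feature, threshold, label-above-threshold) rule and
    keeps the first one with the highest training accuracy.
    """
    best_acc, best_rule = -1.0, None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            for above in (0, 1):
                preds = [above if row[f] > t else 1 - above for row in rows]
                acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
                if acc > best_acc:
                    best_acc, best_rule = acc, (f, t, above)
    return best_rule


def predict_stump(rule, row):
    """Apply a (feature, threshold, label-above-threshold) rule to one sample."""
    f, t, above = rule
    return above if row[f] > t else 1 - above


def fit_forest(rows, labels, feature_sets):
    """Build one stump per player-selected feature set."""
    return [fit_stump(rows, labels, fs) for fs in feature_sets]


def predict_forest(forest, row):
    """Majority vote: each tree votes once."""
    votes = Counter(predict_stump(rule, row) for rule in forest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: expression values for three placeholder genes.
rows = [
    {"g1": 0.1, "g2": 0.9, "g3": 0.5},
    {"g1": 0.2, "g2": 0.8, "g3": 0.4},
    {"g1": 0.8, "g2": 0.2, "g3": 0.6},
    {"g1": 0.9, "g2": 0.1, "g3": 0.5},
]
labels = [0, 0, 1, 1]

# Each inner list plays the role of the gene set chosen in one game.
feature_sets = [["g1"], ["g2"], ["g1", "g3"]]

forest = fit_forest(rows, labels, feature_sets)
high = predict_forest(forest, {"g1": 0.85, "g2": 0.15, "g3": 0.5})
low = predict_forest(forest, {"g1": 0.15, "g2": 0.85, "g3": 0.5})
```

An odd number of feature sets is used here so the vote cannot tie. In a fuller version each stump would be replaced by a complete decision tree learner, and accuracy would be estimated on held-out data rather than on the training set.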