" at least a twofold difference and a P-value of less than 0.01 in more than five tumours”
a, Two-dimensional presentation of transcript ratios for 98 breast tumours. There were 4,968 significant genes across the group. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters. b, Selected clinical data for the 98 patients in a: BRCA1 germline mutation carrier (or sporadic patient), ER expression, tumour grade 3 (versus grade 1 and 2), lymphocytic infiltrate, angioinvasion, and metastasis status. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. The cluster below the yellow line consists of 36 tumours, of which 34 are ER negative (total 39 ER-negative) and 16 are carriers of the BRCA1 mutation (total 18). c, Enlarged portion from a containing a group of genes that co-regulate with the ER-α gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST. d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
(A) Average area under the ROC curve for NGF, RF, NGF applied to permuted networks (NGF**), and Naïve Bayes, compared to reported scores for representative previous methods (error bars denote standard deviation estimated over 100 runs). (B) General cancer and breast cancer associated genes identified among the 100 top-scoring genes or 100 most abundant genes in the forest created using RF or NGF, using the real network or networks with permuted edges (average over 100 permutations is shown). (C) Genes ranked by their importance for classification in two independent breast cancer patient cohorts (y vs. x axis). Network-Guided Forest, blue points; regular Random Forest, green points.
Human Guided Forests (HGF)
HUMAN GUIDED FORESTS Su lab meeting March 30, 2012 Benjamin Good
CHALLENGE Need to build biological class predictors that: 1. Have high accuracy 2. Use relatively few variables. To do this we have to use datasets that: 1. Are very noisy 2. Contain enormous numbers of variables
EXAMPLE: BREAST CANCER PROGNOSIS van 't Veer 2002 Nature. 98 breast cancer samples: 34 developed metastases within 5 years, 44 did not, 18 had BRCA1 mutations, 2 had BRCA2 mutations. Expression levels of 25,000 genes measured; 5,000 genes “significantly regulated across the sample groups”
Figure labels: 98 tumors; 5000 genes; genes that co-regulate with ER; co-regulated genes indicating lymphocytic infiltrate; 70% bad; 30% bad
METASTASIS PREDICTOR 231 genes were found to be significantly associated with disease outcome. Using leave-one-out cross-validation they empirically selected the 70 best individual genes to build their predictor. Of the 78 samples in the training set the predictor correctly classified 65 (83%). -> MammaPrint test from Agendia. Still in clinical trials (MINDACT), 10 years after the original study
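A minimal sketch of leave-one-out cross-validation with top-k gene selection, as described on the slide above. Everything concrete here is an assumption made for illustration: the data are synthetic, genes are ranked by absolute Pearson correlation with outcome, the classifier is a nearest-centroid rule, and gene selection is repeated inside every fold. The slide does not say how the original study handled these details, and the function names are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def top_genes(X, y, k):
    """Indices of the k genes most correlated (in absolute value) with outcome."""
    n_genes = len(X[0])
    scores = [abs(pearson([row[g] for row in X], y)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

def nearest_centroid_predict(X, y, genes, sample):
    """Assign the sample to the class whose centroid (over `genes`) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(X, y, k):
    """Hold out each sample once; select genes and fit on the remaining ones."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        genes = top_genes(X_train, y_train, k)  # selection inside the fold (assumption)
        if nearest_centroid_predict(X_train, y_train, genes, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Synthetic data: 20 samples, 30 genes, only the first 3 carry signal.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[rng.gauss(5.0 * label if g < 3 else 0.0, 1.0) for g in range(30)] for label in y]
accuracy = loocv_accuracy(X, y, k=3)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

For reference, the figure quoted on the slide is the same kind of ratio: 65 correct out of 78 is about 83%.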
WE CAN DO BETTER This signature does not take advantage of: • interactions between genes (together two variables may be much more predictive than either one alone) • biological knowledge (this signature leaves out several known cancer predictors and does not make use of biological knowledge in any way)
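A toy illustration of the interaction point above, using invented data: two binary "genes" whose XOR determines the class. Neither gene alone does better than chance, while a rule that uses both together is perfect. The data and function names are hypothetical and chosen only to make the effect visible.

```python
# Invented data: the class label is the XOR of two binary features.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
labels = [a ^ b for a, b in samples]

def best_single_feature_accuracy(samples, labels, index):
    """Accuracy of the best rule that looks at only one feature
    (either 'predict the feature value' or 'predict its opposite')."""
    same = sum(1 for s, y in zip(samples, labels) if s[index] == y)
    flipped = len(labels) - same
    return max(same, flipped) / len(labels)

def pair_accuracy(samples, labels):
    """Accuracy of a rule that uses both features jointly (their XOR)."""
    correct = sum(1 for (a, b), y in zip(samples, labels) if (a ^ b) == y)
    return correct / len(labels)

acc_gene_a = best_single_feature_accuracy(samples, labels, 0)
acc_gene_b = best_single_feature_accuracy(samples, labels, 1)
acc_pair = pair_accuracy(samples, labels)
print(acc_gene_a, acc_gene_b, acc_pair)  # 0.5 0.5 1.0
```

A signature built from the best individual genes would discard both features here, which is the limitation the slide points at.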
THERE ARE MANY MANY CHALLENGES LIKE THIS IN BIOLOGY
WE CAN DO BETTER BY INTEGRATING MACHINE LEARNING WITH BIOLOGICAL EXPERTISE The standard signature does not take advantage of: • interactions between genes: machine learning algorithms can find and use these but can have problems when faced with large feature spaces • biological knowledge: can be used to guide the machine learning process towards meaningful features in the data and thus reduce chances of overfitting
MACHINE LEARNING ALGORITHM OF THE MOMENT • In each of many iterations, a small subset of features is chosen randomly and used to build one decision tree • Decision trees are stored and classifications are made based on the majority vote of all of the trees. • Good classifier! • But you get different forests every time you run it and it faces the same challenges of generalizability as any other learning algorithm.
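The bullets above can be sketched roughly as follows. This is a simplified stand-in, not a full random forest implementation: each "tree" is a depth-1 decision stump to keep the example short, there is no bootstrap resampling, the data are synthetic, and all function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, features):
    """Best single-threshold rule over the given features (a depth-1 tree)."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def train_forest(X, y, n_trees, n_features, rng):
    """One stump per iteration, each built from a random subset of features."""
    n_total = len(X[0])
    return [train_stump(X, y, rng.sample(range(n_total), n_features))
            for _ in range(n_trees)]

def predict_forest(forest, row):
    """Majority vote over all stored trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

def make_data(n_per_class, rng):
    """Synthetic data: 20 features, only the first 5 differ between classes."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(4.0 * label if g < 5 else 0.0, 1.0) for g in range(20)])
            y.append(label)
    return X, y

rng = random.Random(1)
X_train, y_train = make_data(20, rng)
X_test, y_test = make_data(20, rng)
forest = train_forest(X_train, y_train, n_trees=101, n_features=4, rng=rng)
accuracy = sum(predict_forest(forest, row) == lab
               for row, lab in zip(X_test, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Changing the seed passed to `random.Random` gives a different set of stumps, which is the "different forests every time" behaviour noted on the slide.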
NETWORK GUIDED FOREST (NGF) Same algorithm except each tree is constructed from a particular area of a relevant protein-protein interaction network. 1) Pick a gene randomly 2) Walk out along the network to get the other N genes to use to build that tree 3) Repeat. The premise is that biologically coherent modules will give better signal than individual genes randomly grouped together. Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
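Steps 1 to 3 above might look something like the sketch below. It assumes a randomized breadth-first expansion from the seed gene, which is only one way to read "walk out along the network"; the slide does not specify the walk used in the published method. The network, gene names, and `sample_module` are made up for illustration.

```python
import random

def sample_module(network, n_genes, rng):
    """Pick a random seed gene, then grow a connected set of up to n_genes
    genes by repeatedly expanding to neighbours of genes already chosen."""
    seed = rng.choice(sorted(network))
    module = [seed]
    frontier = [seed]
    while frontier and len(module) < n_genes:
        current = frontier.pop(rng.randrange(len(frontier)))
        neighbours = [g for g in sorted(network[current]) if g not in module]
        rng.shuffle(neighbours)
        for g in neighbours:
            if len(module) < n_genes:
                module.append(g)
                frontier.append(g)
    return module  # may be shorter than n_genes if the seed's component is small

# Toy undirected network (hypothetical genes and edges), as adjacency sets.
network = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g4"},
    "g3": {"g1", "g4"},
    "g4": {"g2", "g3", "g5"},
    "g5": {"g4", "g6"},
    "g6": {"g5"},
    "g7": {"g8"},
    "g8": {"g7"},
}

rng = random.Random(0)
# One feature set per tree; each set would then be handed to the tree builder.
modules = [sample_module(network, 3, rng) for _ in range(5)]
for module in modules:
    print(module)
```

Each returned list is a connected neighbourhood of the network, in contrast to the unrelated genes a plain random draw would group together.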
NGF RESULTS A) Identical performance to random forest and random network guided forest as assessed by 5-fold cross-validation repeated 100 times. B) More known breast cancer genes show up in the forest. C) Similar genes selected for forests in two different training sets (different patient cohorts)
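For readers unfamiliar with the evaluation protocol in A), here is a rough sketch of 5-fold cross-validation repeated 100 times, scored by area under the ROC curve. The classifier (a nearest-centroid score), the stratified fold assignment, and the synthetic data are all assumptions for illustration; none of it reproduces the published experiment.

```python
import random
import statistics

def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample gets the higher score (ties count half)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k, rng):
    """Shuffle each class separately and deal its samples round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def fit_centroids(X, y):
    """Per-class mean vectors for classes 0 and 1."""
    def centroid(cls):
        rows = [r for r, lab in zip(X, y) if lab == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    return centroid(0), centroid(1)

def centroid_score(centroids, row):
    """Higher score means closer to the class-1 centroid than to class 0."""
    c0, c1 = centroids
    d0 = sum((a - b) ** 2 for a, b in zip(row, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(row, c1))
    return d0 - d1

def repeated_cv_auc(X, y, k, repeats, rng):
    """Mean and standard deviation, over repeats, of the fold-averaged AUC."""
    run_means = []
    for _ in range(repeats):
        fold_aucs = []
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            X_train = [r for i, r in enumerate(X) if i not in held_out]
            y_train = [lab for i, lab in enumerate(y) if i not in held_out]
            centroids = fit_centroids(X_train, y_train)
            scores = [centroid_score(centroids, X[i]) for i in fold]
            fold_aucs.append(auc(scores, [y[i] for i in fold]))
        run_means.append(statistics.mean(fold_aucs))
    return statistics.mean(run_means), statistics.stdev(run_means)

# Synthetic data: 40 samples, 3 features, classes shifted by 2 standard deviations.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(2.0 * label, 1.0) for _ in range(3)] for label in y]
mean_auc, sd_auc = repeated_cv_auc(X, y, k=5, repeats=100, rng=rng)
print(f"AUC = {mean_auc:.3f} +/- {sd_auc:.3f}")
```

Repeating the whole cross-validation with fresh fold assignments is what makes a standard deviation over runs available for error bars.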
HUMAN GUIDED RANDOM FOREST (HGF) Same algorithm again except each tree is constructed from a manually selected subset of genes (or other features). 1) Find a person 2) Let them select what they think is an optimal feature set 3) Back to step one, N times 4) Aggregate. The premise is that biological knowledge can produce better-than-random decision modules and that not all biological knowledge is captured in interaction networks
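A small sketch of the aggregate step, assuming each person's selection is simply a list of gene names and each "tree" is a depth-1 decision stump. The expression values, the gene names (geneA, geneB, geneC), and the three selections are invented for illustration, as are the function names.

```python
from collections import Counter

def train_stump(samples, labels, genes):
    """Best single-threshold rule restricted to one person's chosen genes."""
    best = None  # (errors, gene, threshold, left_label, right_label)
    for gene in genes:
        for threshold in sorted({s[gene] for s in samples}):
            left = [lab for s, lab in zip(samples, labels) if s[gene] <= threshold]
            right = [lab for s, lab in zip(samples, labels) if s[gene] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, gene, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no usable split among the selected genes")
    return best[1:]

def predict_stump(stump, sample):
    gene, threshold, left_label, right_label = stump
    return left_label if sample[gene] <= threshold else right_label

def predict_forest(forest, sample):
    """Aggregate: majority vote over the trees built from human selections."""
    votes = Counter(predict_stump(stump, sample) for stump in forest)
    return votes.most_common(1)[0][0]

# Invented training data: expression values per gene, with outcome labels.
samples = [
    {"geneA": 1.0, "geneB": 0.5, "geneC": 3.0},
    {"geneA": 1.2, "geneB": 0.7, "geneC": 1.0},
    {"geneA": 0.8, "geneB": 0.4, "geneC": 2.5},
    {"geneA": 3.1, "geneB": 2.2, "geneC": 1.5},
    {"geneA": 2.9, "geneB": 2.5, "geneC": 2.8},
    {"geneA": 3.3, "geneB": 1.9, "geneC": 0.9},
]
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical feature sets, one per person asked.
selections = [["geneA", "geneC"], ["geneB"], ["geneC", "geneB"]]

forest = [train_stump(samples, labels, genes) for genes in selections]
high = predict_forest(forest, {"geneA": 3.0, "geneB": 2.0, "geneC": 2.0})
low = predict_forest(forest, {"geneA": 1.0, "geneB": 0.6, "geneC": 2.0})
print(high, low)  # 1 0
```

The only difference from the random version is where the feature sets come from, so the same tree-building and voting code can be reused for random, network-guided, or human-guided selections.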
HGF CHALLENGES 1) Find a person 2) Let them select what they think is an optimal feature set 3) Back to step one, N times • N may be large (e.g. 1,000) • Need many knowledgeable people to work hard... for free
COMBO CODE AND DEMO
Next steps 1. Better preprocessing of training data • map contigs to genes where possible, filter out clearly useless genes • identify individually predictive genes • Game • build domain-specific boards, let players pick their knowledge area • Real two player feeling with robot partner • Special cards: robber card, any-gene selector card • High scores • ??????????????