Symmetrical2

EXPERT SYSTEMS AND SOLUTIONS
Project Center For Research in Power Electronics and Power Systems
IEEE 2010, IEEE 2011 BASED PROJECTS FOR FINAL YEAR STUDENTS OF B.E
Email: expertsyssol@gmail.com,
Cell: +919952749533, +918608603634
www.researchprojects.info
OMR, CHENNAI
IEEE based Projects For
Final year students of B.E in
EEE, ECE, EIE, CSE
M.E (Power Systems)
M.E (Applied Electronics)
M.E (Power Electronics)
Ph.D Electrical and Electronics.
Training
Students can assemble their hardware in our Research labs. Experts will be guiding the projects.
EXPERT GUIDANCE IN POWER SYSTEMS AND POWER ELECTRONICS
We provide guidance and code for the following power systems areas.
1. Deregulated Systems,
2. Wind power Generation and Grid connection
3. Unit commitment
4. Economic Dispatch using AI methods
5. Voltage stability
6. Fuzzy Logic Control (FLC)
7. Transformer Fault Identifications
8. SCADA - Power system Automation

We provide guidance and code for the following power electronics areas.
1. Three phase inverter and converters
2. Buck Boost Converter
3. Matrix Converter
4. Inverter and converter topologies
5. Fuzzy based control of Electric Drives.
6. Optimal design of Electrical Machines
7. BLDC and SR motor Drives

  1. 1. EXPERT SYSTEMS AND SOLUTIONS. Email: expertsyssol@gmail.com, expertsyssol@yahoo.com. Cell: 9952749533. www.researchprojects.info PAIYANOOR, OMR, CHENNAI. Call for research projects: final year students of B.E in EEE, ECE, EI, M.E (Power Systems), M.E (Applied Electronics), M.E (Power Electronics), Ph.D Electrical and Electronics. Students can assemble their hardware in our Research labs. Experts will be guiding the projects.
  2. 2. Classification of Microarray Gene Expression Data. Geoff McLachlan, Department of Mathematics & Institute for Molecular Bioscience, University of Queensland
  3. 3. Institute for Molecular Bioscience, University of Queensland
  4. 4. “A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions.” Statistical Methods for Gene Expression: Microarrays and Proteomics
  5. 5. Outline of Talk• Introduction• Supervised classification of tissue samples – selection bias• Unsupervised classification (clustering) of tissues – mixture model-based approach
  6. 6. Vital Statistics, by C. Tilstone, Nature 424, 610-612, 2003. “DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?” (Figure caption: Branching out: cluster analysis can group samples that show similar patterns of gene expression.)
  7. 7. MICROARRAY DATA represented by a p × n matrix (x1, …, xn). xj contains the gene expressions for the p genes of the jth tissue sample (j = 1, …, n). p = no. of genes (10^3 - 10^4); n = no. of tissue samples (10 - 10^2). STANDARD STATISTICAL METHODOLOGY IS APPROPRIATE FOR n >> p; HERE p >> n.
  8. 8. Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.
  9. 9. bioArray News (2, no. 35, 2002). Arrays Hold Promise for Cancer Diagnostics. Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it will respond to a certain type of treatment, and how long the patient will probably survive. It would be useful if the gene expression signatures could distinguish between subtypes of tumours that standard methods, such as histological pathology from a biopsy, fail to discriminate, and that require different treatments.
  10. 10. van’t Veer & De Jong (2002, Nature Medicine 8). The microarray way to tailored cancer treatment. In principle, gene activities that determine the biological behaviour of a tumour are more likely to reflect its aggressiveness than general parameters such as tumour size and age of the patient. (Indistinguishable disease states in diffuse large B-cell lymphoma unravelled by microarray expression profiles - Shipp et al., 2002, Nature Med. 8)
  11. 11. Microarray to be used as routine clinical screen, by C. M. Schubert, Nature Medicine 9, 9, 2003. The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.
  12. 12. Microarrays also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei. Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).
  13. 13. CLASSIFICATION OF TISSUES: SUPERVISED CLASSIFICATION (DISCRIMINANT ANALYSIS). We OBSERVE the CLASS LABELS y1, …, yn, where yj = i if the jth tissue sample comes from the ith class (i = 1, …, g). AIM: to construct a classifier C(x) for predicting the unknown class label y of a tissue sample x. E.g. g = 2 classes: G1 - DISEASE-FREE; G2 - METASTASES.
  14. 14. LINEAR CLASSIFIER. Form C(x) = β0 + βᵀx = β0 + β1 x1 + … + βp xp for the prediction of the group label y of a future entity with feature vector x.
  15. 15. FISHER’S LINEAR DISCRIMINANT FUNCTION. y = sign C(x), where β = S⁻¹(x̄1 − x̄2) and β0 = −½ (x̄1 + x̄2)ᵀ S⁻¹ (x̄1 − x̄2), and where x̄1, x̄2, and S are the sample means and pooled sample covariance matrix found from the training data.
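Fisher's rule above is easy to reproduce numerically. A minimal sketch in Python (NumPy only; the data and function names are mine, not from the talk):

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Fit Fisher's linear discriminant from two training samples.

    X1, X2: arrays of shape (n1, p) and (n2, p) holding the feature
    vectors of groups G1 and G2.  Returns (beta0, beta) so that a new
    x is allocated to G1 when C(x) = beta0 + beta @ x > 0.
    """
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)   # beta = S^{-1}(xbar1 - xbar2)
    beta0 = -0.5 * (m1 + m2) @ beta      # places the boundary at the midpoint
    return beta0, beta

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], size=(50, 2))
X2 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))
beta0, beta = fisher_discriminant(X1, X2)
print(np.sign(beta0 + X1 @ beta))  # mostly +1, i.e. allocated to G1
```

Evaluating C(x) at either sample mean gives ±½ the Mahalanobis distance between the groups, so the two means always fall on opposite sides of the boundary.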
  16. 16. SUPPORT VECTOR CLASSIFIER (Vapnik, 1995). C(x) = β0 + β1 x1 + … + βp xp, where β0 and β are obtained as follows: minimize over β, β0 the quantity ½‖β‖² + γ Σⱼ ξⱼ, subject to ξⱼ ≥ 0 and yⱼ C(xⱼ) ≥ 1 − ξⱼ (j = 1, …, n). Here ξ1, …, ξn are the slack variables; γ = ∞ gives the separable case.
  17. 17. β̂ = Σⱼ α̂ⱼ yⱼ xⱼ, with non-zero α̂ⱼ only for those observations j for which the constraints are exactly met (the support vectors). Then C(x) = Σⱼ α̂ⱼ yⱼ xⱼᵀx + β̂0 = Σⱼ α̂ⱼ yⱼ ⟨xⱼ, x⟩ + β̂0.
  18. 18. Support Vector Machine (SVM). REPLACE x by h(x): C(x) = Σⱼ α̂ⱼ yⱼ ⟨h(xⱼ), h(x)⟩ + β̂0 = Σⱼ α̂ⱼ yⱼ K(xⱼ, x) + β̂0, where the kernel function K(xⱼ, x) = ⟨h(xⱼ), h(x)⟩ is the inner product in the transformed feature space.
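The kernel identity K(xⱼ, x) = ⟨h(xⱼ), h(x)⟩ can be checked numerically for a case where h is explicit. A small sketch using the homogeneous quadratic kernel (my choice of kernel for illustration, not one named in the talk):

```python
import numpy as np

def h(x):
    """Explicit feature map for the homogeneous quadratic kernel:
    all pairwise products x_i * x_j, so that <h(x), h(z)> = (x.z)**2."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Kernel evaluated directly, without forming h(x)."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(K(x, z), h(x) @ h(z))  # the two values agree
```

The point of the kernel trick is visible in the dimensions: h maps p features to p² monomials, while K never leaves the original p-dimensional space.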
  19. 19. HASTIE et al. (2001, Chapter 12). The Lagrange (primal) function is LP = ½‖β‖² + γ Σⱼ ξⱼ − Σⱼ αⱼ[yⱼ C(xⱼ) − (1 − ξⱼ)] − Σⱼ λⱼ ξⱼ, (1) which we minimize w.r.t. β, β0, and the ξⱼ. Setting the respective derivatives to zero, we get β = Σⱼ αⱼ yⱼ xⱼ, (2) 0 = Σⱼ αⱼ yⱼ, (3) and αⱼ = γ − λⱼ (j = 1, …, n), (4) with αⱼ ≥ 0, λⱼ ≥ 0, and ξⱼ ≥ 0 (j = 1, …, n).
  20. 20. By substituting (2) to (4) into (1), we obtain the Lagrangian dual function LD = Σⱼ αⱼ − ½ Σⱼ Σₖ αⱼ αₖ yⱼ yₖ xⱼᵀxₖ. (5) We maximize (5) subject to 0 ≤ αⱼ ≤ γ and Σⱼ αⱼ yⱼ = 0. In addition to (2) to (4), the constraints include αⱼ[yⱼ C(xⱼ) − (1 − ξⱼ)] = 0, (6) λⱼ ξⱼ = 0, (7) and yⱼ C(xⱼ) − (1 − ξⱼ) ≥ 0 (8) for j = 1, …, n. Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problem.
  21. 21. Leo Breiman (2001) Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231.Discussants include Brad Efron and David Cox
  22. 22. Selection bias in gene extraction on the basis of microarray gene-expression data Ambroise and McLachlanProceedings of the National Academy of Sciences Vol. 99, Issue 10, 6562-6566, May 14, 2002 http://www.pnas.org/cgi/content/full/99/10/6562
  23. 23. GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)• COLON Data (Alon et al., 1999)• LEUKAEMIA Data (Golub et al., 1999)
  24. 24. Since p >> n, consideration is given to the selection of suitable genes. SVM: FORWARD or BACKWARD selection (in terms of magnitude of weight βi); RECURSIVE FEATURE ELIMINATION (RFE). FISHER: FORWARD ONLY (in terms of CVE).
  25. 25. GUYON et al. (2002)LEUKAEMIA DATA: Only 2 genes are needed to obtain a zero CVE (cross-validated error rate)COLON DATA: Using only 4 genes, CVE is 2%
  26. 26. GUYON et al. (2002)“The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”
  27. 27. Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
  28. 28. Figure 2: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of leukemia tissue samples
  29. 29. Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data
  30. 30. Figure 4: Error rates of Fisher’s rule with stepwise forward selection procedure using all the leukemia data
  31. 31. Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues
  32. 32. Error Rate Estimation. Suppose there are two groups, G1 and G2. C(x) is a classifier formed from the data set (x1, x2, x3, …, xn). The apparent error is the proportion of the data set misallocated by C(x).
  33. 33. Cross-Validation. From the original data set, remove x1 to give the reduced set (x2, x3, …, xn). Then form the classifier C⁽¹⁾(x) from this reduced set. Use C⁽¹⁾(x1) to allocate x1 to either G1 or G2.
  34. 34. Repeat this process for the second data point, x2, so that this point is assigned to either G1 or G2 on the basis of the classifier C⁽²⁾(x2). And so on up to xn.
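The leave-one-out procedure described on these two slides can be sketched directly. The classifier below is a stand-in nearest-mean rule of my own choosing; the talk does not prescribe one:

```python
import numpy as np

def nearest_mean_classifier(Xtr, ytr):
    """Return C(x): allocate x to the group with the closer sample mean."""
    m1 = Xtr[ytr == 1].mean(axis=0)
    m2 = Xtr[ytr == 2].mean(axis=0)
    return lambda x: 1 if np.linalg.norm(x - m1) < np.linalg.norm(x - m2) else 2

def loo_cv_error(X, y, build=nearest_mean_classifier):
    """Leave-one-out CV: drop x_j, refit C^(j), then classify x_j with it."""
    wrong = 0
    for j in range(len(X)):
        keep = np.arange(len(X)) != j
        C = build(X[keep], y[keep])       # classifier formed without x_j
        wrong += C(X[j]) != y[j]
    return wrong / len(X)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(3, 1, (30, 2)), rng.normal(-3, 1, (30, 2))])
y = np.array([1] * 30 + [2] * 30)
print(loo_cv_error(X, y))  # near zero for such well-separated groups
```

Note that, per the selection-bias theme of this talk, any gene selection would have to sit inside the loop (redone on each reduced set), not outside it.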
  35. 35. Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples
  36. 36. ADDITIONAL REFERENCES. Selection bias ignored: XIONG et al. (2001, Molecular Genetics and Metabolism); XIONG et al. (2001, Genome Research); ZHANG et al. (2001, PNAS). Aware of selection bias: SPANG et al. (2001, In Silico Biology); WEST et al. (2001, PNAS); NGUYEN and ROCKE (2002).
  37. 37. BOOTSTRAP APPROACH. Efron’s (1983, JASA) .632 estimator: B.632 = .368 × AE + .632 × B1, where B1 is the bootstrap error when the rule R*k is applied to a point not in the training sample. A Monte Carlo estimate of B1 is B1 = (1/n) Σⱼ Eⱼ, where Eⱼ = Σₖ Iⱼₖ Qⱼₖ / Σₖ Iⱼₖ, with Iⱼₖ = 1 if xⱼ ∉ kth bootstrap sample and 0 otherwise, and Qⱼₖ = 1 if R*k misallocates xⱼ and 0 otherwise.
  38. 38. Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR A(w) = (1 − w)AE + w CV2E, where w = 0.5. McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the values of p, Δ, and n1/n2.
  39. 39. The .632+ estimate of Efron & Tibshirani (1997, JASA): B.632+ = (1 − w)AE + w B1, where w = .632 / (1 − .368 r), and r = (B1 − AE) / (γ − AE) is the relative overfitting rate, with γ = Σᵢ pᵢ(1 − qᵢ) an estimate of the no-information error rate. If r = 0, then w = .632 and B.632+ = B.632; if r = 1, then w = 1 and B.632+ = B1.
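Efron's .632 estimator combines the apparent error AE with the leave-out bootstrap error B1 exactly as in the formula above. A rough sketch (the nearest-mean rule and the simulated data are illustrative assumptions, not part of the talk):

```python
import numpy as np

def nearest_mean(Xtr, ytr):
    """Toy rule R*: allocate x to the group with the closer sample mean."""
    m = {g: Xtr[ytr == g].mean(axis=0) for g in (1, 2)}
    return lambda x: min(m, key=lambda g: np.linalg.norm(x - m[g]))

def b632_error(X, y, build, B=100, rng=None):
    """B.632 = .368*AE + .632*B1, with B1 estimated by Monte Carlo:
    average the errors of each bootstrap-trained rule over the points
    left out of that bootstrap sample (I_jk = 1), as on slide 37."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    C = build(X, y)
    ae = np.mean([C(X[j]) != y[j] for j in range(n)])   # apparent error
    errs, counts = np.zeros(n), np.zeros(n)
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # kth bootstrap sample
        out = np.setdiff1d(np.arange(n), idx)        # points with I_jk = 1
        Ck = build(X[idx], y[idx])
        for j in out:
            errs[j] += Ck(X[j]) != y[j]              # Q_jk
            counts[j] += 1
    b1 = np.mean(errs[counts > 0] / counts[counts > 0])
    return 0.368 * ae + 0.632 * b1

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2, 1, (30, 2)), rng.normal(-2, 1, (30, 2))])
y = np.array([1] * 30 + [2] * 30)
print(b632_error(X, y, nearest_mean, B=50))
```

The .368/.632 weights reflect that roughly 63.2% of the points appear in each bootstrap sample, so B1 on its own is pessimistic and AE on its own optimistic.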
  40. 40. One concern is the heterogeneity of the tumours themselves, which consist of a mixture of normal and malignant cells, with blood vessels in between. Even if one pulled out some cancer cells from a tumour, there is no guarantee that those are the cells that are going to metastasize, just because tumours are heterogeneous. “What we really need are expression profiles from hundreds or thousands of tumours linked to relevant, and appropriate, clinical data.” - John Quackenbush
  41. 41. UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS): INFER the CLASS LABELS y1, …, yn of x1, …, xn. Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes. Eisen, Spellman, Brown, & Botstein (1998, PNAS)
  42. 42. Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. “In the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.” (Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)
  43. 43. Attention is now turning towards a model-based approach to the analysis of microarray data. For example: Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9. Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18. Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7. Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3. Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.
  44. 44. The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance, e.g. k-means. Usually, there is no a priori metric (or, equivalently, user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.
  45. 45. In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
  46. 46. x = (Height, Weight, BP)ᵀ versus (H + W, H − W, BP)ᵀ
  47. 47. MIXTURE OF g NORMAL COMPONENTS: f(x) = π1 φ(x; μ1, Σ1) + … + πg φ(x; μg, Σg), where −2 log φ(x; μ, Σ) = (x − μ)ᵀΣ⁻¹(x − μ) + constant. Here (x − μ)ᵀΣ⁻¹(x − μ) is the MAHALANOBIS DISTANCE, versus the EUCLIDEAN DISTANCE (x − μ)ᵀ(x − μ).
  48. 48. MIXTURE OF g NORMAL COMPONENTS: f(x) = π1 φ(x; μ1, Σ1) + … + πg φ(x; μg, Σg). k-means corresponds to Σ1 = … = Σg = σ²I: SPHERICAL CLUSTERS.
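The connection between k-means and the spherical mixture model comes down to the distance: with Σ = σ²I, the Mahalanobis distance reduces to the Euclidean distance divided by σ², so ranking components by either distance gives the same assignment. A short numerical check (values are illustrative):

```python
import numpy as np

def mahalanobis2(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
sigma2 = 4.0
# Spherical covariance: Mahalanobis reduces to Euclidean / sigma^2.
print(mahalanobis2(x, mu, sigma2 * np.eye(2)), (x - mu) @ (x - mu) / sigma2)
```

With a general Σ the two distances differ, which is exactly why the full normal mixture can recover elliptical clusters that k-means cannot.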
  49. 49. Equal spherical covariance matrices
  50. 50. Crab Data. Figure 6: Plot of Crab Data
  51. 51. Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.
  52. 52. With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g − 1) components. f(x) = π1 φ(x; μ1, Σ1) + … + πi φ(x; μi, Σi) + … + πg φ(x; μg, Σg)
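The outright assignment rule can be sketched as follows: compute πᵢ φ(x; μᵢ, Σᵢ) for each component, normalize to posterior probabilities, and take the maximizer. (The toy parameters below are mine, for illustration only.)

```python
import numpy as np

def normal_pdf(x, mu, Sigma):
    """Multivariate normal density phi(x; mu, Sigma)."""
    p = len(mu)
    d = x - mu
    q = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

def assign(x, pis, mus, Sigmas):
    """Outright assignment: the component i maximizing pi_i * phi(x; mu_i, Sigma_i).
    Also returns the posterior probabilities tau_i(x)."""
    weighted = [pi * normal_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas)]
    tau = np.array(weighted) / sum(weighted)
    return int(np.argmax(tau)), tau

pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([4.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]
label, tau = assign(np.array([3.5, 0.2]), pis, mus, Sigmas)
print(label)  # 1: the point lies much nearer the second component
```

Unlike k-means, the posterior τᵢ(x) also quantifies how confident the assignment is, which matters for points near cluster boundaries.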
  53. 53. http://www.maths.uq.edu.au/~gjm McLachlan and Peel (2000), Finite Mixture Models. Wiley.
  54. 54. Estimation of Mixture Distributions It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data. McLachlan and Krishnan (1997, Wiley)
  55. 55. If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities. With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.
  56. 56. The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.
  57. 57. Two Clustering Problems:• Clustering of genes on basis of tissues – genes not independent• Clustering of tissues on basis of genes - latter is a nonstandard problem in cluster analysis (n << p)
  58. 58. Mixture SoftwareMcLachlan, Peel, Adams, and Basford (1999) http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
  59. 59. EMMIX for Windowshttp://www.maths.uq.edu.au/~gjm/EMMIX_Demo/emmix.html
  60. 60. PROVIDES A MODEL-BASED APPROACH TO CLUSTERING. McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18, 413-422. http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
  61. 61. Example: Microarray Data. Colon Data of Alon et al. (1999): n = 62 (40 tumours; 22 normals) tissue samples of p = 2,000 genes in a 2,000 × 62 matrix.
  62. 62. Mixture of 2 normal components
  63. 63. Mixture of 2 t components
  64. 64. Mixture of 2 t components
  65. 65. Mixture of 3 t components
  66. 66. In this process, the genes are being treated anonymously. We may wish to incorporate existing biological information on the function of genes into the selection procedure. Lottaz and Spang (2003, Proceedings of the 54th Meeting of the ISI) structure the feature space by using a functional grid provided by the Gene Ontology annotations.
  67. 67. Clustering of COLON Data Genes using EMMIX-GENE
  68. 68. Grouping for Colon Data (heat maps of gene groups 1-20)
  69. 69. Clustering of COLON Data Tissues using EMMIX-GENE
  70. 70. Grouping for Colon Data (heat maps of gene groups 1-20)
  71. 71. Mixtures of Factor Analyzers. A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data. One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).
  72. 72. f(xⱼ) = Σᵢ πᵢ φ(xⱼ; μᵢ, Σᵢ), where Σᵢ = Bᵢ Bᵢᵀ + Dᵢ (i = 1, …, g), Bᵢ is a p × q matrix, and Dᵢ is a diagonal matrix.
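The parameter saving from the factor-analytic form Σᵢ = Bᵢ Bᵢᵀ + Dᵢ is easy to see by counting: an unrestricted p × p covariance has p(p+1)/2 free entries, against pq + p for the loadings plus uniquenesses. A sketch with illustrative dimensions (my choice of p and q):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 10, 2                           # 10 variables driven by 2 factors
B = rng.normal(size=(p, q))            # p x q factor loading matrix B_i
D = np.diag(rng.uniform(0.1, 1.0, p))  # diagonal "uniquenesses" D_i

Sigma = B @ B.T + D                    # component covariance: rank-q + diagonal

# Free parameters: unrestricted covariance vs. factor-analytic form.
print(p * (p + 1) // 2, p * q + p)     # 55 versus 30
```

For microarray data with p in the thousands, this reduction (and the choice of a small q, q = 4 or 8 in the analyses later in the talk) is what makes fitting the mixture feasible at all.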
  73. 73. Number of Components in a Mixture Model. Testing for the number of components, g, in a mixture is an important but very difficult problem which has not been completely resolved.
  74. 74. Order of a Mixture Model. A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.
  75. 75. Likelihood Ratio Test Statistic. An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, −2 log λ. Suppose we wish to test the null hypothesis H0: g = g0 versus H1: g = g1 for some g1 > g0.
  76. 76. We let Ψ̂ᵢ denote the MLE of Ψ calculated under Hᵢ (i = 0, 1). Then the evidence against H0 will be strong if λ is sufficiently small, or equivalently, if −2 log λ is sufficiently large, where −2 log λ = 2{log L(Ψ̂1) − log L(Ψ̂0)}.
  77. 77. Bootstrapping the LRTS. McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing H0: g = g0 versus H1: g = g1 for a specified value of g0.
  78. 78. Bayesian Information Criterion. The Bayesian information criterion (BIC) of Schwarz (1978) is given by −2 log L(Ψ̂) + d log n as the penalized log likelihood criterion to be minimized in model selection, including the present situation for the number of components g in a mixture model.
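BIC as defined above can be used to pick g on simulated data: fit each candidate model and compare −2 log L(Ψ̂) + d log n. The sketch below fits g = 1 in closed form and g = 2 with a few hand-rolled one-dimensional EM steps (the data, starting values, and iteration count are my own illustrations, not EMMIX's actual fitting procedure):

```python
import numpy as np

def loglik(x, pis, mus, sds):
    """Log likelihood of a univariate normal mixture."""
    dens = sum(p * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
               for p, m, s in zip(pis, mus, sds))
    return np.sum(np.log(dens))

def bic(ll, d, n):
    """Schwarz's criterion -2 log L + d log n (smaller is better)."""
    return -2 * ll + d * np.log(n)

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
n = len(x)

# g = 1: closed-form MLE; d = 2 parameters (mean, variance).
bic1 = bic(loglik(x, [1.0], [x.mean()], [x.std()]), 2, n)

# g = 2: EM from a rough start; d = 5 parameters (1 weight, 2 means, 2 sds).
pis, mus, sds = np.array([0.5, 0.5]), np.array([x.min(), x.max()]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: posterior membership probabilities (constants cancel in the ratio)
    f = np.array([p * np.exp(-0.5 * ((x - m) / s) ** 2) / s
                  for p, m, s in zip(pis, mus, sds)])
    tau = f / f.sum(axis=0)
    # M-step: update mixing proportions, means, standard deviations
    pis = tau.mean(axis=1)
    mus = (tau * x).sum(axis=1) / tau.sum(axis=1)
    sds = np.sqrt((tau * (x - mus[:, None]) ** 2).sum(axis=1) / tau.sum(axis=1))
bic2 = bic(loglik(x, pis, mus, sds), 5, n)
print(bic1 > bic2)  # True: BIC favours the two-component model here
```

The d log n penalty is what stops the criterion from always preferring more components, since adding a component can only increase the maximized log likelihood.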
  79. 79. Gap statistic (Tibshirani et al., 2001)Clest (Dudoit and Fridlyand, 2002)
  80. 80. Analysis of LEUKAEMIA Data using EMMIX-GENE
  81. 81. Grouping for Leukemia Data (heat maps of gene groups 1-20)
  82. 82. (heat maps of gene groups 21-40)
  83. 83. Breast cancer data set in van’t Veer et al. (van’t Veer et al., 2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415). These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.
  84. 84. The Economist (US), February 2, 2002 The chips are down; Diagnosing breast cancer (Gene chips have shown that there are two sorts of breast cancer)
  85. 85. Nature (2002, 4 July Issue, 418). News feature (Ball). Data visualization: Picture this
  86. 86. Colour-coded: this plot of gene-expression data shows breast tumours falling into two groups
  87. 87. Microarray data from 98 patients with primary breast cancers, with p = 24,881 genes: 44 from the good prognosis group (remained metastasis free after a period of more than 5 years); 34 from the poor prognosis group (developed distant metastases within 5 years); 20 with a hereditary form of cancer (18 with BRCA1; 2 with BRCA2)
  88. 88. Pre-processing filter of van’t Veer et al.: only genes with both a P-value less than 0.01 and at least a two-fold difference in more than 5 out of the 98 tissues were retained. This reduces the data set to 4,869 genes.
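One plausible reading of this filter is sketched below, assuming the expression values are log2 ratios so that a two-fold change means |log2 ratio| ≥ 1. The thresholds follow the slide, but the data and the exact form of the test are illustrative, not van't Veer et al.'s actual pipeline:

```python
import numpy as np

def filter_genes(expr, pvals, fold=2.0, alpha=0.01, min_tissues=5):
    """Keep genes with p-value < alpha AND at least a `fold`-fold change
    in more than `min_tissues` of the samples.
    expr: log2 expression ratios, shape (genes, tissues)."""
    big_change = (np.abs(expr) >= np.log2(fold)).sum(axis=1) > min_tissues
    return (pvals < alpha) & big_change

rng = np.random.default_rng(5)
expr = rng.normal(0, 0.3, size=(1000, 98))          # mostly flat genes
expr[:100] += rng.choice([-2, 2], size=(100, 98))   # 100 strongly varying genes
pvals = np.where(np.arange(1000) < 100, 0.001, 0.5)
keep = filter_genes(expr, pvals)
print(keep.sum())  # the 100 informative genes survive the filter
```

Requiring both conditions is the point: a gene must be statistically reliable and show a biologically meaningful fold change before it enters the clustering.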
  89. 89. Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours
  90. 90. Unsupervised Classification Analysis Using EMMIX-GENESteps used in the application of EMMIX-GENE: 1. Select the most relevant genes from this filtered set of 4,869 genes. The set of retained genes is thus reduced to 1,867. 2. Cluster these 1,867 genes into forty groups. The majority of gene groups produced were reasonably cohesive and distinct. 3. Using these forty group means, cluster the tissue samples into two and three components using a mixture of factor analyzers model with q = 4 factors.
  91. 91. Heat Map of Top 1,867 Genes
  92. 92. (heat maps of gene groups 1-20)
  93. 93. (heat maps of gene groups 21-40)
  94. 94. Group statistics (i = group number; mi = number of genes in group i; Ui = -2 log λi):
 i   mi    Ui      i   mi    Ui      i   mi    Ui      i   mi    Ui
 1  146  112.98   11   66  25.72   21   44  13.77   31   53   9.84
 2   93   74.95   12   38  25.45   22   30  13.28   32   36   8.95
 3   61   46.08   13   28  25.00   23   25  13.10   33   36   8.89
 4   55   35.20   14   53  21.33   24   67  13.01   34   38   8.86
 5   43   30.40   15   47  18.14   25   12  12.04   35   44   8.02
 6   92   29.29   16   23  18.00   26   58  12.03   36   56   7.43
 7   71   28.77   17   27  17.62   27   27  11.74   37   46   7.21
 8   20   28.76   18   45  17.51   28   64  11.61   38   19   6.14
 9   23   28.44   19   80  17.28   29   38  11.38   39   29   4.64
10   23   27.73   20   55  13.79   30   21  10.72   40   35   2.44
  95. 95. Heat Map of Genes in Group G1
  96. 96. Heat Map of Genes in Group G2
  97. 97. Heat Map of Genes in Group G3
  98. 98. 1. A change in gene expression is apparent between the sporadic (first 78 tissue samples) and hereditary (last 20 tissue samples) tumours.2. The final two tissue samples (the two BRCA2 tumours) show consistent patterns of expression. This expression is different from that exhibited by the set of BRCA1 tumours.3. The problem of trying to distinguish between the two classes, patients who were disease-free after 5 years Π1 and those with metastases within 5 years Π2, is not straightforward on the basis of the gene expressions.
  99. 99. Selection of Relevant Genes. We compared the genes selected by EMMIX-GENE with those genes retained in the original study by van’t Veer et al. (2002). van’t Veer et al. used an agglomerative hierarchical algorithm to organise the genes into dominant gene groups. Two of these groups were highlighted in their paper, with their genes corresponding to biologically significant features.
  100. 100. Matches between the van’t Veer et al. clusters and the genes retained by the select-genes step:
Cluster A: 40 genes, 24 matches - contains genes co-regulated with the ER-a gene (ESR1).
Cluster B: 40 genes, 23 matches - contains “co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells”.
We can see that of the 80 genes identified by van’t Veer et al., only 47 are retained by the select-genes step of the EMMIX-GENE algorithm.
  101. 101. Comparing Clusters from the Hierarchical Algorithm with those from the EMMIX-GENE Algorithm:
Cluster A: EMMIX-GENE group 2 - 21 genes matched (87.5%); group 3 - 2 genes (8.33%); group 14 - 1 gene (4.17%).
Cluster B: EMMIX-GENE group 17 - 18 genes matched (78.3%); group 19 - 1 gene (4.35%); group 21 - 4 genes (17.4%).
Subsets of these 47 genes appeared inside several of the 40 groups produced by the cluster-genes step of EMMIX-GENE.
  102. 102. Genes Retained by EMMIX-GENE Appearing in Cluster A (vertical blue lines indicate the three groups of tumours)
  103. 103. Genes Rejected by EMMIX-GENE Appearing in Cluster A
  104. 104. Genes Retained by EMMIX-GENE Appearing in Cluster B
  105. 105. Genes Rejected by EMMIX-GENE Appearing in Cluster B
  106. 106. Assessing the Number of Tissue Groups. To assess the number of components g to be used in the normal mixture, the likelihood ratio statistic λ was adopted, and the resampling approach was used to assess the P-value. By proceeding sequentially, testing the null hypothesis H0: g = g0 versus the alternative hypothesis H1: g = g0 + 1, starting with g0 = 1 and continuing until a non-significant result was obtained, it was concluded that g = 3 components were adequate for this data set.
  107. 107. Clustering Tissue Samples on the Basis of Gene Groups using EMMIX-GENE Tissue samples can be subdivided into two groups corresponding to 78 sporadic tumours and 20 hereditary tumours. When the two cluster assignment of EMMIX-GENE is compared to this genuine grouping, only 1 of the 20 hereditary tumour patients is misallocated, although 37 of the sporadic tumour patients are incorrectly assigned to the hereditary tumour cluster.
  108. 108. Using a mixture of factor analyzers model with q = 8 factors, we would misallocate: 7 out of the 44 members of Π1; 24 out of the 34 members of Π2; and 1 of the 18 BRCA1 samples. The misallocation rate of 24/34 for the second class, Π2, is not surprising given both the gene expressions as summarized in the groups of genes and that we are classifying the tissues in an unsupervised manner without using the knowledge of their true classification.
  109. 109. Supervised Classification. When knowledge of the groups’ true classification is used (van’t Veer et al.), the reported error rate was approximately 50% for members of Π2 when allowance was made for the selection bias in forming a classifier on the basis of an optimal subset of the genes. Further analysis of this data set in a supervised context confirms the difficulty in trying to discriminate between the disease-free class Π1 and the metastases class Π2. (Tibshirani and Efron, 2002, “Pre-Validation and Inference in Microarrays”, Statistical Applications in Genetics and Molecular Biology 1)
  110. 110. Investigating Underlying Signatures With Other Clinical Indicators. The three clusters constructed by EMMIX-GENE were investigated in order to determine whether they followed a pattern contingent upon the clinical predictors of histological grade, angioinvasion, oestrogen receptor, and lymphocytic infiltrate.
  111. 111. Microarrays have become promising diagnostic tools for clinical applications. However, large-scale screening approaches in general, and microarray technology in particular, inescapably lead to the challenging problem of learning from high-dimensional data.
  112. 112. Hope to see you in Cairns in 2004!