Introduction to Bioinformatics: Mining Your Data

Gerry Lushington
Lushington in Silico
modeling / informatics consultant
What is Data Mining?
 Use of computational methods to perceive trends in data that
 can be used to explain or predict important outcomes or
 properties

Applicable across many disciplines:
Molecular bioinformatics

Medical Informatics

Health Informatics

Biodiversity informatics
Example Applications:
     Find relationships between:

Convenient Observables                vs.    Important Outcomes
a)    Relative gene expression data          1.   Disease susceptibility
b)    Relative protein abundance data        2.   Drug efficacy
c)    Relative lipid & metabolite profiles   3.   Toxin susceptibility
d)    Glycosylation variants                 4.   Immunity
e)    SNPs, alleles                          5.   Genetic disorders
f)    Cellular traits                        6.   Microbial virulence
g)    Organism traits                        7.   Species adaptive success
h)    Behavioral traits                      8.   Species complementarity
i)    Case history
Goals for this lecture:
Focus on Data Mining: how to approach your data and use it to
understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be
useful?




        Don’t worry about grasping everything:
     K-INBRE Bioinformatics Core is here to help!!
Basic Data Mining:
Find relationships between:
a) Easy to measure properties   vs.
b) Important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for
outcomes in b)

Use relationships to predict outcomes in new cases where
outcome has not yet been measured
Basic Data Mining: simple measurables
Basic Data Mining: general observation

[Cartoon faces of varying color and size, grouped as Unhappy vs. Happy]
Basic Data Mining: relationship (#1)

[Faces sorted into Unhappy vs. Happy by color alone]

Blue = happy; Red = unhappy   accuracy = 12/20 = 60%
Basic Data Mining: relationship (#2)

[Faces sorted into Unhappy vs. Happy by color and size]

Blue + BIG Red = happy; little red = unhappy   accuracy = 17/20 = 85%
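The gain from rule #2 over rule #1 is simply better agreement with the observed outcomes. As a sketch in Python (the five toy faces and their labels below are invented for illustration, not the slide's 20), scoring a candidate rule is just counting agreements:

```python
# Toy version of the face example: each sample has a color and a size,
# plus a known outcome. Data are invented for illustration.
samples = [
    {"color": "blue", "size": "big",    "outcome": "happy"},
    {"color": "red",  "size": "big",    "outcome": "happy"},
    {"color": "red",  "size": "little", "outcome": "unhappy"},
    {"color": "blue", "size": "little", "outcome": "happy"},
    {"color": "red",  "size": "little", "outcome": "unhappy"},
]

def rule1(s):
    # Rule #1: blue -> happy, red -> unhappy
    return "happy" if s["color"] == "blue" else "unhappy"

def rule2(s):
    # Rule #2: blue or BIG red -> happy, little red -> unhappy
    if s["color"] == "blue" or (s["color"] == "red" and s["size"] == "big"):
        return "happy"
    return "unhappy"

def accuracy(rule, data):
    # Fraction of samples where the rule's prediction matches the outcome
    return sum(rule(s) == s["outcome"] for s in data) / len(data)

print(accuracy(rule1, samples))   # rule1 misses the big red face
print(accuracy(rule2, samples))
```

Adding the size condition repairs rule #1's one error on this toy set, mirroring how the slide's second rule lifts accuracy from 60% to 85%.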
Data Mining: procedure

1.   Data Acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Spectrum figure: peak heights? peak positions?]

Key issues include:
a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Use controls to scale data

Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers

[Bar charts: control (C) and samples 1-3 in each batch, before and after scaling to the control]
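The control-scaling step above can be sketched as follows (the batch values are invented; dividing each sample by its batch's control intensity is one common normalization convention, assumed here):

```python
# Each batch holds a control measurement C followed by samples 1-3.
# Dividing every value by that batch's control puts batches on a common
# scale, cancelling batch-level instrument bias. Values are invented.
batches = [
    {"C": 10.0, "samples": [12.0, 8.0, 20.0]},
    {"C": 20.0, "samples": [24.0, 16.0, 40.0]},  # same biology, 2x gain
]

def scale_to_control(batch):
    # Express each sample relative to its own batch control
    return [v / batch["C"] for v in batch["samples"]]

for b in batches:
    print(scale_to_control(b))
```

Both batches print the same relative profile once scaled, even though the raw intensities differ by a factor of two.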
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

Subjective (requires experience and/or domain knowledge)

Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
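Flagrant outliers can be flagged with a simple z-score screen, sketched here with the Python standard library (the readings and the 2-sigma cutoff are invented for illustration; a real workflow would tune the cutoff to the assay):

```python
import statistics

def flag_outliers(values, z_cut=3.0):
    """Flag values more than z_cut standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sd > z_cut]

readings = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]  # last value is flagrant
print(flag_outliers(readings, z_cut=2.0))
```

Note that a single extreme value inflates both the mean and the standard deviation, so for small or badly contaminated datasets robust variants (e.g. median-based screens) are often preferred.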
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Bar charts of features 1-4 across four sample groups; uninformative features marked x]


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Bar charts of features 1-4 across four sample groups; one feature marked x]


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Bar charts of features 1-4 across four sample groups; one feature marked x]


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
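Criterion (c), correlation with the target attribute, can be sketched as ranking features by absolute Pearson correlation. The feature profiles and outcome values below are invented for illustration:

```python
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Each feature is a profile over the same four samples (invented data)
features = {
    "gene_A": [1.0, 2.0, 3.0, 4.0],   # tracks the outcome closely
    "gene_B": [5.0, 5.1, 4.9, 5.0],   # nearly constant: little information
    "gene_C": [4.0, 3.0, 2.2, 1.0],   # anti-correlated, still informative
}
outcome = [0.9, 2.1, 3.0, 4.1]

# Rank by |r| so strong negative correlations also rank highly
ranked = sorted(features, key=lambda f: abs(pearson(features[f], outcome)),
                reverse=True)
print(ranked)
```

The ranking keeps the anti-correlated feature near the top, which is why the absolute value matters: a reliably negative relationship is just as predictive as a positive one.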

Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

• Train preliminary models based on random sets of properties
• Evaluate models according to correlative or predictive performance
• Experiment with promising sets, adding or deleting descriptors to gauge impact on performance


Which out of many measurable properties relate to outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
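The iterative scheme above can be sketched as greedy forward selection. Here the "preliminary model" is stood in for by a simple correlation score; the features, outcome, and scoring choice are all illustrative assumptions, and a real run would train and cross-validate an actual classifier at each step:

```python
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def score(subset, features, outcome):
    # Toy stand-in for "train a preliminary model": correlate the summed
    # feature profile with the outcome.
    combined = [sum(features[f][i] for f in subset)
                for i in range(len(outcome))]
    return abs(pearson(combined, outcome))

def forward_select(features, outcome):
    # Greedily add whichever feature most improves the score; stop when
    # no addition helps.
    chosen, best = [], 0.0
    while True:
        gains = [(score(chosen + [f], features, outcome), f)
                 for f in features if f not in chosen]
        top, f = max(gains)
        if top <= best:
            return chosen
        chosen.append(f)
        best = top

# Invented data: outcome is exactly f1 + f2; f3 is noise
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [1.0, 0.0, 1.0, 0.0],
    "f3": [0.1, -0.2, 0.1, 0.0],
}
outcome = [2.0, 2.0, 4.0, 4.0]
print(forward_select(features, outcome))
```

The loop picks f1 first (largest single-feature correlation), adds f2 because the pair explains the outcome better, and declines f3 because adding noise does not improve the score.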
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


 Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: outcome y against feature x with a correlative fit]


Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
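Correlative methods in the simplest case mean fitting a line to the y-vs.-x scatter and reading predictions off the line. A minimal ordinary-least-squares sketch (data invented):

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit: y ~ slope*x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]   # roughly y = 2x (invented)
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))

# Predict the outcome for a new, unmeasured sample
predict = lambda x: slope * x + intercept
print(round(predict(5.0), 1))
```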
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: y vs. x with thresholds -n and +n marking NO / YES prediction regions]

Predict which sample will have which outcome?

a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: samples in the x1-x2 feature plane]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: four clusters y1-y4 in the x1-x2 plane]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

Cluster legend:
y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
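Distance-based clustering can be sketched with a minimal k-means loop. The 2-D points and starting centroids below are invented, and a real analysis would use more features and a library implementation:

```python
def kmeans(points, centroids, iters=10):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups in the x1-x2 plane (invented data)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print([len(c) for c in clusters])
```

Each recovered cluster could then be labeled with an outcome (as in the diabetes legend above) so new samples can be classified by their nearest centroid.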
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: boundary in the x1-x2 plane separating "Resistant to type I" from "Susceptible to type I"]
Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: thresholds a and b on x2 and c on x1 carving the x1-x2 plane into resistant and susceptible regions]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
(errors: E = 9)
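The learned rule translates directly into code. The numeric values of the thresholds a, b, c and the test points below are invented stand-ins; in practice the thresholds are fit to training data:

```python
# Thresholds read off the slide's rule; numeric values are invented.
A, B, C = 2.0, 3.0, 5.0

def classify(x1, x2):
    # Rule-learning output: nested threshold tests on the two features
    if x1 < C and x2 > A:
        return "resistant"
    elif x1 > C and x2 > B:
        return "resistant"
    return "susceptible"

print(classify(4.0, 2.5))   # x1 < c and x2 > a
print(classify(6.0, 3.5))   # x1 > c and x2 > b
print(classify(6.0, 1.0))   # neither condition holds
```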
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Histograms: resistant vs. susceptible distributions along x1 (threshold a), along x2 (threshold b), and along the combined score Fx1 - Gx2 (threshold c)]

Predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability

If Fx1 - Gx2 < c then resistant
Else susceptible
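The weighted-score rule (classify by whether Fx1 - Gx2 falls below c) is a single comparison once the weights are trained. The weight and threshold values below are invented for illustration:

```python
# Weighted linear score from the slide; F, G and the threshold C are
# invented stand-ins for trained values.
F, G, C = 1.0, 2.0, -1.5

def classify(x1, x2):
    # Combine both features into one score, then apply one threshold
    return "resistant" if F * x1 - G * x2 < C else "susceptible"

print(classify(1.0, 2.0))   # score = 1 - 4 = -3, below C
print(classify(3.0, 1.0))   # score = 3 - 2 = 1, above C
```

Combining the features into one score often separates the classes better than thresholding either feature alone, which is the point of the slide's third histogram.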
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot in the x1-x2 plane: Resistant (Neg.) vs. Susceptible (Pos.) regions]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot in the x1-x2 plane: Resistant (Neg.) vs. Susceptible (Pos.) regions]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Sensitivity = TP / (TP + FN) = 67 / 72
FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 - FPR
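All of these metrics follow directly from the four confusion-matrix counts. The sketch below uses counts consistent with the sensitivity (67/72) and FPR (6/81) shown above, i.e. TP=67, FN=5, FP=6, TN=75; these are inferred values, and they total 153 samples:

```python
# Confusion-matrix counts inferred from sensitivity = 67/72 and
# FPR = 6/81 (72 true positives + negatives split as below).
TP, FN, FP, TN = 67, 5, 6, 75

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)
fpr         = FP / (TN + FP)
specificity = 1 - fpr

print(round(sensitivity, 3), round(fpr, 3), round(accuracy, 3))
```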
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[Scatter plot: the same data with the decision boundary shifted by varying model stringency from less to more stringent]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 - FPR
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[ROC plot: Sensitivity vs. FPR]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration

[ROC plot: Sensitivity vs. FPR, with the area under the curve shaded]

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation

Area under the curve is an excellent measure of model performance:
1.0 = perfect model; 0.5 = random
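The ROC curve and its area can be computed by sweeping the decision threshold over the model's scores, as the stringency slides above illustrate. A self-contained sketch with invented scores and labels, integrating sensitivity vs. FPR with the trapezoid rule:

```python
def roc_auc(scores, labels):
    """ROC AUC: sweep the threshold over all score values, collect
    (FPR, sensitivity) points, and integrate with the trapezoid rule."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

# Invented scores: higher score = more likely positive (label 1)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(roc_auc(scores, labels))
```

Here the model ranks 8 of the 9 positive/negative pairs correctly, so the AUC is 8/9; a perfect ranking would give 1.0 and a random one about 0.5, matching the slide.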
Data Mining: procedure

1.   Data acquisition                     Predictions are imperfect due to:
2.   Data Preprocessing                   • Imperfect Algorithms
3.   Feature Selection                    • Imperfect Data
4.   Classification
5.   Validation
6.   Prediction & Iteration


Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Cross-Validation:

• Carefully monitor features that are useful across different
  independent data subsets
• This can be accomplished with N-fold cross-validation:

[Diagram: five trials; in each, a different fifth of the data is held out as Test while the remainder is used to Train]

Model performance = mean predictive performance over 5 trials


• Best feature selection and classification algorithms will yield
  best consistent performance across independent trials
• Best features will be consistently important across trials
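N-fold cross-validation as described above: each fold is held out once for testing while the rest trains the model, and performance is the mean over the trials. The sketch below uses an invented majority-vote "model" purely to make the mechanics concrete; any train-and-score routine can be plugged in:

```python
def k_fold_cv(samples, k, train_and_score):
    """Split samples into k folds; hold each fold out once for testing,
    and return the mean test score over the k trials."""
    folds = [samples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Toy stand-in model (invented): always predict the training set's
# majority label, then score accuracy on the held-out fold.
def majority_model(train, test):
    ones = sum(y for _, y in train)
    guess = 1 if ones * 2 > len(train) else 0
    return sum(y == guess for _, y in test) / len(test)

# Invented samples: (feature, label) pairs
samples = [(x, 1 if x > 5 else 0) for x in range(10)]
print(k_fold_cv(samples, 5, majority_model))
```

Real workflows shuffle or stratify the folds before splitting so each fold reflects the overall class balance.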
Data Mining: procedure

1.   Data acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration


Analysis is only useful if it is used; it only improves if it is tested
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and
   greater understanding
Questions?


      Lushington in Silico
Geraldlushington3117 at aol.com
     Geraldlushington.org
