Data Mining


  • Little control over data collection leads to challenges in the data. ML algorithms handle some of these issues. HPC is needed to handle the computational requirements raised by ML algorithms and big data.
  • Practical definition. Workshop definition.
  • KDD vs. data mining vs. machine learning. Role of processing and management. Role of computational infrastructure. Keep human processing to a minimum. Optimize integration.

    1. 1. SDSC Summer Institute 2005 TUTORIAL Data Mining for Scientific Applications Peter Shin Hector Jasso San Diego Supercomputer Center UCSD
    2. 2. Overview <ul><li>Introduction to data mining </li></ul><ul><ul><li>Definitions, concepts, applications </li></ul></ul><ul><ul><li>Machine learning methods for KDD </li></ul></ul><ul><ul><ul><li>Supervised learning – classification </li></ul></ul></ul><ul><ul><ul><li>Unsupervised learning – clustering </li></ul></ul></ul><ul><li>Cyberinfrastructure for data mining </li></ul><ul><ul><li>SDSC resources – hardware and software </li></ul></ul><ul><li>Survey of Applications at SKIDL </li></ul><ul><li>Break </li></ul><ul><li>Hands on tutorial with IBM Intelligent Miner and SKIDLkit </li></ul><ul><ul><li>Targeted Marketing </li></ul></ul><ul><ul><li>Microarray analysis (leukemia dataset) </li></ul></ul>
    3. 3. Data Mining Definition <ul><li>The search for interesting patterns and models, </li></ul><ul><li>in large data collections, </li></ul><ul><li>using statistical and machine learning methods, </li></ul><ul><li>and high-performance computational infrastructure. </li></ul>Key point: applications are data-driven and compute-intensive
    4. 4. Analysis Levels and Infrastructure <ul><li>Informal methods – graphs, plots, visualizations, exploratory data analysis (yes – Excel is a data mining tool) </li></ul><ul><li>Advanced query processing and OLAP – e.g., National Virtual Observatory (NVO) </li></ul><ul><li>Machine learning (compute-intensive statistical methods) </li></ul><ul><ul><li>Supervised – classification, prediction </li></ul></ul><ul><ul><li>Unsupervised – clustering </li></ul></ul><ul><li>Computational infrastructure needed at all levels – collections management, information integration, high-performance database systems, web services, grid services, scientific workflows, the global IT grid, observing systems </li></ul>
    5. 5. The Case for Data Mining: Data Reality <ul><li>Deluge from new sources </li></ul><ul><ul><li>Remote sensing </li></ul></ul><ul><ul><li>Microarray processing </li></ul></ul><ul><ul><li>Wireless communication </li></ul></ul><ul><ul><li>Simulation models </li></ul></ul><ul><ul><li>Instrumentation – microscopes, telescopes </li></ul></ul><ul><ul><li>Digital publishing </li></ul></ul><ul><ul><li>Federation of collections </li></ul></ul><ul><li>“ 5 exabytes (5 million terabytes) of new information was created in 2002” (source: UC Berkeley researchers Peter Lyman and Hal Varian) </li></ul><ul><li>This is the result of a recent paradigm shift: from hypothesis-driven data collection to data mining </li></ul><ul><li>Data destination: Legacy archives and independent collection activities </li></ul>
    6. 6. Knowledge Discovery Process Collection Processing/Cleansing/Corrections Analysis/Modeling Presentation/Visualization Application/Decision Support Management/Federation/Warehousing Data Knowledge “ Data is not information; information is not knowledge; knowledge is not wisdom.” Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, July 2004.
    7. 7. Characteristics of Data Mining Applications <ul><li>Data : </li></ul><ul><ul><li>Lots of data, numerous sources </li></ul></ul><ul><ul><li>Noisy – missing values, outliers, interference </li></ul></ul><ul><ul><li>Heterogeneous – mixed types, mixed media </li></ul></ul><ul><ul><li>Complex – scale, resolution, temporal, spatial dimensions </li></ul></ul><ul><li>Relatively little domain theory , few quantitative causal models </li></ul><ul><li>Lack of valid ground truth </li></ul><ul><li>Advice: don’t choose problems that have all these characteristics … </li></ul>
    8. 8. Scientific vs. Commercial Data Mining <ul><li>Goals: </li></ul><ul><ul><li>Science – Theories: Need for insight and theory-based models, interpretable model structures, generate domain rules or causal structures, support for theory development </li></ul></ul><ul><ul><li>Commercial – Profits: black boxes OK </li></ul></ul><ul><li>Types of data: </li></ul><ul><ul><li>Science – Images, sensors, simulations </li></ul></ul><ul><ul><li>Commercial - Transaction data </li></ul></ul><ul><ul><li>Both - Spatial and temporal dimensions, heterogeneous </li></ul></ul><ul><li>Trend – Common IT (information technology) tools fit both enterprises </li></ul><ul><ul><li>Database systems (Oracle, DB2, etc), integration tools (Information Integrator), web services (Blue Titan, .NET) </li></ul></ul><ul><ul><li>This is good! </li></ul></ul>
    9. 9. Introduction to Machine Learning <ul><li>Basic machine learning theory </li></ul><ul><li>Concepts and feature vectors </li></ul><ul><li>Supervised and unsupervised learning </li></ul><ul><li>Model development </li></ul><ul><ul><li>training and testing methodology, </li></ul></ul><ul><ul><li>model validation, </li></ul></ul><ul><ul><li>overfitting </li></ul></ul><ul><ul><li>confusion matrices </li></ul></ul><ul><li>Survey of algorithms </li></ul><ul><ul><li>Decision Trees classification </li></ul></ul><ul><ul><li>k-means clustering </li></ul></ul><ul><ul><li>Hierarchical clustering </li></ul></ul><ul><ul><li>Bayesian networks and probabilistic inference </li></ul></ul><ul><ul><li>Support vector machines </li></ul></ul>
    10. 10. Basic Machine Learning Theory <ul><li>Basic inductive learning hypothesis: </li></ul><ul><ul><li>Having a large number of observations, we can approximate the rule that describes how the data was generated, and thus generate a model (using some algorithm) </li></ul></ul><ul><li>No Free Lunch Theorem : </li></ul><ul><ul><li>There is no ultimate algorithm: In the absence of prior information about the problem, there are no reasons to prefer one learning algorithm over another. </li></ul></ul><ul><li>Conclusion : </li></ul><ul><ul><li>There is no problem-independent “best” learning system. Formal theory and algorithms are not enough. </li></ul></ul><ul><ul><li>Machine learning is an empirical subject. </li></ul></ul>
    11. 11. Concepts are described as feature vectors <ul><li>Example: vehicles </li></ul><ul><ul><li>Has wheels </li></ul></ul><ul><ul><li>Runs on gasoline </li></ul></ul><ul><ul><li>Carries people </li></ul></ul><ul><ul><li>Flies </li></ul></ul><ul><ul><li>Weighs less than 500 pounds </li></ul></ul><ul><li>Boolean feature vectors for vehicles </li></ul><ul><ul><li>car254 [ 1 1 1 0 0 ] </li></ul></ul><ul><ul><li>motorcycle14 [ 1 1 1 0 1 ] </li></ul></ul><ul><ul><li>airplane132 [ 1 1 1 1 0 ] </li></ul></ul>
    12. 12. <ul><li>Easy to generalize to complex data types: </li></ul><ul><ul><li>Number of wheels </li></ul></ul><ul><ul><li>Fuel type </li></ul></ul><ul><ul><li>Carrying capacity </li></ul></ul><ul><ul><li>Flies </li></ul></ul><ul><ul><li>Weight </li></ul></ul><ul><ul><li>car254 [ 4, gas, 6, 0, 2000 ] </li></ul></ul><ul><ul><li>motorcycle14 [ 2, gas, 2, 0, 400 ] </li></ul></ul><ul><ul><li>airplane132 [ 10, jetfuel, 110, 1, 35000 ] </li></ul></ul><ul><li>Most machine learning algorithms expect feature vectors, stored in text files or databases </li></ul><ul><li>Suggestions: </li></ul><ul><ul><li>Identify the target concept </li></ul></ul><ul><ul><li>Organize your data to fit feature vector representation </li></ul></ul><ul><ul><li>Design your database schemas to support generation of data in this format </li></ul></ul>
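The mixed-type vehicle vectors above can be stored as records and written to a text file, as the slide suggests. This is a minimal sketch: the `Vehicle` type, its field names, and the `to_csv` helper are illustrative assumptions, and only the values come from the slide.

```python
import csv
import io
from dataclasses import dataclass, astuple, fields


@dataclass(frozen=True)
class Vehicle:
    # Field names are hypothetical; the slide lists only feature descriptions.
    name: str
    wheels: int
    fuel: str
    capacity: int
    flies: int      # 1 = flies, 0 = does not fly
    weight: int     # unit is not stated on this slide


VEHICLES = [
    Vehicle("car254", 4, "gas", 6, 0, 2000),
    Vehicle("motorcycle14", 2, "gas", 2, 0, 400),
    Vehicle("airplane132", 10, "jetfuel", 110, 1, 35000),
]


def to_csv(rows):
    """Serialize feature vectors as CSV text, one vector per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([f.name for f in fields(Vehicle)])  # header row
    for row in rows:
        writer.writerow(astuple(row))
    return buf.getvalue()


csv_text = to_csv(VEHICLES)
```

One row per example and one column per feature is the layout most learning tools expect, which is why the slide recommends designing database schemas around it.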
    13. 13. Supervised vs. Unsupervised Learning <ul><li>Supervised – Each feature vector belongs to a class (label). Labels are given externally, and algorithms learn to predict the label of new samples/observations. </li></ul><ul><li>Unsupervised – Finds structure in the data, by clustering similar elements together. No previous knowledge of classes needed. </li></ul>
    14. 14. Model development <ul><li>Model validation </li></ul><ul><ul><li>Hold-out validation (2/3, 1/3 splits) </li></ul></ul><ul><ul><li>Cross validation, simple and n-fold (reuse) </li></ul></ul><ul><ul><li>Bootstrap validation (sample with replacement) </li></ul></ul><ul><ul><li>Jackknife validation (leave one out) </li></ul></ul><ul><ul><li>When possible hide a subset of the data until train-test is complete. </li></ul></ul>Train Test Apply Training and testing
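The validation schemes listed above differ mainly in how the data is partitioned. The sketch below shows one possible way to generate the partitions in plain Python; the function names and the fixed seeds are assumptions made for illustration, not part of the tutorial.

```python
import random


def holdout_split(items, train_fraction=2 / 3, seed=0):
    """Hold-out validation: shuffle once, then cut into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(items, k):
    """n-fold cross validation: every item lands in exactly one test fold."""
    items = list(items)
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def bootstrap_sample(items, seed=0):
    """Bootstrap validation: draw len(items) items with replacement."""
    items = list(items)
    rng = random.Random(seed)
    return [rng.choice(items) for _ in items]


data = list(range(9))
train, test = holdout_split(data)
folds = list(k_fold_splits(data, 3))
leave_one_out = list(k_fold_splits(data, len(data)))  # jackknife
```

Leave-one-out is simply the case where the number of folds equals the number of samples, so it reuses `k_fold_splits`.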
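    15. 15. Avoid overfitting (figure: Train and Test curves, with the Overfitting region and the Optimal Depth marked)
    16. 16. Avoid overfitting (figure: Train and Test curves, with the Overfitting region and the Optimal Depth marked)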
    17. 17. Confusion matrices <ul><li>Confusion matrix (rows = actual, columns = predicted) </li></ul><ul><ul><li>Actual negative: 124 predicted negative, 15 predicted positive </li></ul></ul><ul><ul><li>Actual positive: 8 predicted negative, 84 predicted positive </li></ul></ul><ul><li>Accuracy = (124 + 84) / (124 + 15 + 8 + 84): “proportion of predictions correct” </li></ul><ul><li>True positive rate = 84 / (8 + 84): “proportion of positive cases correctly identified” </li></ul><ul><li>False positive rate = 15 / (124 + 15): “proportion of negative cases incorrectly classified as positive” </li></ul><ul><li>True negative rate = 124 / (124 + 15): “proportion of negative cases correctly identified” </li></ul><ul><li>False negative rate = 8 / (8 + 84): “proportion of positive cases incorrectly classified as negative” </li></ul><ul><li>Precision = 84 / (15 + 84): “proportion of predicted positive cases that were correct” </li></ul>
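The rates on this slide all follow from the four cells of the confusion matrix. A small sketch that recomputes them from the slide's counts (84 true positives, 8 false negatives, 15 false positives, 124 true negatives); the function name and dictionary keys are illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Derive the standard rates from the four confusion-matrix cells."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
    }


metrics = binary_metrics(tp=84, fn=8, fp=15, tn=124)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the true positive and false negative rates sum to 1, as do the true negative and false positive rates, so only one of each pair needs to be reported.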
    18. 18. Classification – Decision Tree Annual Precipitation Ecosystem 63 Prairie 116 Forest 5 Desert 104 Forest 120 Forest 2 Desert
    19. 19. Precipitation > 63? YES NO 63 Prairie 116 Forest 5 Desert 104 Forest 120 Forest 2 Desert 104 Forest 116 Forest 120 Forest 2 Desert 63 Prairie 5 Desert
    20. 20. Precipitation > 5? Precipitation > 63? YES NO NO YES 63 Prairie 116 Forest 5 Desert 104 Forest 120 Forest 2 Desert 104 Forest 116 Forest 120 Forest 63 Prairie 2 Desert 63 Prairie 5 Desert 2 Desert 5 Desert
    21. 21. If (Precip > 63 ) then “Forest” else If (Precip > 5) then “Prairie” else “Desert” Classification accuracy on training data is 100% D F P F D P Actual Learned Model Predicted Confusion matrix 63 Prairie 116 Forest 5 Desert 104 Forest 120 Forest 2 Desert 1 0 0 0 3 0 0 0 2
    22. 22. Testing Set Results IF(Precip > 63 ) then Forest Else If (Precip > 5) then Prairie Else Desert Learned Model Test Data Result: Accuracy 67% Model shows overfitting, generalizes poorly True Predicted D F P F D P Actual Predicted Confusion matrix 72 Prairie 116 Forest 4 Desert 55 Prairie 100 Forest 8 Desert Forest Forest Desert Prairie Forest Prairie 1 1 0 0 2 0 1 0 1
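The learned rule and both accuracy figures can be checked directly. The sketch below encodes the tree from the slides as a function and scores it on the training and test sets shown; the function and variable names are illustrative.

```python
def classify(precip):
    """The unpruned tree learned from the training data."""
    if precip > 63:
        return "Forest"
    if precip > 5:
        return "Prairie"
    return "Desert"


TRAIN = [(63, "Prairie"), (116, "Forest"), (5, "Desert"),
         (104, "Forest"), (120, "Forest"), (2, "Desert")]
TEST = [(72, "Prairie"), (116, "Forest"), (4, "Desert"),
        (55, "Prairie"), (100, "Forest"), (8, "Desert")]


def accuracy(data):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for precip, label in data if classify(precip) == label)
    return correct / len(data)


train_accuracy = accuracy(TRAIN)   # every training example is classified correctly
test_accuracy = accuracy(TEST)     # 72 -> Forest and 8 -> Prairie are misclassified
predictions = [classify(precip) for precip, _ in TEST]
```

The two test errors sit right next to the thresholds 63 and 5, which were fitted exactly to the six training points. That is the overfitting the slide refers to.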
    23. 23. Pruning to improve generalization Pruned Decision Tree Precipitation < 60? IF(Precip < 60 ) then Desert Else, [P(Forest) = .75] & [P(Prairie) = .25] 63 Prairie 116 Forest 5 Desert 104 Forest 120 Forest 2 Desert 104 Forest 63 Prairie 116 Forest 120 Forest 5 Desert 2 Desert
    24. 24. Decision Trees Summary <ul><li>Simple to understand </li></ul><ul><li>Works with mixed data types </li></ul><ul><li>Heuristic search sensitive to local minima </li></ul><ul><li>Models non-linear functions </li></ul><ul><li>Handles classification and regression </li></ul><ul><li>Many successful applications </li></ul><ul><li>Readily available tools </li></ul>
    25. 25. Overview of Clustering <ul><li>Definition: </li></ul><ul><ul><ul><li>Clustering is the discovery of classes </li></ul></ul></ul><ul><ul><ul><li>Unlabeled examples => unsupervised learning. </li></ul></ul></ul><ul><li>Survey of Applications </li></ul><ul><ul><ul><li>Grouping of web-visit data, clustering of genes according to their expression values, grouping of customers into distinct profiles, </li></ul></ul></ul><ul><li>Survey of Methods </li></ul><ul><ul><ul><li>k-means clustering </li></ul></ul></ul><ul><ul><ul><li>Hierarchical clustering </li></ul></ul></ul><ul><ul><ul><li>Expectation Maximization (EM) algorithm </li></ul></ul></ul><ul><ul><ul><li>Gaussian mixture modeling </li></ul></ul></ul><ul><li>Cluster analysis </li></ul><ul><ul><li>Concept (class) discovery </li></ul></ul><ul><ul><li>Data compression/summarization </li></ul></ul><ul><ul><li>Bootstrapping knowledge </li></ul></ul>
    26. 26. Clustering – k-Means Precipitation Temperature 49 32 76 17 45 49 63 62 70 71 81 8
    27. 27. Clustering – k-Means
    28. 28. Clustering – k-Means
    29. 29. Clustering – k-Means
    30. 30. Clustering – k-Means
    31. 31. Clustering – k-Means
    32. 32. Clustering – k-Means
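    33. 33. Clustering – k-Means <ul><li>C1: precipitation 0 – 25, temperature 70 – 85 </li></ul><ul><li>C2: precipitation 25 – 55, temperature 35 – 60 </li></ul><ul><li>C3: precipitation 50 – 80, temperature 50 – 80 </li></ul>
    34. 34. Clustering – k-Means <ul><li>C1: precipitation 0 – 25, temperature 70 – 85 </li></ul><ul><li>C2: precipitation 25 – 55, temperature 35 – 60 </li></ul><ul><li>C3: precipitation 50 – 80, temperature 50 – 80 </li></ul>
    35. 35. Clustering – k-Means <ul><li>C1 (Desert): precipitation 0 – 25, temperature 70 – 85 </li></ul><ul><li>C2 (Prairie): precipitation 25 – 55, temperature 35 – 60 </li></ul><ul><li>C3 (Forest): precipitation 50 – 80, temperature 50 – 80 </li></ul>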
    36. 36. Using k-means <ul><li>Requires a priori knowledge of ‘k’ </li></ul><ul><li>The final outcome depends on the initial choice of the k means, so results can be inconsistent across runs </li></ul><ul><li>Sensitive to outliers, which can skew the means of their clusters </li></ul><ul><li>Favors spherical clusters – clusters may not match domain boundaries </li></ul><ul><li>Requires real-valued features </li></ul>
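For reference, the basic k-means loop (assign each point to its nearest mean, recompute the means, repeat until nothing changes) fits in a few lines. This is a sketch of the standard Lloyd iteration, not the implementation used in the tutorial's tools; the sample points are made up for illustration, and the initial means are passed in explicitly because the result depends on them.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each mean to the average of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]      # keep the mean of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical (precipitation, temperature) points, not the slide's data.
points = [(5, 80), (10, 75), (40, 45), (45, 50), (70, 60), (75, 65)]
centroids, clusters = kmeans(points, centroids=[(5, 80), (40, 45), (70, 60)])
```

Because each mean is a plain average, a single extreme point pulls its cluster's mean toward it, which is the outlier sensitivity noted on the slide.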
    37. 37. Cyberinfrastructure for Data Mining <ul><li>Resources – hardware and software (analysis tools and middleware) </li></ul><ul><li>Policies – allocating resources to the scientific community. Challenges to the traditional supercomputer model. Requirements for interactive and real-time analysis resources. </li></ul>
    38. 38. NSF TeraGrid Building Integrated National CyberInfrastructure <ul><li>Prototype for CyberInfrastructure </li></ul><ul><ul><li>Ubiquitous computational resources </li></ul></ul><ul><ul><li>Plug-in compatibility </li></ul></ul><ul><li>National Reach: </li></ul><ul><ul><li>SDSC, NCSA, CIT, ANL, PSC </li></ul></ul><ul><li>High Performance Network: </li></ul><ul><ul><li>40 Gb/s backbone, 30 Gb/s to each site </li></ul></ul><ul><li>Over 20 Teraflops compute power </li></ul><ul><li>Over 1PB Online Storage </li></ul><ul><li>8.9PB Archival Storage </li></ul>
    39. 39. SDSC is Data-Intensive Center
    40. 40. SDSC is Data-Intensive Center
    41. 41. SDSC Machine Room Data Architecture <ul><li>Philosophy: enable SDSC configuration to serve the grid as Data Center </li></ul><ul><li>.5 PB disk </li></ul><ul><li>6 PB archive </li></ul><ul><li>1 GB/s disk-to-tape </li></ul><ul><li>Optimized support for DB2 /Oracle </li></ul>Blue Horizon HPSS LAN (multiple GbE, TCP/IP) SAN (2 Gb/s, SCSI) Linux Cluster, 4TF Sun F15K WAN (30 Gb/s) SCSI/IP or FC/IP FC Disk Cache (400 TB) FC GPFS Disk (100TB) 200 MB/s per controller Silos and Tape, 6 PB, 1 GB/sec disk to tape 32 tape drives 30 MB/s per drive Database Engine Data Miner Vis Engine Local Disk (50TB) Power 4 Power 4 DB Blue Horizon: 1152 processor IBM SP, 1.7 Teraflops HPSS: over 600 TB data stored
    42. 42. SDSC IBM Regatta - DataStar <ul><li>100+ TB Disk </li></ul><ul><li>Numerous fast CPUs </li></ul><ul><li>64 GB of RAM per node </li></ul><ul><li>DB2 v8.x ESE </li></ul><ul><li>IBM Intelligent Miner </li></ul><ul><li>SAS Enterprise Miner </li></ul><ul><li>Platform for high-performance database, data mining, comparative IT studies … </li></ul>
    43. 43. Data Mining Tools used at SDSC <ul><li>SAS Enterprise Miner (Protein crystallization - JCSG) </li></ul><ul><li>IBM Intelligent Miner (Protein crystallization - JCSG, Corn Yield – Michigan State University, Security logs - SDSC) </li></ul><ul><li>CART (Protein crystallization - JCSG) </li></ul><ul><li>Matlab SVM package (TeraBridge health monitoring – UCSD Structural Engineering Department, North Temperate Lakes Monitoring - LTER) </li></ul><ul><li>PyML (Text Mining – NSDL, Hyperspectral data - LTER) </li></ul><ul><li>SKIDLkit by SDSC (Microarray analysis – UCSD Cancer Center, Hyperspectral data - LTER) </li></ul><ul><li>SVMlight (Hyperspectral data, LTER) </li></ul><ul><li>LSI by Telcordia (Text Mining – NSDL) </li></ul><ul><li>CoClustering by Fair Isaac (Text Mining – NSDL) </li></ul><ul><li>Matlab Bayes Net package </li></ul><ul><li>WEKA </li></ul>
    44. 44. SKIDLkit <ul><li>Toolkit for feature selection and classification </li></ul><ul><ul><li>Filter methods </li></ul></ul><ul><ul><li>Wrapper methods </li></ul></ul><ul><ul><li>Data normalization </li></ul></ul><ul><ul><li>Feature selection </li></ul></ul><ul><ul><li>Support Vector Machine & Naïve Bayesian Clustering </li></ul></ul><ul><ul><li>http://daks.sdsc.edu/skidl </li></ul></ul><ul><li>Will use it in the hands-on demo… </li></ul>
    45. 45. Survey of Applications at SDSC <ul><li>Text mining the NSDL (National Science Digital Library) collection </li></ul><ul><li>Sensor networks for bridge monitoring (with Structural Engineering Dept., UCSD) </li></ul><ul><li>Spatio-temporal Analysis of 9-1-1 Call Stream Data </li></ul><ul><li>Hyperspectral remote sensing data for groundcover classification (with Long Term Ecological Research Network - LTER) </li></ul><ul><li>Microarray analysis for tumor detection (with UCSD Cancer Center) </li></ul>
    46. 46. Application: Text Mining the National Science Digital Library (NSDL) Collection
    47. 47. Project Goal <ul><li>Assist the educators and students in finding relevant information by categorizing the materials by scientific discipline and grade level using contextual information </li></ul>General Approach Based on various metadata in the NSDL community, study the contents of the associated documents and apply machine learning algorithms
    48. 48. Source of Vocabulary <ul><li>Eisenhower National Clearinghouse </li></ul><ul><ul><li>8417 documents with labels specifying intended grade level </li></ul></ul><ul><ul><li>Documents are intended for teachers </li></ul></ul><ul><ul><li>Selected subset of about 1350 documents that could be associated with an AAAS category </li></ul></ul><ul><ul><ul><li>Kindergarten-2nd </li></ul></ul></ul><ul><ul><ul><li>3rd-5th </li></ul></ul></ul><ul><ul><ul><li>6th - 8th </li></ul></ul></ul><ul><ul><ul><li>9th - 12th </li></ul></ul></ul>
    49. 49. Processing <ul><li>Identify the words used in the kindergarten-2nd grade levels by the teachers </li></ul><ul><li>Identify the new words used in each of the AAAS categories </li></ul><ul><li>Characterize the growth of the vocabulary </li></ul><ul><li>Characterize the complexity of the new terms (number of words from prior grade levels used to explain the new word). </li></ul>
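    50. 50. Characterization of Learning <ul><li>Kindergarten-2nd: 150 documents, 2907 total words, complexity 1 </li></ul><ul><li>3rd-5th: 220 documents, 4155 total words, 30% new words, complexity 3 </li></ul><ul><li>6th-8th: 430 documents, 6681 total words, 37% new words, complexity 5 </li></ul><ul><li>9th-12th: 540 documents, 10226 total words, 35% new words, complexity 10 </li></ul>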
    51. 51. Characterization of Learning <ul><li>About 33% more words are learned in each AAAS category </li></ul><ul><ul><li>This is exponential growth and must eventually saturate </li></ul></ul><ul><li>Complexity grows by about a factor of 2 per AAAS category </li></ul><ul><ul><li>In later grades, it takes more of your old vocabulary to interpret new words </li></ul></ul>
    52. 52. Text Mining the NSDL Variously Formatted Documents Strip Formatting Pick out content words using “ stop lists” Stemming Discard words that appear in every document or only one Word count, Term Weighting Generate Term Document Matrix Query: for a list of words, get docs with highest score Various Retrieval Schemes (LSI, Classification, or clustering modules) Processing pipeline
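The pipeline above can be sketched end to end on a toy corpus. Everything below is an illustrative stand-in: the stop list is tiny, the stemmer is a crude suffix stripper rather than a real stemming algorithm, raw counts replace term weighting, and the query simply sums counts. The function names and sample documents are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "are"}  # toy stop list


def stem(word):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    text = re.sub(r"<[^>]+>", " ", text)           # strip formatting
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def term_document_matrix(docs):
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    # Discard terms that appear in every document or in only one.
    vocab = sorted(t for t, df in doc_freq.items() if 1 < df < len(docs))
    matrix = [[c[t] for c in counts] for t in vocab]  # rows = terms, columns = docs
    return vocab, matrix


def query(words, vocab, matrix, n_docs):
    """Rank documents by the summed counts of the query terms."""
    stems = [stem(w.lower()) for w in words]
    rows = [matrix[vocab.index(s)] for s in stems if s in vocab]
    scores = [sum(row[d] for row in rows) for d in range(n_docs)]
    return sorted(range(n_docs), key=lambda d: -scores[d])


docs = ["<p>Plants need light and water</p>", "Plants and animals need water",
        "Animals need food", "Light travels in waves"]
vocab, matrix = term_document_matrix(docs)
ranking = query(["plants", "water"], vocab, matrix, len(docs))
```

The retrieval modules named on the slide (LSI, classification, clustering) would all start from the same term-document matrix.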
    53. 53. Application: Sensor Stream Mining
    54. 54. Sensor Networks for Bridge Monitoring <ul><li>Task: </li></ul><ul><ul><li>Identify which pier is damaged based on the data stream fed by the sensors at the span middles. </li></ul></ul><ul><ul><li>Apply multi-resolution technique </li></ul></ul><ul><li>Assumption: </li></ul><ul><ul><li>The lower end of a pier can be damaged (location of plastic hinge) </li></ul></ul><ul><ul><li>There is only one damaged pier at a time. </li></ul></ul>Sensors pier span middle
    55. 56. Application: Spatiotemporal Analysis of 9-1-1 Call Stream Data
    56. 57. Project Goal <ul><li>Perform spatiotemporal analysis on 9-1-1 call data to improve: </li></ul><ul><ul><li>Overall emergency planning </li></ul></ul><ul><ul><li>Real-time emergency decision support </li></ul></ul>General Approach Correlate call data “signatures” (unusual spatiotemporal trends) with State-wide and local events: - earthquakes, forest fires, weather events
    57. 58. Study Area and Dates: San Francisco Bay Area, April 2005 San Francisco Area
    58. 59. First Analysis: “Call Rhythm”
    59. 60. Application: Classification of Land Types Using Hyperspectral Data
    60. 61. Study Area New Mexico Sevilleta National Wildlife Refuge Study Area New Mexico
    61. 62. Previously Available Image/Map Types Relief Shaded Map Landsat Image
    62. 63. New image type: NASA’s JPL (Jet Propulsion Lab) Aviris (Airborne Visible/Infrared Imaging Spectrometer) scans, “hyperspectral images” Scanned from an altitude of 20km, 10km flightline 201 bands of electromagnetic information per pixel, spanning infrared to ultraviolet
    63. 64. Complete Aviris scan of the Sevilleta Wildlife refuge, 20m per pixel Hyperspectral Scans for Study Area Study Area…
    64. 66. Data set
    65. 67. Results Support Vector Machine, one-against-one, wavelet transformation: 97.1 % accuracy on test data
    66. 68. Application: Microarray Analysis for Tumor Detection
    67. 69. Microarray Analysis for Tumor Detection <ul><li>Characteristics of the Data: </li></ul><ul><ul><li>88 prostate tissue samples: </li></ul></ul><ul><ul><ul><li>37 labeled “no tumor”, </li></ul></ul></ul><ul><ul><ul><li>51 labeled “tumor” </li></ul></ul></ul><ul><ul><li>Each tissue with 10,600 gene expression measurements </li></ul></ul><ul><ul><li>Collected by the UCSD Cancer Center, analyzed at SDSC </li></ul></ul><ul><li>Tasks: </li></ul><ul><ul><li>Build model to classify new, unseen tissues as either “no tumor” or “tumor” </li></ul></ul><ul><ul><li>Identify key genes to determine their biological significance in the process of cancer </li></ul></ul>
    68. 70. No Tumor Tumor Simple classifier based on expression levels for two genes
    69. 71. Results
    70. 72. Break
    71. 73. Hands-on Analysis <ul><li>Part I: </li></ul><ul><ul><li>Decision Tree classification using IBM Intelligent Miner </li></ul></ul><ul><ul><li>Using classification models to make rational decisions </li></ul></ul><ul><ul><li>Peter Shin </li></ul></ul><ul><li>Part II: </li></ul><ul><ul><li>Feature selection, Naïve Bayes Classifiers and Support Vector Machines using SKIDLkit </li></ul></ul><ul><ul><li>Classification of microarray data </li></ul></ul><ul><ul><li>Hector Jasso </li></ul></ul>
    72. 74. Data Mining Example: Targeting Customers <ul><li>Problem Characteristics: </li></ul><ul><ul><ul><li>1. We make $50 profit on a sale of $200 shoes. </li></ul></ul></ul><ul><ul><ul><li>2. A preliminary study shows that people who make over $50k will buy the shoes at a rate of 5% when they receive the brochure. </li></ul></ul></ul><ul><ul><ul><li>3. People who make less than $50k will buy the shoes at a rate of 1% when they receive the brochure. </li></ul></ul></ul><ul><ul><ul><li>4. It costs $1 to send a brochure to a potential customer. </li></ul></ul></ul><ul><ul><ul><li>5. In general, we do not know whether a person will make more than $50k or not. </li></ul></ul></ul>
    73. 75. Available Information <ul><li>Variable Description </li></ul><ul><ul><ul><li>Please refer to the hand-out. </li></ul></ul></ul>
    74. 76. Possible Marketing Plans <ul><li>We will send out 30,000 brochures. </li></ul><ul><li>Plan A: Ignore data and randomly send brochures </li></ul><ul><li>Plan B: Use data mining to target a specific group with high probabilities of responding </li></ul>
    75. 77. Plan A <ul><li>Strategy: </li></ul><ul><ul><ul><li>Send brochures to anyone </li></ul></ul></ul><ul><li>Cost of sending one brochure = $1 </li></ul><ul><li>Probability of Response </li></ul><ul><ul><ul><li>1% of the population who make <= $50k ( 76% ) </li></ul></ul></ul><ul><ul><ul><li>5% of the population who make > $50k ( 24% ) </li></ul></ul></ul><ul><ul><ul><li>Resulting in: </li></ul></ul></ul><ul><ul><ul><li>( 1% * 76% + 5% * 24% ) = 1.96% final response rate </li></ul></ul></ul><ul><li>Earnings </li></ul><ul><ul><ul><li>Expected profit from one brochure = (Probability of response * profit – Cost of a brochure) </li></ul></ul></ul><ul><ul><ul><li>(1.96% * $50 - $1) = -$0.02 </li></ul></ul></ul><ul><ul><ul><li>Expected Earning = Expected profit from one brochure * number of brochures sent </li></ul></ul></ul><ul><ul><ul><li>-$0.02 * 30000 = -$600 </li></ul></ul></ul>
    76. 78. Plan B <ul><li>Strategy: </li></ul><ul><ul><ul><li>Send brochures only to: married, college or above, managerial/professional/sales/tech. support/protective service/armed forces, age >= 28.5, hours_per_week >= 31 </li></ul></ul></ul><ul><li>Cost of sending one brochure = $1 </li></ul><ul><li>Probability of Response </li></ul><ul><ul><ul><li>1% of the population who make <= $50k ( 20.6% ) </li></ul></ul></ul><ul><ul><ul><li>5% of the population who make > $50k ( 79.4% ) </li></ul></ul></ul><ul><ul><ul><li>Resulting in: </li></ul></ul></ul><ul><ul><ul><li>( 1% * 20.6% + 5% * 79.4% ) = 4.176% final response rate </li></ul></ul></ul><ul><li>Earnings </li></ul><ul><ul><ul><li>Expected profit from one brochure = (Probability of response * profit – Cost of a brochure) </li></ul></ul></ul><ul><ul><ul><li>(4.176% * $50 - $1) = $1.088 </li></ul></ul></ul><ul><ul><ul><li>Expected Earning = Expected profit from one brochure * number of brochures sent </li></ul></ul></ul><ul><ul><ul><li>$1.088 * 30000 = $32,640 </li></ul></ul></ul>
    77. 79. Comparison of Two Plans <ul><li>Expected earning from plan A </li></ul><ul><ul><li>-$600 </li></ul></ul><ul><li>Expected earning from plan B </li></ul><ul><ul><li>$32,640 </li></ul></ul><ul><li>Net Difference </li></ul><ul><ul><li>$32,640 – (-$600) = $33,240 </li></ul></ul>
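The two plans differ only in the share of recipients who earn more than $50k, so the whole comparison reduces to one formula. A short sketch that reproduces the figures above; the function and parameter names are illustrative, while the rates, profit, cost, and shares are taken from the slides.

```python
def expected_earnings(share_high_income, n_brochures=30_000,
                      rate_high=0.05, rate_low=0.01, profit=50.0, cost=1.0):
    """Return (response rate, expected earnings) for a mailing.

    share_high_income is the fraction of recipients earning more than $50k.
    """
    response_rate = (rate_high * share_high_income
                     + rate_low * (1 - share_high_income))
    profit_per_brochure = response_rate * profit - cost
    return response_rate, profit_per_brochure * n_brochures


rate_a, earnings_a = expected_earnings(0.24)    # Plan A: mail at random
rate_b, earnings_b = expected_earnings(0.794)   # Plan B: mail the targeted group
difference = earnings_b - earnings_a
```

Setting the profit per brochure to zero gives the break-even response rate of $1 / $50 = 2%, which explains why Plan A at 1.96% loses money while Plan B at 4.176% does not.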
    78. 80. Acknowledgements <ul><li>Original source Census Bureau (1994) </li></ul><ul><li>Data processed and donated by Ron Kohavi and Barry Becker (Data Mining and Visualization, SGI) </li></ul>
    79. 81. Data Mining Example: Microarray Analysis “Labeled” cases (38 bone marrow samples: 27 ALL, 11 AML; each contains 7129 gene expression values) Train model (using Neural Networks, Support Vector Machines, Bayesian nets, etc.) Model 34 New unlabeled bone marrow samples AML/ALL key genes
    80. 82. <ul><li>Few samples for analysis (38 labeled) </li></ul><ul><li>Extremely high-dimensional data (7129 gene expression values per sample) </li></ul><ul><li>Noisy data </li></ul><ul><li>Complex underlying mechanisms, not fully understood </li></ul>Microarray Data Challenges to Machine Learning Algorithms:
    81. 83. Some genes are more useful than others for building classification models Example: genes 36569_at and 36495_at are useful
    82. 84. Some genes are more useful than others for building classification models Example: genes 36569_at and 36495_at are useful AML ALL
    83. 85. Some genes are more useful than others for building classification models Example: genes 37176_at and 36563_at not useful
    84. 86. Importance of Feature (Gene) Selection <ul><li>Majority of genes are not directly related to leukemia </li></ul><ul><li>Having a large number of features enhances the model’s flexibility, but makes it prone to overfitting </li></ul><ul><li>Noise and the small number of training samples make this even more likely </li></ul><ul><li>Some types of models, like neural networks, do not scale well with many features </li></ul>
    85. 87. <ul><li>Distance metrics to capture class separation </li></ul><ul><li>Rank genes according to distance metric score </li></ul><ul><li>Choose the top n ranked genes </li></ul>With 7129 genes, how do we choose the best? HIGH score LOW score
    86. 88. Distance Metrics <ul><li>Tamayo’s Relative Class Separation </li></ul><ul><li>t -test </li></ul><ul><li>Bhattacharyya distance </li></ul>
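The filter approach (score each gene, rank, keep the top n) can be sketched with the t-test as the distance metric. The slides do not say which t-test variant is used, so the unequal-variance (Welch) statistic below is an assumption, as are the function names and the made-up expression values.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Absolute Welch t-statistic between two groups of expression values."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / standard_error


def top_genes(expression, labels, n):
    """Rank genes by class separation and return the n best names.

    expression maps a gene name to one value per sample; labels gives the
    class ("AML" or "ALL") of each sample in the same order.
    """
    scores = {}
    for gene, values in expression.items():
        aml = [v for v, y in zip(values, labels) if y == "AML"]
        all_ = [v for v, y in zip(values, labels) if y == "ALL"]
        scores[gene] = t_score(aml, all_)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical data: three genes measured on six samples.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
expression = {
    "gene_a": [10, 11, 12, 1, 2, 3],   # well separated between classes
    "gene_b": [5, 6, 7, 5, 7, 6],      # same mean in both classes
    "gene_c": [4, 8, 6, 5, 9, 7],      # small shift, large spread
}
selected = top_genes(expression, labels, n=2)
```

A gene whose values are constant within both classes would give a zero standard error here, so a real implementation needs a guard for that case.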
    87. 89. A gene with an undetected outlier could score artificially high Score jumps from 0.00651 to 0.042566
    88. 90. How Support Vector Machines (SVMs) work
    89. 91. How Support Vector Machines (SVMs) work
    90. 92. How Support Vector Machines (SVMs) work
    91. 93. How Support Vector Machines (SVMs) work
    92. 94. How Support Vector Machines (SVMs) work
    93. 95. How Support Vector Machines (SVMs) work
    94. 96. How Support Vector Machines (SVMs) work margin Support vectors
    95. 97. How Support Vector Machines (SVMs) work margin Support vectors
    96. 98. How Support Vector Machines (SVMs) work margin Support vectors
    97. 99. <ul><li>Scales well to high-dimensional problems </li></ul><ul><li>Fast convergence to solution </li></ul><ul><li>Has well-defined statistical properties </li></ul>Characteristics of SVMs
    98. 100. … X (Class) w 1 w 2 w 3 w n output variable input variables Naïve Bayesian Classifiers
    99. 101. <ul><li>Scales well to high-dimensional problems </li></ul><ul><li>Fast to compute </li></ul><ul><li>Based on Bayesian probability theory </li></ul>Characteristics of Naïve Bayesian Classifiers
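A naïve Bayesian classifier treats the input variables as independent given the class, so training reduces to estimating one distribution per feature per class. The sketch below assumes Gaussian features, which the slides do not specify; the function names and sample values are also illustrative.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def train(samples, labels, min_var=1e-9):
    """Estimate a class prior plus a per-feature mean and variance."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        columns = list(zip(*rows))
        model[y] = {
            "prior": len(rows) / len(samples),
            "means": [mean(c) for c in columns],
            # Floor the variance so a constant feature cannot divide by zero.
            "vars": [max(pvariance(c), min_var) for c in columns],
        }
    return model


def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(params):
        total = math.log(params["prior"])
        for value, mu, var in zip(x, params["means"], params["vars"]):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda y: log_posterior(model[y]))


# Hypothetical two-gene expression values for six labeled samples.
samples = [(1.0, 5.0), (1.2, 4.8), (0.8, 5.2), (3.0, 1.0), (3.2, 0.8), (2.8, 1.2)]
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
model = train(samples, labels)
```

Training cost grows linearly with the number of features, which is why the method is fast to compute and scales to thousands of genes.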
    100. 102. Approaches to Feature Selection Input Features Feature Selection by Distance Metric Score Train Model Feature Selection Search Feature Set Importance of features given by the model Filter Approach Wrapper Approach Input Features Model Train Model Model
    101. 103. <ul><li>Developed at SDSC: </li></ul><ul><ul><li>http://daks.sdsc.edu/skidl </li></ul></ul><ul><li>Implements: </li></ul><ul><ul><li>Filter and wrapper approaches </li></ul></ul><ul><ul><li>Naïve Bayesian Net and SVM </li></ul></ul><ul><ul><li>t -test, Prediction Strength, Bhattacharyya distance </li></ul></ul><ul><ul><li>Outlier detection </li></ul></ul>Software Available: SKIDLkit
    102. 104. <ul><li>Collected by the Whitehead Institute Center for Genome Research </li></ul><ul><li>Made available at: </li></ul><ul><ul><li>http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi , </li></ul></ul><ul><ul><li>Under “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression” </li></ul></ul><ul><ul><li>Also available as a sample dataset in SKIDLkit </li></ul></ul>Leukemia Dataset