  1. 1. Data mining and its application and usage in medicine By Radhika
  2. 2. Data Mining and Medicine <ul><li>History </li></ul><ul><ul><li>Past 20 years with relational databases </li></ul></ul><ul><ul><ul><li>More dimensions to database queries </li></ul></ul></ul><ul><ul><li>Earliest and most successful area of data mining </li></ul></ul><ul><ul><li>Mid-1800s: London hit by infectious disease </li></ul></ul><ul><ul><ul><li>Two theories </li></ul></ul></ul><ul><ul><ul><ul><li>Miasma theory: bad air propagated disease </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Germ theory: water-borne </li></ul></ul></ul></ul><ul><ul><ul><li>Advantages </li></ul></ul></ul><ul><ul><ul><ul><li>Discover trends even when we don’t understand reasons </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Caveat: can also discover irrelevant patterns that confuse rather than enlighten </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Protection against unaided human inference of patterns: provides quantifiable measures that aid human judgment </li></ul></ul></ul></ul><ul><ul><li>Data Mining </li></ul></ul><ul><ul><ul><li>Patterns persistent and meaningful </li></ul></ul></ul><ul><ul><ul><li>Knowledge Discovery of Data </li></ul></ul></ul>
  3. 3. The future of data mining <ul><li>10 biggest killers in the US </li></ul><ul><li>Data mining = Process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data </li></ul>
  4. 4. Major Issues in Medical Data Mining <ul><li>Heterogeneity of medical data </li></ul><ul><ul><li>Volume and complexity </li></ul></ul><ul><ul><li>Physician’s interpretation </li></ul></ul><ul><ul><li>Poor mathematical categorization </li></ul></ul><ul><ul><li>Canonical Form </li></ul></ul><ul><ul><li>Solution: Standard vocabularies, interfaces between different sources of data integrations, design of electronic patient records </li></ul></ul><ul><li>Ethical, Legal and Social Issues </li></ul><ul><ul><li>Data Ownership </li></ul></ul><ul><ul><li>Lawsuits </li></ul></ul><ul><ul><li>Privacy and Security of Human Data </li></ul></ul><ul><ul><li>Expected benefits </li></ul></ul><ul><ul><li>Administrative Issues </li></ul></ul>
  5. 5. Why Data Preprocessing? <ul><li>Patient records consist of clinical, lab parameters, results of particular investigations, specific to tasks </li></ul><ul><ul><li>Incomplete : lacking attribute values, lacking certain attributes of interest, or containing only aggregate data </li></ul></ul><ul><ul><li>Noisy : containing errors or outliers </li></ul></ul><ul><ul><li>Inconsistent : containing discrepancies in codes or names </li></ul></ul><ul><ul><li>Temporal chronic diseases parameters </li></ul></ul><ul><li>No quality data, no quality mining results! </li></ul><ul><ul><li>Data warehouse needs consistent integration of quality data </li></ul></ul><ul><ul><li>Medical Domain, to handle incomplete, inconsistent or noisy data, need people with domain knowledge </li></ul></ul>
  6. 6. What is Data Mining? The KDD Process <ul><li>[Figure: KDD process flow] Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge </li></ul>
  7. 7. From Tables and Spreadsheets to Data Cubes <ul><li>A data warehouse is based on a multidimensional data model that views data in the form of a data cube </li></ul><ul><li>A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions </li></ul><ul><ul><li>Dimension tables , such as item (item_name, brand, type), or time(day, week, month, quarter, year) </li></ul></ul><ul><ul><li>Fact table contains measures (such as dollars_sold) and keys to each of related dimension tables </li></ul></ul><ul><li>W. H. Inmon:“A data warehouse is a subject-oriented , integrated , time-variant , and nonvolatile collection of data in support of management’s decision-making process.” </li></ul>
  8. 8. Data Warehouse vs. Heterogeneous DBMS <ul><li>Data warehouse: update-driven, high performance </li></ul><ul><ul><li>Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis </li></ul></ul><ul><ul><li>Do not contain most current information </li></ul></ul><ul><ul><li>Query processing does not interfere with processing at local sources </li></ul></ul><ul><ul><li>Store and integrate historical information </li></ul></ul><ul><ul><li>Support complex multidimensional queries </li></ul></ul>
  9. 9. Data Warehouse vs. Operational DBMS <ul><li>OLTP (on-line transaction processing) </li></ul><ul><ul><li>Major task of traditional relational DBMS </li></ul></ul><ul><ul><li>Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. </li></ul></ul><ul><li>OLAP (on-line analytical processing) </li></ul><ul><ul><li>Major task of data warehouse system </li></ul></ul><ul><ul><li>Data analysis and decision making </li></ul></ul><ul><li>Distinct features (OLTP vs. OLAP): </li></ul><ul><ul><li>User and system orientation: customer vs. market </li></ul></ul><ul><ul><li>Data contents: current, detailed vs. historical, consolidated </li></ul></ul><ul><ul><li>Database design: ER + application vs. star + subject </li></ul></ul><ul><ul><li>View: current, local vs. evolutionary, integrated </li></ul></ul><ul><ul><li>Access patterns: update vs. read-only but complex queries </li></ul></ul>
  10. 11. Why Separate Data Warehouse? <ul><li>High performance for both systems </li></ul><ul><ul><li>DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery </li></ul></ul><ul><ul><li>Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation </li></ul></ul><ul><li>Different functions and different data: </li></ul><ul><ul><li>Missing data: Decision support requires historical data which operational DBs do not typically maintain </li></ul></ul><ul><ul><li>Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources </li></ul></ul><ul><ul><li>Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled </li></ul></ul>
  11. 14. Typical OLAP Operations <ul><li>Roll up (drill-up): summarize data </li></ul><ul><ul><li>by climbing up hierarchy or by dimension reduction </li></ul></ul><ul><li>Drill down (roll down): reverse of roll-up </li></ul><ul><ul><li>from higher level summary to lower level summary or detailed data, or introducing new dimensions </li></ul></ul><ul><li>Slice and dice: </li></ul><ul><ul><li>project and select </li></ul></ul><ul><li>Pivot (rotate): </li></ul><ul><ul><li>reorient the cube, visualization, 3D to series of 2D planes. </li></ul></ul><ul><li>Other operations </li></ul><ul><ul><li>drill across: involving (across) more than one fact table </li></ul></ul><ul><ul><li>drill through: through the bottom level of the cube to its back-end relational tables (using SQL) </li></ul></ul>
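The roll-up and slice operations listed above can be sketched on a tiny in-memory fact table. This is only an illustration under stated assumptions: the fact rows, the dimension names in `DIMS`, and the helper names `roll_up` and `slice_cube` are hypothetical and do not come from the slides or from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, quarter, city, dollars_sold).
FACTS = [
    ("aspirin", "Q1", "Boston", 100),
    ("aspirin", "Q2", "Boston", 150),
    ("insulin", "Q1", "Boston", 200),
    ("insulin", "Q1", "Chicago", 50),
]
DIMS = ("item", "quarter", "city")


def roll_up(facts, keep):
    """Roll up by dimension reduction: sum the measure over the dimensions in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_cube(facts, dim, value):
    """Slice: select the sub-cube where one dimension is fixed to a single value."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


by_item = roll_up(FACTS, ("item",))
q1_only = slice_cube(FACTS, "quarter", "Q1")
```

Drilling down is the reverse direction: calling `roll_up` with more dimensions in `keep` returns a finer-grained summary of the same facts.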
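  12. 17. Multi-Tiered Architecture <ul><li>[Figure: multi-tiered data warehouse architecture] </li></ul><ul><ul><li>Data Sources: Operational DBs, other sources </li></ul></ul><ul><ul><li>Extract, Transform, Load, Refresh: Monitor & Integrator, Metadata </li></ul></ul><ul><ul><li>Data Storage: Data Warehouse, Data Marts </li></ul></ul><ul><ul><li>OLAP Engine: OLAP Server </li></ul></ul><ul><ul><li>Front-End Tools: Analysis, Query, Reports, Data mining </li></ul></ul>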
  13. 18. Steps of a KDD Process <ul><li>Learning the application domain: </li></ul><ul><ul><li>relevant prior knowledge and goals of application </li></ul></ul><ul><li>Creating a target data set: data selection </li></ul><ul><li>Data cleaning and preprocessing: (may take 60% of effort!) </li></ul><ul><li>Data reduction and transformation: </li></ul><ul><ul><li>Find useful features, dimensionality/variable reduction, invariant representation. </li></ul></ul><ul><li>Choosing functions of data mining </li></ul><ul><ul><li>summarization, classification, regression, association, clustering. </li></ul></ul><ul><li>Choosing the mining algorithm(s) </li></ul><ul><li>Data mining: search for patterns of interest </li></ul><ul><li>Pattern evaluation and knowledge presentation </li></ul><ul><ul><li>visualization, transformation, removing redundant patterns, etc. </li></ul></ul><ul><li>Use of discovered knowledge </li></ul>
  14. 19. Common Techniques in Data Mining <ul><li>Predictive Data Mining </li></ul><ul><ul><li>Most important </li></ul></ul><ul><ul><li>Classification: Relate one set of variables in data to response variables </li></ul></ul><ul><ul><li>Regression: estimate some continuous value </li></ul></ul><ul><li>Descriptive Data Mining </li></ul><ul><ul><li>Clustering: Discovering groups of similar instances </li></ul></ul><ul><ul><li>Association rule extraction </li></ul></ul><ul><ul><ul><li>Variables/Observations </li></ul></ul></ul><ul><ul><li>Summarization of group descriptions </li></ul></ul>
  15. 20. Leukemia <ul><li>Different types of cells look very similar </li></ul><ul><li>Given a number of samples (patients) </li></ul><ul><ul><li>Can we diagnose the disease accurately? </li></ul></ul><ul><ul><li>Predict the outcome of treatment? </li></ul></ul><ul><ul><li>Recommend best treatment based on previous treatments? </li></ul></ul><ul><li>Solution: Data mining on micro-array data </li></ul><ul><li>38 training patients, 34 testing patients, ~7000 patient attributes </li></ul><ul><li>2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) </li></ul>
  16. 21. Clustering/Instance Based Learning <ul><li>Uses specific instances to perform classification rather than general IF THEN rules </li></ul><ul><li>Nearest Neighbor classifier </li></ul><ul><li>Most studied algorithms for medical purposes </li></ul><ul><li>Clustering: partitioning a data set into several groups (clusters) such that </li></ul><ul><ul><li>Homogeneity: Objects belonging to the same cluster are similar to each other </li></ul></ul><ul><ul><li>Separation: Objects belonging to different clusters are dissimilar to each other </li></ul></ul><ul><li>Three elements </li></ul><ul><ul><li>The set of objects </li></ul></ul><ul><ul><li>The set of attributes </li></ul></ul><ul><ul><li>Distance measure </li></ul></ul>
  17. 22. Measure the Dissimilarity of Objects <ul><li>Find best matching instance </li></ul><ul><li>Distance function </li></ul><ul><ul><li>Measures the dissimilarity between a pair of data objects </li></ul></ul><ul><li>Things to consider </li></ul><ul><ul><li>Usually very different for interval-scaled , boolean , nominal , ordinal and ratio-scaled variables </li></ul></ul><ul><ul><li>Weights should be associated with different variables based on the application and the data semantics </li></ul></ul><ul><li>Quality of a clustering result depends on both the distance measure adopted and its implementation </li></ul>
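  18. 23. Minkowski Distance <ul><li>Minkowski distance: a generalization, \( d(i,j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q} \) over p attributes </li></ul><ul><li>If q = 2, d is Euclidean distance </li></ul><ul><li>If q = 1, d is Manhattan distance </li></ul><ul><li>Example: X i = (1,7), X j = (7,1); the coordinate differences are 6 and 6, so d = 12 for q = 1 and d ≈ 8.48 for q = 2 </li></ul>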
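As a quick check of the example on this slide, here is a minimal Python sketch of the Minkowski distance. The function name `minkowski` is an illustrative choice, not something taken from the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


x_i, x_j = (1, 7), (7, 1)
manhattan = minkowski(x_i, x_j, 1)  # q = 1: |1-7| + |7-1| = 12
euclidean = minkowski(x_i, x_j, 2)  # q = 2: sqrt(36 + 36), about 8.48
```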
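  19. 24. Binary Variables <ul><li>A contingency table for binary data: for objects i and j, count the attributes where both are 1 (a), only i is 1 (b), only j is 1 (c), and both are 0 (d) </li></ul><ul><li>Simple matching coefficient, used as a dissimilarity: \( d(i,j) = \frac{b + c}{a + b + c + d} \) </li></ul>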
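  20. 25. Dissimilarity between Binary Variables <ul><li>Example (attributes A1 A2 A3 A4 A5 A6 A7) </li></ul><ul><ul><li>Object 1: 1 0 1 1 1 0 0 </li></ul></ul><ul><ul><li>Object 2: 1 1 1 0 0 0 1 </li></ul></ul><ul><li>Contingency table (rows: Object 1, columns: Object 2) </li></ul><ul><ul><li>Object 1 = 1: Object 2 = 1 for 2 attributes, Object 2 = 0 for 2 attributes, sum 4 </li></ul></ul><ul><ul><li>Object 1 = 0: Object 2 = 1 for 2 attributes, Object 2 = 0 for 1 attribute, sum 3 </li></ul></ul><ul><ul><li>Sum: 4, 3, total 7 </li></ul></ul>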
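The contingency counts in this example can be recomputed with a short Python sketch. It assumes the usual convention that a, b, c and d count the attributes where both objects are 1, only the first is 1, only the second is 1, and both are 0, and that the simple matching coefficient is used as the dissimilarity (b + c) / (a + b + c + d). The function names are hypothetical.

```python
def contingency(obj_i, obj_j):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1 and 0/0 attribute pairs."""
    pairs = list(zip(obj_i, obj_j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)
    b = sum(1 for x, y in pairs if x == 1 and y == 0)
    c = sum(1 for x, y in pairs if x == 0 and y == 1)
    d = sum(1 for x, y in pairs if x == 0 and y == 0)
    return a, b, c, d


def simple_matching_dissimilarity(obj_i, obj_j):
    """Fraction of attributes on which the two binary objects disagree."""
    a, b, c, d = contingency(obj_i, obj_j)
    return (b + c) / (a + b + c + d)


# Attribute values A1..A7 for the two objects in the slide's example.
OBJECT_1 = [1, 0, 1, 1, 1, 0, 0]
OBJECT_2 = [1, 1, 1, 0, 0, 0, 1]
dissimilarity = simple_matching_dissimilarity(OBJECT_1, OBJECT_2)  # 4/7
```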
  21. 26. K-means algorithm <ul><li>Initialization </li></ul><ul><ul><li>Arbitrarily choose k objects as the initial cluster centers (centroids) </li></ul></ul><ul><li>Iteration until no change </li></ul><ul><ul><li>For each object O i </li></ul></ul><ul><ul><ul><li>Calculate the distances between O i and the k centroids </li></ul></ul></ul><ul><ul><ul><li>(Re)assign O i to the cluster whose centroid is the closest to O i </li></ul></ul></ul><ul><ul><li>Update the cluster centroids based on current assignment </li></ul></ul>
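The procedure on this slide (pick k initial centroids, reassign each object to its closest centroid, update the centroids, repeat until nothing changes) is the k-means loop. A minimal sketch follows, assuming Euclidean distance and taking the first k objects as the "arbitrary" initial centroids; the sample points and the `max_iter` safety cap are hypothetical additions.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns (assignment, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k objects as seeds
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Update each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


POINTS = [(1, 1), (1, 2), (8, 8), (9, 8)]  # hypothetical 2-D objects
labels, centers = kmeans(POINTS, 2)
```

The result depends on the initial centroids, so in practice the procedure is often run from several different starting choices.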
  22. 27. k-Means Clustering Method <ul><li>[Figure: current clusters, cluster means, objects relocated, new clusters] </li></ul>
  23. 28. Dataset <ul><li>Data set from UCI repository </li></ul><ul><li>http://kdd.ics.uci.edu/ </li></ul><ul><li>768 female Pima Indians evaluated for diabetes </li></ul><ul><li>After data cleaning 392 data entries </li></ul>
  24. 29. Hierarchical Clustering <ul><li>Groups observations based on dissimilarity </li></ul><ul><li>Compacts database into “labels” that represent the observations </li></ul><ul><li>Measure of similarity/Dissimilarity </li></ul><ul><ul><li>Euclidean Distance </li></ul></ul><ul><ul><li>Manhattan Distance </li></ul></ul><ul><li>Types of Clustering </li></ul><ul><ul><li>Single Link </li></ul></ul><ul><ul><li>Average Link </li></ul></ul><ul><ul><li>Complete Link </li></ul></ul>
  25. 30. Hierarchical Clustering: Comparison <ul><li>[Figure: clusterings of six points under single-link, complete-link, average-link and centroid distance] </li></ul>
  26. 31. Compare Dendrograms <ul><li>[Figure: dendrograms over points 1 to 6 for single-link, complete-link, average-link and centroid distance] </li></ul>
  27. 32. Which Distance Measure is Better? <ul><li>Each method has both advantages and disadvantages; application-dependent </li></ul><ul><li>Single-link </li></ul><ul><ul><li>Can find irregular-shaped clusters </li></ul></ul><ul><ul><li>Sensitive to outliers </li></ul></ul><ul><li>Complete-link, Average-link, and Centroid distance </li></ul><ul><ul><li>Robust to outliers </li></ul></ul><ul><ul><li>Tend to break large clusters </li></ul></ul><ul><ul><li>Prefer spherical clusters </li></ul></ul>
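To make the difference between the linkage criteria concrete, here is a small agglomerative clustering sketch in Python. It is a naive illustration, assuming Euclidean distance between points; the function names and the one-dimensional sample points are hypothetical, and centroid distance is left out for brevity.

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under single, complete or average link."""
    dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(dists)
    if method == "complete":    # farthest pair of members
        return max(dists)
    if method == "average":     # mean over all member pairs
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage method: " + method)


def agglomerate(points, method, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


POINTS_1D = [(0,), (1,), (5,), (6,), (20,)]  # hypothetical observations
single_link = agglomerate(POINTS_1D, "single", 2)
```

In this toy data the point at 20 is an outlier and stays in its own cluster until the very end, which is the kind of late-joining observation a dendrogram makes visible.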
  28. 33. Dendrogram from dataset <ul><li>Minimum spanning tree through the observations </li></ul><ul><li>The single observation that is last to join the cluster is a patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile and BMI is in the bottom half </li></ul><ul><li>Her insulin, however, was the largest, and she is a 59-year-old diabetic </li></ul>
  29. 34. Dendrogram from dataset <ul><li>Maximum dissimilarity between observations in one cluster when compared to another </li></ul>
  30. 35. Dendrogram from dataset <ul><li>Average dissimilarity between observations in one cluster when compared to another </li></ul>
  31. 36. Supervised versus Unsupervised Learning <ul><li>Supervised learning (classification) </li></ul><ul><ul><li>Supervision: Training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations </li></ul></ul><ul><ul><li>New data is classified based on training set </li></ul></ul><ul><li>Unsupervised learning (clustering) </li></ul><ul><ul><li>Class labels of training data are unknown </li></ul></ul><ul><ul><li>Given a set of measurements, observations, etc., need to establish existence of classes or clusters in data </li></ul></ul>
  32. 37. Classification and Prediction <ul><li>Derive models that can use patient specific information, aid clinical decision making </li></ul><ul><li>A priori decision on predictors and variables to predict </li></ul><ul><li>No method to find predictors that are not present in the data </li></ul><ul><li>Numeric Response </li></ul><ul><ul><li>Least Squares Regression </li></ul></ul><ul><li>Categorical Response </li></ul><ul><ul><li>Classification trees </li></ul></ul><ul><ul><li>Neural Networks </li></ul></ul><ul><ul><li>Support Vector Machine </li></ul></ul><ul><li>Decision models </li></ul><ul><ul><li>Prognosis, Diagnosis and treatment planning </li></ul></ul><ul><ul><li>Embed in clinical information systems </li></ul></ul>
  33. 38. Least Squares Regression <ul><li>Find a linear function of the predictor variables that minimizes the sum of squared differences from the response </li></ul><ul><li>Supervised learning technique </li></ul><ul><li>Predict insulin in our dataset from glucose and BMI </li></ul>
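A least squares fit with two predictors can be sketched in plain Python by solving the normal equations. The data below are hypothetical small numbers standing in for rescaled glucose and BMI values (they are not the Pima data), and the function name `least_squares` is an illustrative choice.

```python
def least_squares(rows, response):
    """Fit response ~ b0 + b1*x1 + ... by solving the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, response))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Hypothetical predictors (x1, x2) and a response built as 1 + 2*x1 + 3*x2.
PREDICTORS = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
RESPONSE = [9, 8, 19, 18, 26]
coefficients = least_squares(PREDICTORS, RESPONSE)  # close to [1, 2, 3]
```

The sketch assumes the predictor columns are not collinear; otherwise the normal equations have no unique solution and the elimination divides by zero.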
  34. 39. Decision Trees <ul><li>Decision tree </li></ul><ul><ul><li>Each internal node tests an attribute </li></ul></ul><ul><ul><li>Each branch corresponds to attribute value </li></ul></ul><ul><ul><li>Each leaf node assigns a classification </li></ul></ul><ul><li>ID3 algorithm </li></ul><ul><ul><li>Based on training objects with known class labels to classify testing objects </li></ul></ul><ul><ul><li>Rank attributes with information gain measure </li></ul></ul><ul><ul><li>Minimal height </li></ul></ul><ul><ul><ul><li>least number of tests to classify an object </li></ul></ul></ul><ul><ul><li>Used in commercial tools eg: Clementine </li></ul></ul><ul><ul><li>ASSISTANT </li></ul></ul><ul><ul><ul><li>Deal with medical datasets </li></ul></ul></ul><ul><ul><ul><li>Incomplete data </li></ul></ul></ul><ul><ul><ul><li>Discretize continuous variables </li></ul></ul></ul><ul><ul><ul><li>Prune unreliable parts of tree </li></ul></ul></ul><ul><ul><ul><li>Classify data </li></ul></ul></ul>
  35. 40. Decision Trees
  36. 41. Algorithm for Decision Tree Induction <ul><li>Basic algorithm (a greedy algorithm) </li></ul><ul><ul><li>Attributes are categorical (if continuous-valued, they are discretized in advance) </li></ul></ul><ul><ul><li>Tree is constructed in a top-down recursive divide-and-conquer manner </li></ul></ul><ul><ul><li>At start, all training examples are at the root </li></ul></ul><ul><ul><li>Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain) </li></ul></ul><ul><ul><li>Examples are partitioned recursively based on selected attributes </li></ul></ul>
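  37. 42. Training Dataset <ul><li>Columns: Patient: Age, BMI, Hereditary, Vision, Risk of Condition X </li></ul><ul><ul><li>P1: <=30, high, no, fair, no </li></ul></ul><ul><ul><li>P2: <=30, high, no, excellent, no </li></ul></ul><ul><ul><li>P3: >40, high, no, fair, yes </li></ul></ul><ul><ul><li>P4: 31…40, medium, no, fair, yes </li></ul></ul><ul><ul><li>P5: 31…40, low, yes, fair, yes </li></ul></ul><ul><ul><li>P6: 31…40, low, yes, excellent, no </li></ul></ul><ul><ul><li>P7: >40, low, yes, excellent, yes </li></ul></ul><ul><ul><li>P8: <=30, medium, no, fair, no </li></ul></ul><ul><ul><li>P9: <=30, low, yes, fair, yes </li></ul></ul><ul><ul><li>P10: 31…40, medium, yes, fair, yes </li></ul></ul><ul><ul><li>P11: <=30, medium, yes, excellent, yes </li></ul></ul><ul><ul><li>P12: >40, medium, no, excellent, yes </li></ul></ul><ul><ul><li>P13: >40, high, yes, fair, yes </li></ul></ul><ul><ul><li>P14: 31…40, medium, no, excellent, no </li></ul></ul>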
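  38. 43. Construction of A Decision Tree for “Condition X” <ul><li>Root: [P1,…P14] Yes: 9, No: 5, split on Age? </li></ul><ul><ul><li>Age <=30: [P1,P2,P8,P9,P11] Yes: 2, No: 3, split on History </li></ul></ul><ul><ul><ul><li>History = no: [P1,P2,P8] Yes: 0, No: 3, leaf NO </li></ul></ul></ul><ul><ul><ul><li>History = yes: [P9,P11] Yes: 2, No: 0, leaf YES </li></ul></ul></ul><ul><ul><li>Age 31…40: [P4,P5,P6,P10,P14] Yes: 3, No: 2, split on Vision </li></ul></ul><ul><ul><ul><li>Vision = fair: [P4,P5,P10] Yes: 3, No: 0, leaf YES </li></ul></ul></ul><ul><ul><ul><li>Vision = excellent: [P6,P14] Yes: 0, No: 2, leaf NO </li></ul></ul></ul><ul><ul><li>Age >40: [P3,P7,P12,P13] Yes: 4, No: 0, leaf YES </li></ul></ul>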
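  39. 44. Entropy and Information Gain <ul><li>S contains s i tuples of class C i for i = {1, ..., m }, with s tuples in total </li></ul><ul><li>Information measures info required to classify any arbitrary tuple: \( I(s_1, \dots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s} \) </li></ul><ul><li>Entropy of attribute A with values {a 1 ,a 2 ,…,a v }, where s ij is the number of class C i tuples with A = a j : \( E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj}) \) </li></ul><ul><li>Information gained by branching on attribute A: \( \mathrm{Gain}(A) = I(s_1, \dots, s_m) - E(A) \) </li></ul>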
  40. 45. Entropy and Information Gain <ul><li>Select attribute with the highest information gain (or greatest entropy reduction) </li></ul><ul><ul><li>Such attribute minimizes information needed to classify samples </li></ul></ul>
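Attribute selection by information gain can be checked against the 14-patient training dataset shown earlier. The sketch below transcribes that table into Python tuples (the `"31-40"` spelling and the lowercase attribute names are my own encoding) and computes the gain of each attribute. Age comes out highest, which agrees with the tree that splits on Age at the root.

```python
import math
from collections import Counter

# (age, bmi, hereditary, vision, risk) for patients P1..P14 from the training slide.
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    (">40", "high", "no", "fair", "yes"),
    ("31-40", "medium", "no", "fair", "yes"),
    ("31-40", "low", "yes", "fair", "yes"),
    ("31-40", "low", "yes", "excellent", "no"),
    (">40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    ("31-40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    (">40", "medium", "no", "excellent", "yes"),
    (">40", "high", "yes", "fair", "yes"),
    ("31-40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "bmi": 1, "hereditary": 2, "vision": 3}


def info(labels):
    """Expected information (entropy, in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, attr):
    """Information gained by partitioning rows on the named attribute."""
    i = ATTRS[attr]
    expected = 0.0
    for value in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info([r[-1] for r in rows]) - expected


best_attribute = max(ATTRS, key=lambda a: gain(DATA, a))
```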
  41. 46. Rule Induction <ul><li>IF conditions THEN Conclusion </li></ul><ul><li>E.g. CN2 </li></ul><ul><ul><li>Concept description: </li></ul></ul><ul><ul><ul><li>Characterization : provides a concise and succinct summarization of given collection of data </li></ul></ul></ul><ul><ul><ul><li>Comparison : provides descriptions comparing two or more collections of data </li></ul></ul></ul><ul><li>Training set, testing set </li></ul><ul><li>Imprecise </li></ul><ul><li>Predictive Accuracy </li></ul><ul><ul><li>P/(P+N) </li></ul></ul>
  42. 47. Example used in a Clinic <ul><li>Hip arthroplasty: trauma surgeon predicts patient’s long-term clinical status after surgery </li></ul><ul><li>Outcome evaluated during follow-ups for 2 years </li></ul><ul><li>2 modeling techniques </li></ul><ul><ul><li>Naïve Bayesian classifier </li></ul></ul><ul><ul><li>Decision trees </li></ul></ul><ul><li>Bayesian classifier </li></ul><ul><ul><li>P(outcome=good) = 0.55 (11/20 good) </li></ul></ul><ul><ul><li>Probability gets updated as more attributes are considered </li></ul></ul><ul><ul><li>P(timing=good|outcome=good) = 9/11 (0.818) </li></ul></ul><ul><ul><li>P(outcome=bad) = 9/20, P(timing=good|outcome=bad) = 5/9 </li></ul></ul>
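The update described on this slide can be worked through with exact fractions. The sketch below takes the prior and the conditional probabilities for timing = good from the slide and applies Bayes' rule. The variable and function names are hypothetical, and only the single attribute "timing" is considered.

```python
from fractions import Fraction

# Values taken from the slide: class priors and P(timing=good | outcome).
PRIOR = {"good": Fraction(11, 20), "bad": Fraction(9, 20)}
LIKELIHOOD_TIMING_GOOD = {"good": Fraction(9, 11), "bad": Fraction(5, 9)}


def posterior(prior, likelihood):
    """Posterior class probabilities after observing one attribute value."""
    joint = {c: prior[c] * likelihood[c] for c in prior}
    evidence = sum(joint.values())  # P(timing=good), summed over both outcomes
    return {c: joint[c] / evidence for c in joint}


POSTERIOR = posterior(PRIOR, LIKELIHOOD_TIMING_GOOD)
# P(outcome=good | timing=good) = 9/14, up from the prior of 11/20
```

With more attributes, a naive Bayesian classifier multiplies in one such likelihood per attribute, relying on the assumption that attributes are independent given the outcome.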
  43. 48. Nomogram
  44. 49. Bayesian Classification <ul><li>Bayesian classifier vs. decision tree </li></ul><ul><ul><li>Decision tree: predict the class label </li></ul></ul><ul><ul><li>Bayesian classifier: statistical classifier; predict class membership probabilities </li></ul></ul><ul><li>Based on Bayes theorem ; estimate posterior probability </li></ul><ul><li>Naïve Bayesian classifier: </li></ul><ul><ul><li>Simple classifier that assumes attribute independence </li></ul></ul><ul><ul><li>High speed when applied to large databases </li></ul></ul><ul><ul><li>Comparable in performance to decision trees </li></ul></ul>
  45. 50. Bayes Theorem <ul><li>Let X be a data sample whose class label is unknown </li></ul><ul><li>Let H i be the hypothesis that X belongs to a particular class C i </li></ul><ul><li>P( H i ) is class prior probability that X belongs to a particular class C i </li></ul><ul><ul><li>Can be estimated by n i / n from training data samples </li></ul></ul><ul><ul><li>n is the total number of training data samples </li></ul></ul><ul><ul><li>n i is the number of training data samples of class C i </li></ul></ul><ul><li>Formula of Bayes Theorem: \( P(H_i \mid X) = \frac{P(X \mid H_i)\, P(H_i)}{P(X)} \) </li></ul>
  46. 51. More classification Techniques <ul><li>Neural Networks </li></ul><ul><ul><li>Similar to pattern recognition properties of biological systems </li></ul></ul><ul><ul><li>Most frequently used </li></ul></ul><ul><ul><ul><li>Multi-layer perceptrons </li></ul></ul></ul><ul><ul><ul><ul><li>Input with bias, connected by weights to hidden, output </li></ul></ul></ul></ul><ul><ul><ul><li>Backpropagation neural networks </li></ul></ul></ul><ul><li>Support Vector Machines </li></ul><ul><ul><li>Separate database to mutually exclusive regions </li></ul></ul><ul><ul><ul><li>Transform to another problem space </li></ul></ul></ul><ul><ul><ul><li>Kernel functions (dot product) </li></ul></ul></ul><ul><ul><ul><li>Output of new points predicted by position </li></ul></ul></ul><ul><li>Comparison with classification trees </li></ul><ul><ul><li>Not possible to know which features or combination of features most influence a prediction </li></ul></ul>
  47. 52. Multilayer Perceptrons <ul><li>Non-linear transfer functions to weighted sums of inputs </li></ul><ul><li>Werbos algorithm </li></ul><ul><ul><li>Random weights </li></ul></ul><ul><ul><li>Training set, Testing set </li></ul></ul>
  48. 53. Support Vector Machines <ul><li>3 steps </li></ul><ul><ul><li>Support Vector creation </li></ul></ul><ul><ul><li>Maximal distance between points found </li></ul></ul><ul><ul><li>Perpendicular decision boundary </li></ul></ul><ul><li>Allows some points to be misclassified </li></ul><ul><li>Pima Indian data with X1(glucose) X2(BMI) </li></ul>
  49. 54. What is Association Rule Mining? <ul><li>Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories </li></ul><ul><li>Example of Association Rules: { High LDL, Low HDL } → { Heart Failure } </li></ul><ul><li>People who have high LDL (“bad” cholesterol) and low HDL (“good” cholesterol) are at higher risk of heart failure. </li></ul><ul><li>Example data (PatientID: Conditions) </li></ul><ul><ul><li>1: High LDL, Low HDL, High BMI, Heart Failure </li></ul></ul><ul><ul><li>2: High LDL, Low HDL, Heart Failure, Diabetes </li></ul></ul><ul><ul><li>3: Diabetes </li></ul></ul><ul><ul><li>4: High LDL, Low HDL, Heart Failure </li></ul></ul><ul><ul><li>5: High BMI, High LDL, Low HDL, Heart Failure </li></ul></ul>
  50. 55. Association Rule Mining <ul><li>Market Basket Analysis </li></ul><ul><ul><li>Groups of items that are bought together are placed together </li></ul></ul><ul><ul><li>Healthcare </li></ul></ul><ul><ul><ul><li>Understanding associations among patients with demands for similar treatments and services </li></ul></ul></ul><ul><ul><li>Goal : find items for which joint probability of occurrence is high </li></ul></ul><ul><ul><li>Basket of binary valued variables </li></ul></ul><ul><ul><li>Results form association rules, augmented with support and confidence </li></ul></ul>
  51. 56. Association Rule Mining <ul><li>Association Rule </li></ul><ul><ul><li>An implication expression of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅ </li></ul></ul><ul><li>Rule Evaluation Metrics </li></ul><ul><ul><li>Support (s): Fraction of transactions that contain both X and Y </li></ul></ul><ul><ul><li>Confidence (c): Measures how often items in Y appear in transactions that contain X </li></ul></ul><ul><li>[Figure: Venn diagram over the transaction database D showing transactions containing X, transactions containing Y, and transactions containing both X and Y] </li></ul>
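Support and confidence can be computed directly for the rule { High LDL, Low HDL } → { Heart Failure } on the five-patient example given earlier. The sketch assumes that "High LDL Low HDL" in that table denotes two separate conditions; the function names are illustrative.

```python
# Conditions per PatientID, transcribed from the five-patient example.
TRANSACTIONS = {
    1: {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
    2: {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
    3: {"Diabetes"},
    4: {"High LDL", "Low HDL", "Heart Failure"},
    5: {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
}


def support(x, y, transactions):
    """Fraction of transactions that contain every item of both X and Y."""
    both = sum(1 for t in transactions.values() if (x | y) <= t)
    return both / len(transactions)


def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    with_x = [t for t in transactions.values() if x <= t]
    return sum(1 for t in with_x if y <= t) / len(with_x)


X = {"High LDL", "Low HDL"}
Y = {"Heart Failure"}
rule_support = support(X, Y, TRANSACTIONS)        # 4 of 5 patients
rule_confidence = confidence(X, Y, TRANSACTIONS)  # 4 of the 4 patients with X
```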
  52. 57. The Apriori Algorithm <ul><li>Starts with most frequent 1-itemset </li></ul><ul><li>Include only those “items” that pass threshold </li></ul><ul><li>Use 1-itemset to generate 2-itemsets </li></ul><ul><li>Stop when threshold not satisfied by any itemset </li></ul><ul><li>L 1 = {frequent items}; </li></ul><ul><li>for (k = 1; L k != ∅; k++) do </li></ul><ul><ul><li>Candidate Generation: C k+1 = candidates generated from L k ; </li></ul></ul><ul><ul><li>Candidate Counting: for each transaction t in database do increment the count of all candidates in C k+1 that are contained in t </li></ul></ul><ul><ul><li>L k+1 = candidates in C k+1 with min_sup </li></ul></ul><ul><li>return ∪ k L k ; </li></ul>
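  53. 58. Apriori-based Mining <ul><li>Database D (TID: Items), Min_sup = 0.5 </li></ul><ul><ul><li>10: a, c, d </li></ul></ul><ul><ul><li>20: b, c, e </li></ul></ul><ul><ul><li>30: a, b, c, e </li></ul></ul><ul><ul><li>40: b, e </li></ul></ul><ul><li>Scan D, 1-candidates (Itemset: Sup): a: 2, b: 3, c: 3, d: 1, e: 3 </li></ul><ul><li>Freq 1-itemsets: a: 2, b: 3, c: 3, e: 3 </li></ul><ul><li>2-candidates: ab, ac, ae, bc, be, ce </li></ul><ul><li>Scan D, counting: ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2 </li></ul><ul><li>Freq 2-itemsets: ac: 2, bc: 2, be: 3, ce: 2 </li></ul><ul><li>3-candidates: bce </li></ul><ul><li>Scan D, Freq 3-itemsets: bce: 2 </li></ul>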
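The level-wise search in this example can be reproduced with a compact Python sketch of Apriori. It is a simplified illustration rather than an optimized implementation: candidates are generated by joining frequent k-itemsets and pruned when any k-subset is infrequent, and support is counted by scanning every transaction. The function name and data encoding are my own.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}  # 1-candidates
    frequent = {}
    k = 1
    while level:
        # Candidate counting: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: s for c, s in counts.items() if s / n >= min_sup}
        frequent.update(current)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only those whose k-subsets are all frequent.
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        level = {
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent


# Database D from the slide (TIDs 10, 20, 30, 40), Min_sup = 0.5.
DATABASE = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
FREQUENT = apriori(DATABASE, 0.5)
```

On this database the sketch finds the same frequent itemsets as the slide, ending with {b, c, e} at support count 2.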
  54. 59. Principal Component Analysis <ul><li>Principal Components </li></ul><ul><ul><li>With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other. Reduce variables but retain variability in dataset </li></ul></ul><ul><ul><li>Linear combinations of variables in the database </li></ul></ul><ul><ul><ul><li>Variance of each PC maximized </li></ul></ul></ul><ul><ul><ul><ul><li>Display as much spread of the original data </li></ul></ul></ul></ul><ul><ul><ul><li>PC orthogonal with each other </li></ul></ul></ul><ul><ul><ul><ul><li>Minimize the overlap in the variables </li></ul></ul></ul></ul><ul><ul><ul><li>Each component is normalized so that its sum of squares is unity </li></ul></ul></ul><ul><ul><ul><ul><li>Easier for mathematical analysis </li></ul></ul></ul></ul><ul><ul><li>Number of PC < Number of variables </li></ul></ul><ul><ul><ul><li>Associations found </li></ul></ul></ul><ul><ul><ul><li>Small number of PC explain large amount of variance </li></ul></ul></ul><ul><ul><li>Example 768 female Pima Indians evaluated for diabetes </li></ul></ul><ul><ul><ul><li>Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, Diastolic blood pressure, Triceps skin fold thickness, Two-hour serum insulin, BMI, Diabetes pedigree function, Age, Diabetes onset within last 5 years </li></ul></ul></ul>
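The idea of a unit-length linear combination with maximal variance can be illustrated with a small power-iteration sketch for the first principal component. This swaps in power iteration on the sample covariance matrix as a simple stand-in for a full eigendecomposition; the start vector, the iteration count and the sample data are hypothetical choices, and the slides do not say which method was used.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, fraction of total variance explained) for PC 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance along the start direction")
        v = [x / norm for x in w]  # keep the sum of squares equal to one
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    total = sum(cov[a][a] for a in range(d))
    return v, variance / total


# Hypothetical, perfectly correlated variables: the second is twice the first.
ROWS = [(1, 2), (2, 4), (3, 6), (4, 8)]
component, explained = first_principal_component(ROWS)
```

Because the two variables here are perfectly correlated, a single component along (1, 2) captures all of the variance, which is the "fewer components than variables" effect in its most extreme form.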
  55. 60. PCA Example
  56. 61. National Cancer Institute <ul><li>CancerNet http://www.nci.nih.gov </li></ul><ul><li>CancerNet for Patients and the Public </li></ul><ul><li>CancerNet for Health Professionals </li></ul><ul><li>CancerNet for Basic Researchers </li></ul><ul><li>CancerLit </li></ul>
  57. 62. Conclusion <ul><li>Medical records of about ¾ billion people are electronically available </li></ul><ul><li>Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal and social constraints </li></ul><ul><li>The most commonly used techniques are classification and prediction, with different techniques applied to different cases </li></ul><ul><li>Association rules describe the data in the database </li></ul><ul><li>Medical data mining can be the most rewarding despite the difficulty </li></ul>
  58. 63. <ul><li>Thank you !!! </li></ul>