Data Mining in eCommerce Web-Based Information Architectures

1. Machine Learning and Data Mining, 15-381, 3-April-2003, Jaime Carbonell
2. General Topic: Data Mining
- Typology of Machine Learning
- Data Bases (brief review/intro)
- Data Mining (DM)
- Supervised Learning Methods in DM
- Evaluating ML/DM Systems
3. Typology of Machine Learning Methods (1)
- Learning by caching
  - What/when to cache
  - When to use/invalidate/update cache
- Learning from Examples (aka "Supervised" learning)
  - Labeled examples for training
  - Learn the mapping from examples to labels
  - E.g.: Naive Bayes, Decision Trees, ...
  - Text Categorization (using kNN or other means) is a learning-from-examples task
4. Typology of Machine Learning Methods (2)
- "Speedup" Learning
  - Tuning search heuristics from experience
  - Inducing explicit control knowledge
  - Analogical learning (generalized instances)
- Optimization "policy" learning
  - Predicting a continuous objective function
  - E.g. Regression, Reinforcement, ...
- New Pattern Discovery (aka "Unsupervised" Learning)
  - Finding meaningful correlations in data
  - E.g. association rules, clustering, ...
5. Data Bases in a Nutshell (1)
- Ingredients
- A Data Base is a set of one or more rectangular tables (aka "matrices", "relational tables").
- Each table consists of m records (aka "tuples").
- Each of the m records consists of n values, one for each of the n attributes.
- Each column in the table consists of all the values for the attribute it represents.
6. Data Bases in a Nutshell (2)
- Ingredients
- A data-table scheme is just the list of table column headers in their left-to-right order. Think of it as a table with no records.
- A data-table instance is the content of the table (i.e. a set of records) consistent with the scheme.
- For real data bases: m >> n.
7. Data Bases in a Nutshell (3)
- A Generic DB table:

                Attr_1   Attr_2   ...   Attr_n
    Record-1    t_1,1    t_1,2    ...   t_1,n
    Record-2    t_2,1    t_2,2    ...   t_2,n
    ...
    Record-m    t_m,1    t_m,2    ...   t_m,n
8. Example DB tables (1)
- Customer DB Table
- Customer-Schema = (SSN, Name, YOB, DOA, user-id)

    SSN           Name    YOB   DOA       user-id
    110-20-3003   Smith   1954  12-07-99  asmith
    034-67-1188   Jones   1962  11-02-99  jjones
    404-10-1111   Suzuki  1948  24-04-00  suzuki
    333-10-0066   Smith   1972  24-04-00  asmith2
    ...           ...     ...   ...       ...
9. Example DB tables (2)
- Transaction DB table
- Transaction-Schema = (user-id, DOT, product, help, tcode, price)

    user-id  DOT       product    help  tcode  price
    asmith2  24-04-00  book-2241  N     10001  23.95
    asmith2  25-04-00  CD-1129    N     10002  18.95
    suzuki   25-04-00  book-5011  Y     10003  44.50
    asmith2  30-04-00  CD-1129    N     10004  18.95
    asmith2  30-04-00  CD-1131    N     10005  19.95
    jjones   01-05-00  *err*      Y     10006   0.00
    suzuki   05-05-00  book-7702  N     10007  39.95
    jjones   05-05-00  CD-2380    Y     10008  12.95
    asmith2  06-05-00  CD-2380    N     10009  21.95
    jjones   09-05-00  book-1922  Y     10010   7.95
    ...      ...       ...        ...   ...    ...
10. Data Bases Facts (1)
- DB Tables
- m ≤ O(10^6), n ≤ O(10^2)
- The matrix T_i,j (a DB "table") is dense
- Each t_i,j is any scalar data type (real, integer, boolean, string, ...)
- All entries in a given column of a DB-table must have the same data type.
11. Data Bases Facts (2)
- DB Query
- Relational algebra query system (e.g. SQL)
- Retrieves individual records, subsets of tables, or information linked across tables (DB joins on unique fields)
- See the optional DB textbook for details
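A join of the two example tables can be sketched with Python's built-in sqlite3. The schema follows the slides' Customer and Transaction tables (column names adapted to SQL identifiers), with only a subset of the rows loaded:

```python
import sqlite3

# In-memory sketch of the two example tables from the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (ssn TEXT, name TEXT, yob INT, doa TEXT, user_id TEXT)")
conn.execute("CREATE TABLE txn (user_id TEXT, dot TEXT, product TEXT, help TEXT, tcode INT, price REAL)")
conn.executemany("INSERT INTO customer VALUES (?,?,?,?,?)", [
    ("110-20-3003", "Smith", 1954, "12-07-99", "asmith"),
    ("034-67-1188", "Jones", 1962, "11-02-99", "jjones"),
])
conn.executemany("INSERT INTO txn VALUES (?,?,?,?,?,?)", [
    ("jjones", "05-05-00", "CD-2380",   "Y", 10008, 12.95),
    ("jjones", "09-05-00", "book-1922", "Y", 10010,  7.95),
])

# A relational join links the tables on the shared key user_id.
rows = conn.execute("""
    SELECT c.name, t.product, t.price
    FROM customer c JOIN txn t ON c.user_id = t.user_id
    ORDER BY t.tcode
""").fetchall()
print(rows)  # [('Jones', 'CD-2380', 12.95), ('Jones', 'book-1922', 7.95)]
```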
12. Data Base Design Issues (1)
- Design Issues
- What additional table(s) are needed?
- Why do we need multiple DB tables?
- Why not encode everything into one big table?
- How do we search a DB table?
- How about the full DB?
- How do we update a DB instance?
- How do we update a DB schema?
13. Data Base Design Issues (2)
- Unique keys
- Any column can serve as a search key
- Superkey = unique record identifier
  - user-id and SSN for the customer table
  - tcode for the transaction table
- Sometimes a superkey = 2 or more keys, e.g.: nationality + passport-number
- Candidate Key = minimal superkey = unique key
- Keys are used for cross-products and joins
14. Data Base Design Issues (3)
- Drops and errors
- Missing data -- always happens
- Erroneously entered data (type checking, range checking, consistency checking, ...)
15. Data Base Design Issues (4)
- Text Mining
- Rows in T_m,n are document vectors
- n = vocabulary size = O(10^5)
- m = number of documents = O(10^5)
- T_m,n is sparse
- Same data type for every cell t_i,j in T_m,n
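Because T_m,n is sparse, a document vector is usually stored as a map of nonzero term counts rather than a full row. A minimal sketch (the vocabulary and counts here are made up for illustration):

```python
from math import sqrt

# Sparse document vectors: only nonzero term counts are stored.
doc1 = {"data": 3, "mining": 2, "sales": 1}
doc2 = {"data": 1, "mining": 1, "bases": 2}

def cosine(u, v):
    """Cosine similarity over sparse vectors: only shared terms contribute."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

sim = cosine(doc1, doc2)   # dot = 3*1 + 2*1 = 5
```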
16. DATA MINING [Supervised] (1)
- Given:
  - A data base table T_m,n
  - Predictor attributes: t_j
  - Predicted attributes: t_k (k ≠ j)
- Find predictor functions F_k: t_j -> t_k such that, for each k:

    F_k = argmin_{F_l,k} Error[F_l,k(t_j), t_k]

  where Error is measured in the L2 norm (or L1, or L-infinity norm, ...)
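A minimal instance of this argmin: restricting F_k to lines a*x + b and minimizing L2 error gives closed-form least squares. The training pairs below are invented for illustration:

```python
# Toy training pairs (predictor value, predicted value), roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

# Closed-form least squares: the (a, b) minimizing sum((a*x + b - y)^2).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

l2_error = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
```

Any other line would give a larger `l2_error`; that is exactly the argmin condition above, specialized to the linear family.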
17. DATA MINING [Supervised] (2)
- Where typically:
- There is only one t_k of interest and therefore only one F_k(t_j)
- t_k may be boolean => F_k is a binary classifier
- t_k may be nominal (finite set) => F_k is an n-ary classifier
- t_k may be a real number => F_k is an approximating function
- t_k may be an arbitrary string => F_k is hard to formalize
18. DATA MINING APPLICATIONS (1)
- FINANCE:
  - Credit-card & Loan Fraud Detection
  - Time Series Investment Portfolio
  - Credit Decisions & Collections
- HEALTHCARE:
  - Decision Support: optimal treatment choice
  - Survivability Predictions
  - Medical facility utilization predictions
19. DATA MINING APPLICATIONS (2)
- MANUFACTURING:
  - Numerical Controller Optimizations
  - Factory Scheduling Optimization
- MARKETING & SALES:
  - Demographic Segmentation
  - Marketing Strategy Effectiveness
  - New Product Market Prediction
  - Market-basket analysis
20. Simple Data Mining Example (1)

    Acct.  Income   Job   Tot Num       Max Delinq  Owns   Num Credit  Final
    numb.  in K/yr  Now?  Delinq accts  cycles      home?  years       disp.
    ------------------------------------------------------------------------
    1001   25       Y     1             1           N      2           Y
    1002   60       Y     3             2           Y      5           N
    1003   ?        N     0             0           N      2           N
    1004   52       Y     1             2           N      9           Y
    1005   75       Y     1             6           Y      3           Y
    1006   29       Y     2             1           Y      1           N
    1007   48       Y     6             4           Y      8           N
    1008   80       Y     0             0           Y      0           Y
    1009   31       Y     1             1           N      1           Y
    1011   45       Y     ?             0           ?      7           Y
    1012   59       ?     2             4           N      2           N
    1013   10       N     1             1           N      3           N
    1014   51       Y     1             3           Y      1           Y
    1015   65       N     1             2           N      8           Y
    1016   20       N     0             0           N      0           N
    1017   55       Y     2             3           N      2           N
    1018   40       N     0             0           Y      1           Y
21. Simple Data Mining Example (2)

    Acct.  Income   Job   Tot Num       Max Delinq  Owns   Num Credit  Final
    numb.  in K/yr  Now?  Delinq accts  cycles      home?  years       disp.
    ------------------------------------------------------------------------
    1019   80       Y     1             1           Y      0           Y
    1021   18       Y     0             0           N      4           Y
    1022   53       Y     3             2           Y      5           N
    1023   0        N     1             1           Y      3           N
    1024   90       N     1             3           Y      1           Y
    1025   51       Y     1             2           N      7           Y
    1026   20       N     4             1           N      1           N
    1027   32       Y     2             2           N      2           N
    1028   40       Y     1             1           Y      1           Y
    1029   31       Y     0             0           N      1           Y
    1031   45       Y     2             1           Y      4           Y
    1032   90       ?     3             4           ?      ?           N
    1033   30       N     2             1           Y      2           N
    1034   88       Y     1             2           Y      5           Y
    1035   65       Y     1             4           N      5           Y
    1036   12       N     1             1           N      1           N
22. Simple Data Mining Example (3)

    Acct.  Income   Job   Tot Num       Max Delinq  Owns   Num Credit  Final
    numb.  in K/yr  Now?  Delinq accts  cycles      home?  years       disp.
    ------------------------------------------------------------------------
    1037   28       Y     3             3           Y      2           N
    1038   66       ?     0             0           ?      ?           Y
    1039   50       Y     2             1           Y      1           Y
    1041   ?        Y     0             0           Y      8           Y
    1042   51       N     3             4           Y      2           N
    1043   20       N     0             0           N      2           N
    1044   80       Y     1             3           Y      7           Y
    1045   51       Y     1             2           N      4           Y
    1046   22       ?     ?             ?           N      0           N
    1047   39       Y     3             2           ?      4           N
    1048   70       Y     0             0           ?      1           Y
    1049   40       Y     1             1           Y      1           Y
    ------------------------------------------------------------------------
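The mining task on this table is to find a function of the predictor columns that reproduces the final disposition. A hand-written candidate rule, checked against a few of the records above, sketches the idea; the rule itself is illustrative, not one derived in the slides:

```python
# A few credit records from the table above:
# (acct, income, job_now, tot_delinq_accts, max_delinq_cycles, owns_home, credit_years, final)
records = [
    (1001, 25, "Y", 1, 1, "N", 2, "Y"),
    (1002, 60, "Y", 3, 2, "Y", 5, "N"),
    (1007, 48, "Y", 6, 4, "Y", 8, "N"),
    (1008, 80, "Y", 0, 0, "Y", 0, "Y"),
]

def rule(r):
    """Illustrative rule: grant credit if employed and at most 1 delinquent account."""
    _, income, job, delinq, _, _, _, _ = r
    return "Y" if job == "Y" and delinq <= 1 else "N"

# Fraction of these records where the rule matches the final disposition.
accuracy = sum(rule(r) == r[-1] for r in records) / len(records)
```

A real induction algorithm would search for rules like this automatically, then validate them on records held out from training.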
23. Trend Detection in DM (1)
- Example: Sales Prediction
  - 2003 Q1 sales = 4.0M
  - 2003 Q2 sales = 3.5M
  - 2003 Q3 sales = 3.0M
  - 2003 Q4 sales = ??
24. Trend Detection in DM (2)
- Now if we knew last year:
  - 2002 Q1 sales = 3.5M
  - 2002 Q2 sales = 3.1M
  - 2002 Q3 sales = 2.8M
  - 2002 Q4 sales = 4.5M
- And if we knew the previous year:
  - 2001 Q1 sales = 3.2M
  - 2001 Q2 sales = 2.9M
  - 2001 Q3 sales = 2.5M
  - 2001 Q4 sales = 3.7M
25. Trend Detection in DM (3)
- What will 2003 Q4 sales be?
- What if Christmas 2002 was cancelled?
- What will 2004 Q4 sales be?
26. Trend Detection in DM II (1)
- Methods
- Numerical series extrapolation
- Cyclical curve fitting
  - Find the period of the cycle
  - Fit a curve for each period (often with the L2 or L-infinity norm)
  - Find the translation (series extrapolation)
  - Extrapolate to estimate the desired values
- Preclassify data first (e.g. "recession" and "expansion" years)
- Combine with "standard" data mining
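The cyclical-fit idea can be sketched on the quarterly sales figures from the previous slides: estimate a Q4 seasonal lift from the two complete years, then apply it to 2003. The ratio-based cycle model here is one simple illustrative choice, not a method prescribed by the slides:

```python
# Quarterly sales (in $M) from the trend-detection example.
sales = {
    2001: [3.2, 2.9, 2.5, 3.7],
    2002: [3.5, 3.1, 2.8, 4.5],
    2003: [4.0, 3.5, 3.0],   # Q4 unknown -- the value we want to predict
}

# Seasonal lift: ratio of Q4 to the mean of Q1-Q3, averaged over complete years.
ratios = [sales[y][3] / (sum(sales[y][:3]) / 3) for y in (2001, 2002)]
q4_lift = sum(ratios) / len(ratios)

# Apply the lift to 2003's Q1-Q3 mean to extrapolate Q4 (comes out near 4.8M).
est_2003_q4 = q4_lift * (sum(sales[2003]) / 3)
```

Note how the cycle (the Christmas-quarter spike) is captured separately from the year-over-year growth, which is what makes the naive three-quarter downward trend misleading.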
27. Trend Detection in DM II (2)
- Thorny Problems
- How to use external knowledge to make up for limitations in the data?
- How to make longer-range extrapolations?
- How to cope with corrupted data?
  - Random point errors (easy)
  - Systematic error (hard)
  - Malicious errors (impossible)
28. Methods for Supervised DM (1)
- Classifiers
  - Linear Separators (regression)
  - Naive Bayes (NB)
  - Decision Trees (DTs)
  - k-Nearest Neighbor (kNN)
  - Decision rule induction
  - Support Vector Machines (SVMs)
  - Neural Networks (NNs)
  - ...
29. Methods for Supervised DM (2)
- Points of Comparison
- Hard vs soft decisions (e.g. DTs and rules vs kNN, NB)
- Human-interpretable decision rules (best: rules; worst: NNs, SVMs)
- Training data needed -- less is better (best: kNN; worst: NNs)
- Graceful data-error tolerance (best: NNs, kNN; worst: rules)
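As a concrete instance of one of the listed classifiers, here is a minimal kNN with majority voting: no training phase beyond storing examples, and a decision that softens as k grows. The training points are made up for illustration:

```python
from collections import Counter

# Labeled training instances: (feature vector, class label).
train = [((1.0, 1.0), "Y"), ((1.2, 0.8), "Y"), ((5.0, 5.0), "N"), ((4.8, 5.2), "N")]

def knn(query, k=3):
    """Classify by majority vote among the k nearest training points."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

pred = knn((1.1, 0.9))   # close to the two "Y" points
```

The vote counts themselves (rather than just the winning label) are what make kNN a "soft" classifier in the comparison above.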
30. Symbolic Rule Induction (1)
- General idea
- Labeled instances are DB tuples
- Rules are generalized tuples
- Generalization occurs at each term in the tuple
- Generalize on a new E+ (positive example) not predicted
- Specialize on a new E- (negative example) not predicted
- Ignore predicted E+ or E-
31. Symbolic Rule Induction (2)
- Example term generalizations
- Constant => disjunction (e.g. if a small portion of the value set is seen)
- Constant => least-common-generalizer class (e.g. if a large portion of the value set is seen)
- Number (or ordinal) => range (e.g. if dense sequential sampling)
32. Symbolic Rule Induction (3)
- Example term specializations
- Class => disjunction of subclasses
- Range => disjunction of sub-ranges
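The term generalizations above can be sketched as small helper functions; the 0.5 "small portion" threshold is an arbitrary illustrative choice, and the example values mirror the medical table on the next slide:

```python
def generalize_numeric(values):
    """Number (or ordinal) => range, e.g. after dense sequential sampling."""
    return (min(values), max(values))

def generalize_constant(values, value_set, threshold=0.5):
    """Constant => disjunction if only a small portion of the value set is seen;
    otherwise generalize all the way to the enclosing class (*any*)."""
    seen = set(values)
    if len(seen) / len(value_set) <= threshold:
        return sorted(seen)          # disjunction of the observed constants
    return "*any*"                   # large portion seen: take the whole class

age_term = generalize_numeric([12, 25, 36, 65])                        # -> (12, 65)
loc_term = generalize_constant(["USA", "CAN"], {"USA", "CAN", "BRA", "MEX"})
```

Specialization runs the same moves in reverse: a range splits into sub-ranges, a class into a disjunction of subclasses, narrowing a rule until it stops covering the offending negative example.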
33. Symbolic Rule Induction Example (1)

    Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
    65   M       101   +       .23     USA  normal  strep
    25   M       102   +       .00     CAN  normal  strep
    65   M       102   -       .78     BRA  rash    dengue
    36   F       99    -       .19     USA  normal  *none*
    11   F       103   +       .23     USA  flush   strep
    88   F       98    +       .21     CAN  normal  *none*
    39   F       100   +       .10     BRA  normal  strep
    12   M       101   +       .00     BRA  normal  strep
    15   F       101   +       .66     BRA  flush   dengue
    20   F       98    +       .00     USA  rash    *none*
    81   M       98    -       .99     BRA  rash    ec-12
    87   F       100   -       .89     USA  rash    ec-12
    12   F       102   +       ??      CAN  normal  strep
    14   F       101   +       .33     USA  normal
    67   M       102   +       .77     BRA  rash
34. Symbolic Rule Induction Example (2)
- Candidate Rules:

    IF   age    = [12,65]
         gender = *any*
         temp   = [100,103]
         b-cult = +
         c-cult = [.00,.23]
         loc    = *any*
         skin   = (normal, flush)
    THEN: strep

    IF   age    = (15,65)
         gender = *any*
         temp   = [101,102]
         b-cult = *any*
         c-cult = [.66,.78]
         loc    = BRA
         skin   = rash
    THEN: dengue

- Disclaimer: These are *not* real medical records.
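Applying the first candidate rule to a record can be sketched as follows; the dict encoding and the matcher are illustrative, with field names taken from the slide's table:

```python
# The strep candidate rule: ranges as numeric tuples, disjunctions as
# string tuples, and *any* as a wildcard.
strep_rule = {
    "age":    (12, 65),
    "gender": "*any*",
    "temp":   (100, 103),
    "b-cult": "+",
    "c-cult": (0.00, 0.23),
    "loc":    "*any*",
    "skin":   ("normal", "flush"),   # disjunction of constants
}

def matches(rule, record):
    """True if every term of the generalized tuple covers the record's value."""
    for attr, cond in rule.items():
        v = record[attr]
        if cond == "*any*":
            continue
        if isinstance(cond, tuple) and all(isinstance(c, (int, float)) for c in cond):
            if not (cond[0] <= v <= cond[1]):     # numeric range
                return False
        elif isinstance(cond, tuple):
            if v not in cond:                     # disjunction
                return False
        elif v != cond:                           # constant
            return False
    return True

# Second record of the table: 25 M 102 + .00 CAN normal (labeled strep).
patient = {"age": 25, "gender": "M", "temp": 102, "b-cult": "+",
           "c-cult": 0.00, "loc": "CAN", "skin": "normal"}
diagnosis = "strep" if matches(strep_rule, patient) else "no match"
```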
35. Evaluation of ML/DM Methods
- Split labeled data into training & test sets
- Apply ML (d-tree, rules, NB, ...) to the training set
- Measure accuracy (or P, R, F1, ...) on the test set
- Alternatives:
  - K-fold cross-validation
  - Jackknifing (aka "leave one out")
- Caveat: distributional equivalence
- Problem: temporally-sequenced data (drift)
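The k-fold alternative can be sketched as an index-splitting helper (jackknifing is the special case k = n); this is a minimal illustration, not a full evaluation harness:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists: each fold serves once as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # shuffle assumes no temporal drift!
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        yield train, test

splits = list(kfold_indices(10, 5))   # 5 folds of 2 test indices each
```

The shuffle is exactly where the "distributional equivalence" caveat bites: for temporally-sequenced data, random folds leak future information into training, and a chronological split is needed instead.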
