Introduction to Machine Learning and Data Mining

  1. Introduction to Machine Learning and Data Mining
     Prof. Carla Brodley, Computer Science, Tufts University, Fall 2009

     Course Overview
       • Syllabus
       • Goals
       • Evaluation
       • Deadlines

     Machine Learning, Carla Brodley, Tufts University
  2. Course Objectives
     The goal of this course is to introduce students to current machine learning and data mining methods. It is intended to prepare students for upper-level courses and to give them the knowledge to apply machine learning/data mining to science, medicine and engineering. In particular, students will gain:
       • A general background in the state of the art in ML
       • Experience in how to conduct experiments and evaluate learning performance
       • Knowledge of how to use and extend current publicly available packages
       • An introduction to reading research papers

     Tom Mitchell’s Definition of Learning
     A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

     Example 1: Learn to play checkers
       • T: play checkers and win
       • P: % of games won in the world tournament
       • E: opportunity to play against self

     Example 2: Learn to detect SPAM
       • T: distinguish between SPAM and HAM
       • P: % of emails correctly classified
       • E: labeled emails from your friend Robin
  3. Knowledge Discovery in Databases
     The process of extracting valid, previously unknown, and ultimately comprehensible information from large databases.

     What is Data Mining?
     [Figure from Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, Advances in Knowledge Discovery and Data Mining, 1996.]
  4. Learning from Data
       • Supervised Learning: each example has a label (discrete or continuous)
       • Reinforcement Learning: feedback after a sequence of actions/decisions
       • Unsupervised Learning: no feedback, goal is to group data into similar groups

     Supervised Learning: Classification
       • Given a set of examples (training data), each described by a set of attributes, and labeled with a class
       • Find a model for the class attribute as a function of the values of other attributes
       • Goal: classify previously unseen data accurately
  5. Classification Example
     Attribute types: Refund = categorical, Marital Status = categorical, Taxable Income = continuous, Cheat = class

     Training Set
     Tid  Refund  Marital Status  Taxable Income  Cheat
     1    Yes     Single          125K            No
     2    No      Married         100K            No
     3    No      Single          70K             No
     4    Yes     Married         120K            No
     5    No      Divorced        95K             Yes
     6    No      Married         60K             No
     7    Yes     Divorced        220K            No
     8    No      Single          85K             Yes
     9    No      Married         75K             No
     10   No      Single          90K             Yes

     Test Set
     Refund  Marital Status  Taxable Income  Cheat
     No      Single          75K             ?
     Yes     Married         50K             ?
     No      Married         150K            ?
     Yes     Divorced        90K             ?
     No      Single          40K             ?
     No      Married         80K             ?

     [Diagram: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set.]
     Adapted from slides by Tan, Steinbach and Kumar

     Classification Applications
     Fraud Detection: predict fraudulent cases in credit card transactions for a particular account
       • Training data: previous transactions of a particular account holder
       • Attributes: time of purchase, product type, cost, location, etc.
       • Class: label transactions as fraud or fair
     Telephone Operator Support: determine whether an arbitrary caller said “yes” or “no”
       • Training data: signal of caller’s voice accepting or declining an offer
       • Attributes: features computed from the signal
       • Class: label each signal as “yes” or “no”
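
     The train-then-apply workflow in the classification example can be sketched in a few lines of Python. The `predict` function below is a hypothetical hand-written rule standing in for a learned model (the slides do not say which model or learning algorithm was used), and the names `TRAINING`, `TEST` and `accuracy` are illustrative assumptions.

```python
# Training rows from the slide: (refund, marital_status, taxable_income_in_K, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test rows from the slide; the class label is unknown ("?").
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k):
    """Hypothetical hand-written model, not the output of a learning algorithm."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"


def accuracy(model, labeled_rows):
    """Fraction of labeled rows whose class the model predicts correctly."""
    correct = sum(model(*row[:3]) == row[3] for row in labeled_rows)
    return correct / len(labeled_rows)


training_accuracy = accuracy(predict, TRAINING)
test_predictions = [predict(*row) for row in TEST]
print(training_accuracy, test_predictions)
```

     Fitting the training rows says nothing by itself about accuracy on unseen data, which is why the slide keeps a separate test set.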
  6. Supervised Learning: Regression
       • Predict the value of a given continuous-valued variable based on the values of the attributes
       • Well studied in statistics and neural networks; the recent focus in machine learning is on non-linear models using SVMs
       • Examples:
           • Predicting sales amounts of a new product based on advertising expenditure
           • Predicting your score on Netflix
           • Time series prediction of stock market indices

     Unsupervised Learning: Clustering
       • Given a set of data points, each described by a set of attributes, find clusters such that:
           • Intra-cluster similarity is maximized
           • Inter-cluster similarity is minimized
       • Requires the definition of a similarity measure
     [Figure: scatter plot of data points (x) forming clusters, plotted on axes F1 and F2]
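
     As a rough illustration of clustering, the sketch below runs a minimal k-means loop, using Euclidean distance as the (dis)similarity measure. k-means is only one possible clustering algorithm and is not named on the slide; the data points, the starting centroids and the function names are all made up for this example. A good result places points in the same cluster close together and points in different clusters far apart.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical 2-D data (features F1 and F2) with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

     The result depends on the distance function and on the starting centroids, which is one reason the slide stresses that a similarity measure must be defined first.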
  7. Clustering Applications
     Goal: divide customers into distinct groups based on behavior or demographics, with the goal of selecting a marketing target
       • Training data: detailed record of transactions, web behavior, demographics, etc.
       • Attributes: web pages visited, call frequency, length of call, financial status, marital status, size of investment, etc.
     Online recommender systems (Netflix, Amazon, Perseus Digital Library)

     Clustering Application: Energy Use Profiles
     Goal: identify similar energy-use customer profiles to improve billing scheme
       • Training data: energy use profiles of commercial customers
       • Attributes: time series of energy usage

     Cust  12:00  1:00  …
     1     45.5   65.2  …
     2     34.2   76.3  …

     Examine customer demographics of each cluster
  8. What types of data are there?
     What types of features describe each example?
       • Discrete: town that you live in (e.g., Somerville, Medford, Boston, Cambridge)
       • Continuous: salary
       • Ordinal: age
       • Relational: sister of
     How are data points related?
       • Independent: each represents a different student
       • Not independent: financial indicators for a particular day are related to the previous day

     Classification: Example Dataset

     Age  Education     Marital Status  Race   …  Gender  Status
     39   Bachelors     Never-married   White  …  Male    Poor
     50   Bachelors     Married         White  …  Male    Poor
     38   HS-grad       Divorced        White  …  Male    Poor
     53   11th          Married         Black  …  Male    Poor
     28   Bachelors     Married         Black  …  Female  Poor
     37   Masters       Married         White  …  Female  Poor
     52   HS-grad       Married         White  …  Male    Rich
     31   Masters       Never-married   White  …  Female  Rich
     42   Bachelors     Married         White  …  Male    Rich
     37   Some-college  Married         Black  …  Male    Rich
     30   Bachelors     Married         Asian  …  Male    Rich
     23   Bachelors     Never-married   White  …  Female  Poor
     32   Assoc-acdm    Never-married   Black  …  Male    Poor
     40   Assoc-voc     Married         Asian  …  Male    Rich
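
     Many learning algorithms expect numeric input, so rows mixing numeric and categorical features are usually encoded first. The sketch below shows one common choice, one-hot encoding of categorical values, applied to three columns of the example dataset. The encoding scheme and the names `encode_row` and `encode_label` are assumptions for illustration; the slides do not prescribe a representation.

```python
# Category lists taken from the values visible in the example dataset.
MARITAL = ["Never-married", "Married", "Divorced"]
GENDER = ["Male", "Female"]


def one_hot(value, categories):
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_row(age, marital_status, gender):
    """Numeric features are kept as numbers; categorical ones are one-hot encoded."""
    return [float(age)] + one_hot(marital_status, MARITAL) + one_hot(gender, GENDER)


def encode_label(status):
    """Map the class label to 1 (Rich) or 0 (Poor)."""
    return 1 if status == "Rich" else 0


# First row of the table: 39, Bachelors, Never-married, White, ..., Male, Poor
first_row = encode_row(39, "Never-married", "Male")
first_label = encode_label("Poor")
print(first_row, first_label)
```

     One-hot encoding avoids implying an order among categories such as marital status, which a single integer code would do.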
  9. Classification Application: Census Data
       • Given a set of examples (census data from 1990), each described by a set of attributes, and labeled as either rich or poor
       • Two types of attributes:
           • Categorical: attributes that take on one of a set of values (e.g., race, marital status)
           • Numeric: real-valued attributes
       • Find a model for the class attribute (wealth) as a function of the values of other attributes (employment, marital status, education level, age, …)
       • Goal: predict the wealth of people not in the training data

     Appropriate Applications for Supervised Learning
       • Situations in which there is no human expert
       • Situations where a human can perform the task but cannot describe how they do it
       • Situations where the desired function is changing frequently
       • Situations where each user needs a customized function
  10. An Example Learning Problem

      Inst.  X1  X2  X3  X4  y
      1      0   0   1   0   0
      2      0   1   0   0   0
      3      0   0   1   1   1
      4      1   0   0   1   1
      5      0   1   1   0   0
      6      1   1   0   0   0
      7      0   1   0   1   0
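
      The seven instances in this table do not determine a unique function of X1..X4. As a hedged illustration (this counting argument is not stated in the surviving slide text), the sketch below counts how many Boolean functions agree with the table and checks one hypothetical candidate, `y = X4 AND NOT X2`, which is only one of the consistent choices.

```python
from itertools import product

# The seven labeled instances from the table: (X1, X2, X3, X4) -> y
examples = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

# Every possible input over four binary attributes (2**4 = 16 of them).
all_inputs = list(product([0, 1], repeat=4))
unseen = [x for x in all_inputs if x not in examples]

# A consistent function may output 0 or 1 freely on each unseen input.
n_consistent = 2 ** len(unseen)


def candidate(x1, x2, x3, x4):
    """Hypothetical hypothesis: y = X4 AND NOT X2."""
    return int(x4 == 1 and x2 == 0)


is_consistent = all(candidate(*x) == y for x, y in examples.items())
print(len(unseen), n_consistent, is_consistent)
```

      Because so many functions fit the data equally well, a learner needs some additional assumption (a restricted hypothesis space or a preference among hypotheses) to choose one.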
  13. Classification: k-Nearest Neighbor
      [Figure, shown on two slides: scatter plot of training points from two classes (o and x) with an unlabeled query point (?)]
  14. Classification: k-Nearest Neighbor
      [Figure: the query point (?) among its nearest o and x neighbors]
      Assign majority class of the k nearest neighbors

      Real World Issues and k-NN
        • Non-uniform costs?
        • Missing values?
        • Noise in class label?
  15. Real World Issues and k-NN
        • Non-uniform costs?
            • Weight votes by cost
        • Missing values?
            • Take mean or class mean
        • Noise in class label?
            • Increase k

      k-Nearest Neighbor Issues
        • Computation: must look at distance of query to every point
        • Choosing k
        • Effect of outliers and noise
        • Euclidean distance metric
            • requires normalization
            • problems in high dimensions
            • treats all features as equally important
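
      A minimal sketch of k-nearest-neighbor classification with majority voting follows. It uses Euclidean distance and min-max normalization so that a feature with a large numeric range does not dominate the distance. The data, the function names and the choice of min-max scaling are assumptions for illustration; the slides only say that normalization is required. Requires Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Scale every feature to [0, 1]; also return the scaling function
    so that query points can be transformed the same way."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(row):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in rows], scale


def knn_classify(query, points, labels, k=3):
    """Assign the majority class among the k training points nearest to query.
    Every training point is examined, which is the computational cost
    noted on the slide."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two features on very different scales.
raw_points = [[1.0, 200.0], [1.5, 180.0], [2.0, 220.0],
              [6.0, 900.0], [6.5, 950.0], [7.0, 870.0]]
labels = ["o", "o", "o", "x", "x", "x"]

points, scale = min_max_normalize(raw_points)
prediction_a = knn_classify(scale([2.5, 300.0]), points, labels, k=3)
prediction_b = knn_classify(scale([6.2, 880.0]), points, labels, k=3)
print(prediction_a, prediction_b)
```

      Using an odd k with two classes avoids tied votes, and a larger k makes the vote less sensitive to a single mislabeled neighbor, in line with the "increase k" advice above.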
