Introduction to Machine Learning and Data Mining - Carla Brodley, Tufts University
1. Introduction to Machine
Learning and Data Mining
Prof. Carla Brodley
Computer Science
Tufts University
Fall 2009
Course Overview
Syllabus
Goals
Evaluation
Deadlines
Machine Learning Carla Brodley, Tufts University 2
2. Course Objectives
The goal of this course is to introduce students to current
machine learning and data mining methods. It is intended
to prepare students for upper-level courses and to give
them the knowledge to apply machine learning/data
mining to science, medicine and engineering. In
particular students will gain:
• A general background in the state of the art in ML
• Experience in how to conduct experiments and
evaluate learning performance
• Knowledge of how to use and extend current publicly
available packages
• An introduction to reading research papers
Tom Mitchell’s Definition of Learning
A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.
Example 1: Learn to play checkers
• T: play checkers and win
• P: % of games won in the world tournament
• E: opportunity to play against self.
Example 2: Learn to detect SPAM
• T: Distinguish between SPAM and HAM
• P: % of emails correctly classified
• E: Labeled emails from your friend Robin
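The performance measure P in Example 2 can be made concrete in a few lines of Python. This is only a minimal sketch: the function name `accuracy` and the five labels are hypothetical, made up for illustration, and are not part of the course material.

```python
def accuracy(true_labels, predicted_labels):
    """Return the percentage of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one labeled example")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical labels for five emails (illustrative, not real data).
truth = ["spam", "ham", "ham", "spam", "ham"]
guess = ["spam", "ham", "spam", "spam", "ham"]
print(accuracy(truth, guess))  # 4 of the 5 emails are classified correctly
```

Under Mitchell's definition, a program learns if this number tends to go up as it sees more labeled emails (E).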
3. Knowledge Discovery in Databases
The process of extracting valid, previously
unknown, and ultimately comprehensible
information from large databases
What is Data Mining?
Figure is from Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy,
Advances in Knowledge Discovery and Data Mining, 1996.
4. Learning from Data
Supervised Learning: each example has a label
(discrete or continuous)
Reinforcement Learning: feedback after a
sequence of actions/decisions
Unsupervised Learning: no feedback, goal is to
group data into similar groups
Supervised Learning:
Classification
Given a set of examples (training data), each
described by a set of attributes, and labeled with
a class
Find a model for the class attribute as a function
of the values of other attributes
Goal: classify previously unseen data accurately
5. Classification Example
Training Set

Tid  Refund         Marital Status  Taxable Income  Cheat
     (categorical)  (categorical)   (continuous)    (class)
1    Yes            Single          125K            No
2    No             Married         100K            No
3    No             Single          70K             No
4    Yes            Married         120K            No
5    No             Divorced        95K             Yes
6    No             Married         60K             No
7    Yes            Divorced        220K            No
8    No             Single          85K             Yes
9    No             Married         75K             No
10   No             Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set
Adapted from slides by Tan, Steinbach and Kumar
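The training and test sets in the table above can be written down directly in Python to show what "a model for the class attribute as a function of the other attributes" looks like. The sketch below is an assumption-laden illustration: the `predict` function is a hand-written rule set, not a model produced by any learning algorithm from the slides, and its 80K income threshold is chosen by hand. The names `Record`, `TRAINING`, `TEST`, and `predict` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    refund: str          # categorical: "Yes" or "No"
    marital_status: str  # categorical: "Single", "Married", "Divorced"
    income_k: float      # continuous: taxable income in thousands


# Training set from the slide: (record, Cheat label).
TRAINING = [
    (Record("Yes", "Single", 125), "No"),
    (Record("No", "Married", 100), "No"),
    (Record("No", "Single", 70), "No"),
    (Record("Yes", "Married", 120), "No"),
    (Record("No", "Divorced", 95), "Yes"),
    (Record("No", "Married", 60), "No"),
    (Record("Yes", "Divorced", 220), "No"),
    (Record("No", "Single", 85), "Yes"),
    (Record("No", "Married", 75), "No"),
    (Record("No", "Single", 90), "Yes"),
]

# Test set from the slide: the Cheat label is unknown.
TEST = [
    Record("No", "Single", 75),
    Record("Yes", "Married", 50),
    Record("No", "Married", 150),
    Record("Yes", "Divorced", 90),
    Record("No", "Single", 40),
    Record("No", "Married", 80),
]


def predict(r):
    """Hand-written illustrative model (an assumption, not a learned one)."""
    if r.refund == "Yes":
        return "No"
    if r.marital_status == "Married":
        return "No"
    # Assumed threshold: unmarried, no refund, income of at least 80K.
    return "Yes" if r.income_k >= 80 else "No"


training_accuracy = sum(predict(r) == y for r, y in TRAINING) / len(TRAINING)
test_predictions = [predict(r) for r in TEST]
print(training_accuracy, test_predictions)
```

These rules agree with every training label, but that says little about accuracy on unseen data, which is the stated goal of classification.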
Classification Applications:
Fraud Detection: Predict fraudulent cases in credit card
transactions for a particular account
• Training data: previous transactions of a particular account
holder
• Attributes: time of purchase, product type, cost, location, etc.
• Class: Label transactions as fraud or fair
Telephone Operator Support: determine whether an
arbitrary caller said “yes” or “no”.
• Training data: Signal of caller’s voice accepting or declining an
offer.
• Attributes: features computed from the signal
• Class: Label each signal as “yes” or “no”
6. Supervised Learning:
Regression
Predict a value of a given continuous-valued
variable based on the values of the attributes
Well studied in statistics and neural networks; recent
focus in Machine Learning is on non-linear
models using SVMs
Examples:
• Predicting sales amounts of a new product based on
advertising expenditure
• Predicting your score on Netflix
• Time series prediction of stock market indices
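As a minimal sketch of regression, the code below fits a straight line by ordinary least squares to the first example (sales as a function of advertising expenditure). Note that this is the simplest linear case, not the non-linear SVM models mentioned above. The function name `fit_line` and the four data points are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one attribute)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (arbitrary units).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen spend
print(slope, intercept, predicted_sales)
```

The model here is a function from the attribute (spend) to a continuous value (sales), which is what distinguishes regression from classification.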
Unsupervised Learning:
Clustering
Given a set of data points, each described by a
set of attributes, find clusters such that:
• Intra-cluster similarity is maximized
• Inter-cluster similarity is minimized
[Figure: scatter plot of data points (x) over features F1 and F2, forming separate clusters]
Requires the definition of a similarity measure
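One way to make this concrete is k-means, sketched below under stated assumptions: the slides do not name a clustering algorithm at this point, so k-means is a swapped-in example, and similarity is assumed to be negative squared Euclidean distance. The function names, the six 2-D points, and the choice of initial centers are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; smaller means more similar."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iterations=100):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean_point(members)
    return assignment, centers


# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centers = k_means(points, [points[0], points[3]])
print(assignment, centers)
```

Changing `squared_distance` changes which points count as similar, and therefore which clusters are found, which is why the similarity measure has to be defined first.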
7. Clustering Applications
Goal: divide customers into distinct groups based
on behavior or demographics, in order to
select a marketing target
• Training data: Use detailed record of transactions,
web behavior, demographics, etc.
• Attributes: Web pages visited, call frequency, length
of call, financial status, marital status, size of
investment, etc.
Online recommender systems (Netflix, Amazon,
Perseus Digital Library)
Clustering Application:
Energy Use Profiles
Goal: identify similar energy-use customer
profiles to improve the billing scheme
• Training data: energy use profiles of commercial
customers
• Attributes: time series of energy usage
Cust 12:00 1:00 …
1 45.5 65.2 …
2 34.2 76.3 …
Examine customer demographics of each cluster
8. What types of data are there?
What types of features describe each example?
• Discrete: town that you live in (e.g., Somerville,
Medford, Boston, Cambridge)
• Continuous: salary
• Ordinal: age
• Relational: sister of
How are data points related?
• Independent: each represents a different student
• Not independent: financial indicators for a particular day
are related to the previous day
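Feature types matter in practice because most learning methods expect numbers. The sketch below shows one common way, assumed here rather than taken from the slides, to turn a discrete feature into a one-hot vector while passing continuous and ordinal features through as numbers. The names `TOWNS`, `one_hot`, and `encode`, and the example salary and age, are hypothetical.

```python
# Discrete values from the slide; the order is arbitrary and carries no meaning.
TOWNS = ["Somerville", "Medford", "Boston", "Cambridge"]


def one_hot(value, categories):
    """Encode a discrete value as a vector with a single 1.0 entry."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode(town, salary, age):
    """Build one numeric feature vector from mixed feature types."""
    # Discrete: one-hot. Continuous (salary) and ordinal (age): kept as numbers.
    return one_hot(town, TOWNS) + [float(salary), float(age)]


# Hypothetical example: a 30-year-old in Boston earning 52000.
vector = encode("Boston", 52000, 30)
print(vector)
```

One-hot encoding avoids implying an order among towns, whereas age keeps its natural order by staying a single number.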
Classification:
Example Dataset
Age  Education     Marital Status  Race   …  Gender  Wealth (class)
39   Bachelors     Never-married   White  …  Male    Poor
50   Bachelors     Married         White  …  Male    Poor
38   HS-grad       Divorced        White  …  Male    Poor
53   11th          Married         Black  …  Male    Poor
28   Bachelors     Married         Black  …  Female  Poor
37   Masters       Married         White  …  Female  Poor
52   HS-grad       Married         White  …  Male    Rich
31   Masters       Never-married   White  …  Female  Rich
42   Bachelors     Married         White  …  Male    Rich
37   Some-college  Married         Black  …  Male    Rich
30   Bachelors     Married         Asian  …  Male    Rich
23   Bachelors     Never-married   White  …  Female  Poor
32   Assoc-acdm    Never-married   Black  …  Male    Poor
40   Assoc-voc     Married         Asian  …  Male    Rich
9. Classification Application:
Census Data
Given a set of examples (census data from 1990), each
described by a set of attributes, and labeled as either
rich or poor
Two types of attributes:
• Categorical: attributes that take on one of a set of values (e.g.,
race, marital status)
• Numeric: real-valued attribute
Find a model for the class attribute (wealth) as a function
of the values of other attributes (employment, marital
status, education level, age, …)
Goal: predict the wealth of people not in the training data
Appropriate Applications for
Supervised Learning
Situations in which there is no human expert
Situations where a human can perform the task
but cannot explain how they do it
Situations where the desired function is changing
frequently
Situations where each user needs a customized
function
13. Classification
k-Nearest Neighbor
[Figure: scatter plot of training points from two classes (o and x) with an unlabeled query point (?)]
Classification
k-Nearest Neighbor
[Figure: the same two classes of training points (o and x) with the query point (?) in a different position]
14. Classification
k-Nearest Neighbor
[Figure: two classes of training points (o and x) with the query point (?) among them]
Assign majority class of the k nearest neighbors
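The rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and an unweighted vote; the names `euclidean` and `knn_classify` and the six training points are hypothetical stand-ins for the o and x points in the figures.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(training, query, k=3):
    """Return the majority class among the k training points nearest to query.

    training is a list of (feature_vector, label) pairs.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training points")
    # Brute force: compute the distance from the query to every training point.
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # If two classes tie, Counter's own ordering decides; an odd k avoids
    # ties when there are only two classes.
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data for the two classes.
training = [
    ((1.0, 5.0), "o"), ((2.0, 6.0), "o"), ((1.0, 6.0), "o"),
    ((5.0, 1.0), "x"), ((6.0, 2.0), "x"), ((6.0, 1.0), "x"),
]
print(knn_classify(training, (5.0, 2.0), k=3))
print(knn_classify(training, (2.0, 5.0), k=3))
```

There is no training step: all the work happens at query time, when the distance to every stored example is computed.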
Real World Issues and k-NN
Non-uniform costs?
Missing values?
Noise in class label?
15. Real World Issues and k-NN
Non-uniform costs?
• Weight votes by cost
Missing values?
• Take mean or class mean
Noise in class label?
• Increase k
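Two of these remedies are easy to sketch. The code below is one possible reading, not the course's implementation: `cost_weighted_vote` assumes each neighbor's vote counts as the cost of misclassifying an example of that neighbor's class, and `impute_class_mean` fills a missing value (represented as `None`) with the mean of that attribute over training rows of the same class. All names, costs, and data values are hypothetical.

```python
from collections import defaultdict


def cost_weighted_vote(neighbor_labels, miss_cost):
    """Pick the class with the largest total cost-weighted vote.

    miss_cost[label] is the assumed cost of misclassifying that class.
    """
    totals = defaultdict(float)
    for label in neighbor_labels:
        totals[label] += miss_cost[label]
    return max(totals, key=totals.get)


def impute_class_mean(rows, labels, column):
    """Return copies of rows with None in `column` replaced by the class mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        if row[column] is not None:
            sums[label] += row[column]
            counts[label] += 1
    filled = []
    for row, label in zip(rows, labels):
        new_row = list(row)
        if new_row[column] is None:
            if counts[label] == 0:
                raise ValueError(f"no observed values for class {label!r}")
            new_row[column] = sums[label] / counts[label]
        filled.append(new_row)
    return filled


# Hypothetical costs: missing a fraud is five times worse than a false alarm.
winner = cost_weighted_vote(["fair", "fair", "fraud"],
                            {"fraud": 5.0, "fair": 1.0})

# Hypothetical rows with one missing value in column 1.
rows = [[1.0, 10.0], [3.0, None], [8.0, 50.0]]
filled = impute_class_mean(rows, ["a", "a", "b"], column=1)
print(winner, filled)
```

A plain majority vote would label the first example "fair"; the cost weights flip it to "fraud". Class-mean imputation needs the class label, so it applies to training rows; for an unlabeled query, the overall mean is the fallback the slide mentions.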
k-Nearest Neighbor Issues
Computation: must look at distance of query to
every point
Choosing k
Effect of outliers and noise
Euclidean distance metric
- requires normalization
- problems in high dimensions
- treats all features as equally important
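The need for normalization can be seen in a small sketch. Min-max scaling to [0, 1] is assumed here as one common choice; the slides do not prescribe a method. The function names and the three (age, salary) rows are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column carries no information, so map it to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical rows: (age in years, salary in dollars).
rows = [[25, 40000], [30, 41000], [60, 40500]]

# Raw distances are dominated by salary because its numeric range is larger.
raw_to_similar_age = euclidean(rows[0], rows[1])
raw_to_different_age = euclidean(rows[0], rows[2])

scaled = min_max_normalize(rows)
scaled_to_similar_age = euclidean(scaled[0], scaled[1])
scaled_to_different_age = euclidean(scaled[0], scaled[2])

print(raw_to_similar_age > raw_to_different_age)        # salary dominates
print(scaled_to_similar_age < scaled_to_different_age)  # age now counts too
```

Before scaling, the 25-year-old looks closer to the 60-year-old than to the 30-year-old; after scaling the order reverses. Scaling still treats all features as equally important, so it addresses only the first of the three issues listed.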