Introduction to Machine Learning and Data Mining


                            Prof. Carla Brodley
                             Computer Science
                                 Tufts University




                                                      Fall 2009




Course Overview


  Syllabus

  Goals

  Evaluation

  Deadlines




 Machine Learning   Carla Brodley, Tufts University                   2




Course Objectives
    The goal of this course is to introduce students to current
     machine learning and data mining methods. It is intended
     to prepare students for upper-level courses and to give
     them the knowledge to apply machine learning/data
     mining to science, medicine and engineering. In
     particular, students will gain:


       •  A general background in the state of the art in ML
       •  Experience in how to conduct experiments and
          evaluate learning performance
       •  Knowledge of how to use and extend current publicly
          available packages
       •  An introduction to reading research papers




Tom Mitchell’s Definition of Learning

    A computer program is said to learn from experience E
     with respect to some class of tasks T and performance
     measure P, if its performance at tasks in T, as measured
     by P, improves with experience E.


    Example 1: Learn to play checkers
       •  T: play checkers and win
       •  P: % of games won in the world tournament
       •  E: opportunity to play against self.


    Example 2: Learn to detect SPAM
       •  T: Distinguish between SPAM and HAM
       •  P: % of emails correctly classified
       •  E: Labeled emails from your friend Robin




Knowledge Discovery in Databases

  The process of extracting valid, previously
 unknown, and ultimately comprehensible
 information from large databases








What is Data Mining?




                    Figure is from Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy,
                    Advances in Knowledge Discovery and Data Mining, 1996.





Learning from Data

  Supervised Learning: each example has a label
 (discrete or continuous)
  Reinforcement Learning: feedback after a
 sequence of actions/decisions
  Unsupervised Learning: no feedback, goal is to
 group data into similar groups








Supervised Learning:
Classification

  Given a set of examples (training data), each
 described by a set of attributes, and labeled with
 a class
  Find a model for the class attribute as a function
 of the values of other attributes


  Goal: classify previously unseen data accurately







Classification Example
     Training Set
     (Refund and Marital Status are categorical, Taxable Income is
     continuous, Cheat is the class)

     Tid  Refund  Marital Status  Taxable Income  Cheat
     1    Yes     Single          125K            No
     2    No      Married         100K            No
     3    No      Single          70K             No
     4    Yes     Married         120K            No
     5    No      Divorced        95K             Yes
     6    No      Married         60K             No
     7    Yes     Divorced        220K            No
     8    No      Single          85K             Yes
     9    No      Married         75K             No
     10   No      Single          90K             Yes

     Test Set

     Refund  Marital Status  Taxable Income  Cheat
     No      Single          75K             ?
     Yes     Married         50K             ?
     No      Married         150K            ?
     Yes     Divorced        90K             ?
     No      Single          40K             ?
     No      Married         80K             ?

     Training Set -> Learn Classifier -> Model <- Test Set

     Adapted from slides by Tan, Steinbach and Kumar
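To make the train/test split concrete, here is a minimal sketch that encodes the two tables above and applies a model to them. The rule inside `predict_cheat` is a hand-written, hypothetical model chosen for illustration; it is not given on the slide and it was not produced by a learning algorithm. Incomes are written in thousands of dollars.

```python
# Training set from the slide: (refund, marital_status, income_k, cheat)
TRAIN = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

# Test set from the slide: the class label is unknown.
TEST = [
    ("No",  "Single",    75),
    ("Yes", "Married",   50),
    ("No",  "Married",  150),
    ("Yes", "Divorced",  90),
    ("No",  "Single",    40),
    ("No",  "Married",   80),
]


def predict_cheat(refund, marital_status, income_k):
    """Hypothetical hand-written model (an assumption, not from the slide)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"


def accuracy(rows):
    """Fraction of labeled rows on which the model agrees with the label."""
    correct = sum(predict_cheat(r, m, inc) == label for r, m, inc, label in rows)
    return correct / len(rows)


train_accuracy = accuracy(TRAIN)
test_predictions = [predict_cheat(*row) for row in TEST]
print(train_accuracy)    # 1.0
print(test_predictions)  # ['No', 'No', 'No', 'No', 'No', 'No']
```

Perfect accuracy on the training set says nothing by itself about accuracy on unseen data; the test labels would be needed to measure that.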




         Classification Applications:

             Fraud Detection: Predict fraudulent cases in credit card
             transactions for a particular account
               •  Training data: previous transactions of a particular account
                  holder
               •  Attributes: time of purchase, product type, cost, location, etc.
               •  Class: Label transactions as fraud or fair


             Telephone Operator Support: determine whether an
             arbitrary caller said “yes” or “no”.
               •  Training data: Signal of caller’s voice accepting or declining an
                  offer.
               •  Attributes: features computed from the signal
               •  Class: Label each signal as “yes” or “no”







Supervised Learning:
Regression
  Predict the value of a given continuous-valued
 variable based on the values of the attributes
  Well studied in statistics and neural networks; a recent
 focus in Machine Learning is on non-linear
 models using SVMs
  Examples:
   •  Predicting sales amounts of new product based on
      advertising expenditure
   •  Predicting the rating you would give a movie on Netflix
   •  Time series prediction of stock market indices
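As a minimal sketch of regression, the code below fits a straight line by ordinary least squares to the first example (sales as a function of advertising expenditure). This is the simplest linear case, not the non-linear SVM models mentioned above, and the numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (arbitrary units).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for an unseen spend level
print(slope, intercept, predicted_sales)   # 2.0 1.0 13.0
```

The output is a continuous value rather than a class label, which is what separates regression from classification.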






Unsupervised Learning:
Clustering
  Given a set of data points, each described by a
  set of attributes, find clusters such that:

   •  Intra-cluster similarity is maximized
   •  Inter-cluster similarity is minimized

   [Figure: scatter plot with axes F1 and F2 showing two groups of points]

  Requires the definition of a similarity measure
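A good clustering puts points that are close together in the same cluster and points that are far apart in different clusters. The sketch below checks this for a given assignment, using Euclidean distance as the (dis)similarity measure; that choice and the sample points in the (F1, F2) plane are assumptions made for illustration.

```python
import math
from itertools import combinations, product


def mean_intra_distance(cluster):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical points in the (F1, F2) feature space.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

intra = max(mean_intra_distance(cluster_a), mean_intra_distance(cluster_b))
inter = mean_inter_distance(cluster_a, cluster_b)
print(intra < inter)  # True: small distances within clusters, large between
```

With a different similarity measure the same data can produce different clusters, which is why the measure has to be defined first.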






Clustering Applications
 Goal: divide customers into distinct groups based
 on behavior or demographics in order to
 select a marketing target

   •  Training data: detailed records of transactions,
      web behavior, demographics, etc.
   •  Attributes: Web pages visited, call frequency, length
      of call, financial status, marital status, size of
      investment, etc.


  Online recommender systems (Netflix, Amazon,
 Perseus Digital Library)





Clustering Application:
Energy Use Profiles
 Goal: identify similar energy-use customer
 profiles to improve billing scheme

   •  Training data: energy use profiles of commercial
      customers
   •  Attributes: time series of energy usage
             Cust   12:00   1:00         …
             1      45.5    65.2         …
             2      34.2    76.3         …



 Examine customer demographics of each cluster




What types of data are there?

  What types of features describe each example?
   •  Discrete: town that you live in (e.g., Somerville,
      Medford, Boston, Cambridge)
   •  Continuous: salary
   •  Ordinal: age
   •  Relational: sister of


  How are data points related?
   •  Independent: each represents a different student
   •  Not independent: financial indicators for a particular day
      are related to the previous day





Classification:
Example Dataset

     Age            Education      Marital Status               Race          Gender   Status
       39           Bachelors      Never-married                White     …   Male     Poor
       50           Bachelors      Married                      White     …   Male     Poor
       38           HS-grad        Divorced                     White     …   Male     Poor
       53           11th           Married                      Black     …   Male     Poor
       28           Bachelors      Married                      Black     …   Female   Poor
       37           Masters        Married                      White     …   Female   Poor
       52           HS-grad        Married                      White     …   Male     Rich
       31           Masters        Never-married                White     …   Female   Rich
       42           Bachelors      Married                      White     …   Male     Rich
       37           Some-college   Married                      Black     …   Male     Rich
       30           Bachelors      Married                      Asian     …   Male     Rich
       23           Bachelors      Never-married                White     …   Female   Poor
       32           Assoc-acdm     Never-married                Black     …   Male     Poor
       40           Assoc-voc      Married                      Asian     …   Male     Rich








Classification Application:
Census Data
    Given a set of examples (census data from 1990), each
     described by a set of attributes, and labeled as either {rich
     or poor}
    Two types of attributes:
       •  Categorical: attributes that take on one of a set of values (e.g.,
          race, marital status)
       •  Numeric: real-valued attribute
    Find a model for the class attribute (wealth) as a function
     of the values of other attributes (employment, marital
     status, education level, age, …)
    Goal: predict the wealth of people not in the training data







Appropriate Applications for
Supervised Learning

  Situations in which there is no human expert
  Situations where a human can perform the task
     but cannot explain how they do it
  Situations where the desired function is changing
     frequently
  Situations where each user needs a customized
     function







An Example Learning Problem

                   Inst.   X1   X2          X3                 X4   y
                   1       0    0           1                  0    0
                   2       0    1           0                  0    0
                   3       0    0           1                  1    1
                   4       1    0           0                  1    1
                   5       0    1           1                  0    0
                   6       1    1           0                  0    0
                   7       0    1           0                  1    0
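One way to explore this table is to test candidate functions against it. The sketch below checks a hypothetical hypothesis, y = X4 AND NOT X2 (chosen for illustration, not given on the slide), and then counts how many Boolean functions of four inputs agree with all seven examples. Only 7 of the 16 possible inputs are labeled, so the data alone cannot single out one function.

```python
from itertools import product

# The seven labeled examples from the table: ((X1, X2, X3, X4), y)
EXAMPLES = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]


def candidate(x1, x2, x3, x4):
    """Hypothetical hypothesis (an assumption): y = X4 AND NOT X2."""
    return int(x4 == 1 and x2 == 0)


candidate_is_consistent = all(candidate(*x) == y for x, y in EXAMPLES)

# Every Boolean function of 4 inputs is a truth table with 16 output bits.
INPUTS = list(product((0, 1), repeat=4))
consistent_count = 0
for outputs in product((0, 1), repeat=len(INPUTS)):
    table = dict(zip(INPUTS, outputs))
    if all(table[x] == y for x, y in EXAMPLES):
        consistent_count += 1

print(candidate_is_consistent)  # True
print(consistent_count)         # 512, i.e. 2 ** (16 - 7)
```

Choosing among the 512 consistent functions requires some preference or restriction beyond the training data.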



Classification
k-Nearest Neighbor


   [Figure: training points from two classes, o and x, in separate groups;
   the query point ? lies next to the x group]








Classification
k-Nearest Neighbor

   [Figure: the same two groups of o and x points;
   the query point ? now lies next to the o group]








Classification
k-Nearest Neighbor


   [Figure: the query point ? lies between the groups, with both o and x
   points among its nearest neighbors]



     Assign majority class of the k nearest neighbors
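The rule above can be sketched in a few lines. This version assumes numeric feature vectors and Euclidean distance; the training points are made up, and ties in the vote are broken arbitrarily, so an odd k is a sensible choice for two classes.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Return the majority class among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (feature vector, class label)
train = [
    ((1.0, 1.0), "o"), ((1.0, 2.0), "o"), ((2.0, 1.5), "o"),
    ((6.0, 5.0), "x"), ((7.0, 5.5), "x"), ((6.5, 6.0), "x"),
]

near_x = knn_classify(train, (6.0, 4.0), k=3)
near_o = knn_classify(train, (2.0, 2.5), k=3)
print(near_x, near_o)  # x o
```

There is no training step: the classifier stores the examples and does all of its work at query time.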






Real World Issues and k-NN

  Non-uniform costs?
   •  Weight votes by cost
  Missing values?
   •  Replace with the attribute mean or the class mean
  Noise in class label?
   •  Increase k
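Two of these remedies can be sketched together. In the code below, a single mislabeled point fools k = 1 but is outvoted at k = 3, and a per-class vote weight stands in for misclassification cost. Reading "weight votes by cost" as scaling each neighbor's vote by the cost of missing its class is one interpretation; the data and the weight of 3.0 are made up.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k, vote_weight=None):
    """k-NN where each neighbor's vote is scaled by a per-class weight."""
    vote_weight = vote_weight or {}
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    totals = defaultdict(float)
    for _, label in neighbors:
        totals[label] += vote_weight.get(label, 1.0)
    return max(totals, key=totals.get)


# Hypothetical data: a group of x points with one noisy "o" label inside it.
train = [
    ((5.0, 5.0), "x"), ((5.0, 6.0), "x"), ((6.0, 5.0), "x"), ((6.0, 6.0), "x"),
    ((5.5, 5.5), "o"),  # mislabeled point
]
query = (5.4, 5.4)

with_k1 = weighted_knn(train, query, k=1)  # the noisy neighbor decides
with_k3 = weighted_knn(train, query, k=3)  # the noisy neighbor is outvoted
costly_o = weighted_knn(train, query, k=3, vote_weight={"o": 3.0})
print(with_k1, with_k3, costly_o)  # o x o
```

Raising k smooths over label noise, but a k that is too large lets distant points from the more common class dominate the vote.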








k-Nearest Neighbor Issues
  Computation: must look at distance of query to
  every point

  Choosing k

  Effect of outliers and noise

  Euclidean distance metric
   - requires normalization
   - problems in high dimensions
   - treats all features as equally important
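The need for normalization is easy to demonstrate. In this sketch, with made-up (age, salary) records, salary differences dominate the raw Euclidean distance because salaries are numerically much larger than ages; min-max scaling each feature to [0, 1] changes which record is nearest. Min-max scaling is one common choice, used here as an assumption.

```python
import math


def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def nearest_index(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))


# Hypothetical records: (age in years, salary in dollars)
people = [
    (25, 50_000),  # query record
    (27, 51_000),  # similar age, similar salary
    (60, 50_200),  # very different age, nearly identical salary
    (40, 60_000),
]

raw_nearest = nearest_index(people, 0)
scaled_nearest = nearest_index(min_max_scale(people), 0)
print(raw_nearest, scaled_nearest)  # 2 1
```

Scaling still treats all features as equally important after normalization; it does not solve the feature weighting or high-dimensionality problems listed above.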




