Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 1: Introduction
Today’s Class
Christof Monz
Data Mining - Week 1: Introduction
Overview of Data Mining
Overview of Machine Learning
Course administrivia
What’s Data Mining?
Data: Records, web pages, documents, etc.
Mining: The process or business of extracting
ore or minerals from the ground (The American
Heritage)
Data Mining: The nontrivial extraction of
implicit, previously unknown, and potentially
useful information from large amounts of data
Why Data Mining?
There is an abundance of data resources:
commercial databases, intranets, the Internet,
. . .
These resources contain a large amount of
valuable data
The best way to structure the data depends on
how one wants to exploit it
Manual data organization is very laborious and
expensive
There is a need to automate this process
Some Application Areas
Customer analysis (what impacts customer
behavior?)
Medical research (what is the impact of
lifestyle/drug effects?)
Insurance (risk assessment)
Stock investment (which factors impact stock
performance?)
Fraud detection (when is a transaction likely to
be fraudulent?)
The Need for Automated Analysis
Much of the available data is never analyzed!
What is and isn’t Data Mining
Looking up John Doe’s phone number and address
in an electronically available phone book
(isn’t Data Mining but database management)
Inferring John Doe’s phone number from an
analysis of a number of web pages, although
this information is not stated explicitly (is
Data Mining)
Situating Data Mining
Data Mining lies at the intersection of a
number of research areas
Data Mining Tasks
Prediction
• Use some variables to predict unknown or future values
of other variables
Description
• Find human-interpretable patterns that describe the data
Some Data Mining Tasks
Classification (Predictive)
Clustering (Descriptive)
Association Rule Discovery (Descriptive)
Sequential Pattern Discovery (Descriptive)
Regression (Predictive)
Deviation Detection (Predictive)
Classification
Given a collection of records (training set)
• Each record contains a set of attributes, one of which
is the class.
Find a model for the class attribute as a function
of the values of the other attributes
Goal: previously unseen records should be
assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model
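A minimal Python sketch of the setup above: a model is learned from a training set and its accuracy is measured on a separate test set. The toy records, the attribute names `income` and `buys`, and the one-attribute majority-vote model are hypothetical and chosen only for illustration; they are not a method prescribed by these slides.

```python
from collections import Counter, defaultdict

def train_one_attribute_model(records, attribute, class_attribute):
    """Learn the most frequent class for each value of a single attribute."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[class_attribute]] += 1
    # Fallback for attribute values never seen in training: overall majority class.
    overall = Counter(record[class_attribute] for record in records)
    default = overall.most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return table, default

def predict(model, record, attribute):
    """Assign a class to a (possibly unseen) record."""
    table, default = model
    return table.get(record[attribute], default)

def accuracy(model, records, attribute, class_attribute):
    """Fraction of records whose predicted class matches the known class."""
    correct = sum(
        predict(model, r, attribute) == r[class_attribute] for r in records
    )
    return correct / len(records)

# Hypothetical toy data: the class attribute is "buys".
training_set = [
    {"income": "high", "buys": "yes"},
    {"income": "high", "buys": "yes"},
    {"income": "low", "buys": "no"},
    {"income": "low", "buys": "no"},
    {"income": "low", "buys": "yes"},
]
test_set = [
    {"income": "high", "buys": "yes"},
    {"income": "low", "buys": "no"},
    {"income": "low", "buys": "yes"},
    {"income": "medium", "buys": "yes"},
]

model = train_one_attribute_model(training_set, "income", "buys")
test_accuracy = accuracy(model, test_set, "income", "buys")
print(test_accuracy)
```

The test records are never shown to the training function, so the reported accuracy reflects how the model handles previously unseen records.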
Example: Direct Marketing
Goal: Reduce cost of mailing by targeting a set
of consumers likely to buy a new cell-phone
product
Approach:
• Use the data for a similar product introduced before
• We know which customers decided to buy and which
decided otherwise. This buy/don’t buy decision forms
the class attribute
• Collect various demographic, lifestyle, and
company-interaction related information about all such
customers (where they stay, how much they earn, . . . )
• Use this information as input attributes to learn a
classifier model
Classify This!
Some Observations
Training data (examples for which the class is
known)
Feature extraction (what are the ‘things’ that
are relevant for predicting a class?)
Feature weight (how important is a feature?)
Feature combination (sometimes features act
together)
Over-fitting (some features don’t generalize
well)
Evaluation (how accurate is the prediction?)
Machine Learning
The research area of machine learning
investigates and formalizes the challenge of
prediction and description by computer
Machine learning plays a central role in data
mining
It is used for:
• Building new models
• Adapting existing models to new situations
• Comparing the performance of competing models
Machine Learning is . . .
. . . the principles, methods, and algorithms for
learning and prediction on the basis of past
experience
. . . already everywhere: speech recognition,
hand-written character recognition, computer
vision, information retrieval, operating systems,
compilers, fraud detection, security, defense
applications, . . .
Learning
Steps
• entertain a (biased) set of possibilities
• adjust predictions based on feedback
• rethink the set of possibilities
Principles of learning are ‘universal’
• society (e.g., scientific community)
• animal (e.g., human)
• machine
Learning and Prediction
We make predictions all the time but rarely
investigate the processes underlying our
predictions
In carrying out scientific research we are also
governed by how theories are evaluated
To automate the process of making predictions
we also need to understand how we
search for and refine ‘theories’
Learning: Key Steps
Data and assumptions
• What data is available for the learning task?
• What can we assume about the problem?
Representation
• How should we represent the examples to be classified?
Evaluation and Estimation
• How well are we doing?
• How do we adjust our predictions based on the
feedback?
• Can we rethink the approach to do even better?
Example
A classification problem: predict the grades for
students taking this course
Key Steps:
1. data
2. assumptions
3. representation
4. estimation
5. evaluation
6. model selection
Example
Key Steps:
1. data: what ‘past experience’ can we rely on?
2. assumptions: what can we assume about the students or
the course?
3. representation: how do we ‘summarize’ a student?
4. estimation: how do we construct a map from students to
grades?
5. evaluation: how well are we predicting?
6. model selection: perhaps we can do even better?
Example: Data
The data we have available (in principle):
• Names and grades of students in past years’ ML courses
• Academic record of past and current students
Training data:
Student   ML   course 1   course 2   . . .
Peter     A    B          A          . . .
David     B    A          A          . . .
Test data:
Student   ML   course 1   course 2   . . .
Jack      ?    C          A          . . .
Kate      ?    A          A          . . .
Assumptions
There are many assumptions we can make to
facilitate predictions:
• The course has remained roughly the same over the years
• Each student performs independently from others
Example: Representation
Academic records are rather diverse so we might
limit the summaries to a select few courses
For example, we can summarize the i-th student
(say David) with a vector: x_i = [B A A]
The available data in this representation:
Training                Testing
Student   ML grade      Student   ML grade
x1        A             x1        ?
x2        B             x2        ?
. . .     . . .         . . .     . . .
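One way to make such a representation concrete is sketched below in Python. The numeric grade scale in `GRADE_POINTS` and the choice to use only course 1 and course 2 as input features (keeping the ML grade aside as the label) are assumptions of this sketch, not something the slides prescribe. The records are the ones from the training table shown earlier.

```python
# Assumed mapping from letter grades to numbers; any monotone scale would do.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

# Assumption: only these courses are used to summarize a student.
FEATURE_COURSES = ["course 1", "course 2"]

def to_vector(record, courses):
    """Summarize a student's record as a numeric vector over the chosen courses."""
    return [GRADE_POINTS[record[course]] for course in courses]

training_records = {
    "Peter": {"ML": "A", "course 1": "B", "course 2": "A"},
    "David": {"ML": "B", "course 1": "A", "course 2": "A"},
}

# Input vectors x_i and labels y_i (the ML grade) in this representation.
training_vectors = {
    name: to_vector(record, FEATURE_COURSES)
    for name, record in training_records.items()
}
training_labels = {name: record["ML"] for name, record in training_records.items()}

print(training_vectors)
print(training_labels)
```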
Example: Estimation
Given the training data
Student   ML grade
x1        A
x2        B
. . .     . . .
find a mapping from input vectors x to ‘labels’
y encoding the grades for the ML course.
Possible solution (nearest neighbor classifier):
1. For any student x in the test set find the ‘closest’
student xi in the training set
2. Predict yi as the grade of the closest student
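The two steps above can be sketched in a few lines of Python. The slides do not say how ‘closest’ is measured; Euclidean distance over numerically encoded grades (A = 4, B = 3, C = 2) is an assumption of this sketch, as are the example vectors, which encode only course 1 and course 2. `math.dist` requires Python 3.8 or later.

```python
import math

def nearest_neighbor_predict(x, training_examples):
    """Return the label y_i of the training vector x_i closest to x.

    training_examples is a list of (vector, label) pairs. Ties are broken
    in favor of the example that appears first in the list.
    """
    _, closest_label = min(
        training_examples, key=lambda example: math.dist(x, example[0])
    )
    return closest_label

# Hypothetical encoding of the training table: (course 1, course 2) -> ML grade.
training_examples = [
    ([3.0, 4.0], "A"),  # Peter
    ([4.0, 4.0], "B"),  # David
]

jack = [2.0, 4.0]  # course 1 = C, course 2 = A
kate = [4.0, 4.0]  # course 1 = A, course 2 = A

print(nearest_neighbor_predict(jack, training_examples))
print(nearest_neighbor_predict(kate, training_examples))
```

With only two training students the predictions are of course not meaningful; the point is only that the classifier stores the training data and defers all work to prediction time.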
Example: Evaluation
How can we tell how good our predictions are?
• We can wait till the end of this course
• We can try to assess the accuracy based on the data we
already have (part of the training data)
Possible solution:
• Divide the training set further into training and test sets
• Evaluate the classifier constructed on the basis of only
the smaller training set on the new test set
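A sketch of this hold-out procedure in Python is given below. The labeled vectors, the 25% hold-out fraction, the fixed random seed, and the use of a Euclidean nearest neighbor classifier are all assumptions made for illustration.

```python
import math
import random

def split_holdout(examples, holdout_fraction=0.25, seed=0):
    """Shuffle the labeled examples and split off a held-out test part."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def nearest_neighbor_predict(x, training_examples):
    """Return the label of the training example closest to x."""
    _, label = min(training_examples, key=lambda ex: math.dist(x, ex[0]))
    return label

def accuracy(training_examples, test_examples):
    """Fraction of test examples whose predicted label matches the known one."""
    correct = sum(
        nearest_neighbor_predict(x, training_examples) == y
        for x, y in test_examples
    )
    return correct / len(test_examples)

# Hypothetical labeled data: (numerically encoded course grades, ML grade).
labeled = [
    ([4.0, 4.0], "A"), ([4.0, 3.0], "A"), ([3.0, 4.0], "A"), ([4.0, 4.0], "A"),
    ([1.0, 2.0], "C"), ([2.0, 1.0], "C"), ([1.0, 1.0], "C"), ([2.0, 2.0], "C"),
]

smaller_training_set, new_test_set = split_holdout(labeled)
estimated_accuracy = accuracy(smaller_training_set, new_test_set)
print(len(smaller_training_set), len(new_test_set), estimated_accuracy)
```

The held-out examples play no role in building the classifier, so the estimate is not inflated by examples the classifier has already seen. On data this small and cleanly separated the estimate is trivially perfect; on real data it would vary with the split.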
Example: Model Selection
We can refine
• the estimation algorithm (e.g., using a classifier other
than the nearest neighbor classifier)
• the representation (e.g., base the summaries on a
different set of courses)
• the assumptions (e.g., perhaps students work in groups)
etc.
We have to rely on the method of evaluating
the accuracy of our predictions to select among
the possible refinements
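As an illustration of selecting among refinements, the Python sketch below compares two candidate representations (summarizing a student by course 1 only or by course 2 only) and keeps the one with the higher accuracy on held-out data. The records, the numeric grade encoding, the separate validation set, and the nearest neighbor classifier are hypothetical choices made for this sketch.

```python
import math

def project(record, courses):
    """Represent a student by the grades in the chosen courses only."""
    return [record[course] for course in courses]

def nearest_neighbor_predict(x, training_examples):
    """Return the label of the training example closest to x (first one on ties)."""
    _, label = min(training_examples, key=lambda ex: math.dist(x, ex[0]))
    return label

def validation_accuracy(courses, training, validation):
    """Accuracy on the validation records under a given representation."""
    train_vectors = [(project(record, courses), y) for record, y in training]
    correct = sum(
        nearest_neighbor_predict(project(record, courses), train_vectors) == y
        for record, y in validation
    )
    return correct / len(validation)

# Hypothetical data: numerically encoded course grades and the ML grade.
training = [
    ({"course 1": 4.0, "course 2": 2.0}, "A"),
    ({"course 1": 4.0, "course 2": 4.0}, "A"),
    ({"course 1": 2.0, "course 2": 4.0}, "C"),
    ({"course 1": 2.0, "course 2": 2.0}, "C"),
]
validation = [
    ({"course 1": 4.0, "course 2": 3.0}, "A"),
    ({"course 1": 2.0, "course 2": 3.0}, "C"),
]

# Candidate refinements of the representation.
candidates = [("course 1",), ("course 2",)]
scores = {c: validation_accuracy(c, training, validation) for c in candidates}
best_representation = max(candidates, key=scores.get)
print(scores, best_representation)
```

The same loop could range over different classifiers or assumptions instead of representations; what matters is that every candidate is scored by the same evaluation method.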
Types of Learning Approaches
Supervised learning: where we get a set of
training inputs and outputs
• E.g., classification, regression
Unsupervised learning: where we are
interested in capturing inherent organization in
the data
• E.g., clustering, density estimation
Reinforcement learning: where we only get
feedback in the form of how well we are doing
(not what we should be doing)
• E.g., planning
Challenges of Data Mining
Scalability
Dimensionality/Complexity
Data quality
Data ownership
Privacy considerations
Continually updated data
Recap
Difference between data mining and other
research areas
Applications of data mining
Need for automation and the use of machine
learning
Key steps in machine learning
About This Course
This course does not:
• give a comprehensive introduction to data mining
• cover how to adapt data mining to specific applications
• cover feature extraction
• cover evaluation issues in detail
This course does:
• focus on the predominant approach in data mining:
machine learning
• sketch some of the example applications
• introduce a representative selection of machine learning
techniques used in data mining
• focus on the algorithmic fundamentals of machine
learning
Approaches Covered
Linear regression (regression)
Decision Trees (classification)
Neural Networks (classification)
k-Nearest-Neighbors (classification)
Naive Bayes (classification)
K-Means (clustering)
Hierarchical Clustering (clustering)
What to get out of this Course
At the end of this course you will have learned:
• what type of problems can be addressed by data mining
techniques
• what the most common machine learning approaches in
data mining are
• which machine learning approaches are appropriate for a
given type of data mining application
• the algorithmic fundamentals of a number of relevant
machine learning approaches
Course Administrivia
Exam counts for 40%, homework counts for
20%, practical assignments count for 40%
Lectures are on Tuesday 9-11am (D1.116)
Tutorials (werkcolleges) are on Thursday
9-11am (G0.05) and Friday 9-11am (G5.29)
Labs are on Thursday 11am-1pm (G0.18)
or Friday 11am-1pm (G0.18)
Course Administrivia
Teaching assistants:
Yijin He (email: jiyinhe@gmail.com)
(English only!)
Spyros Martzoukos (email:
S.Martzoukos@uva.nl) (English only!)
Course web page: on Blackboard
Check course web page regularly for
announcements, slides, . . .
