Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 1: Introduction
Today’s Class
Overview of Data Mining
Overview of Machine Learning
Course administrivia
What’s Data Mining?
Data: Records, web pages, documents, etc.
Mining: The process or business of extracting
ore or minerals from the ground (The American
Heritage Dictionary)
Data Mining: The nontrivial extraction of
implicit, previously unknown, and potentially
useful information from large amounts of data
Why Data Mining?
There is an abundance of data resources:
commercial databases, intranets, the Internet,
. . .
These resources contain a large amount of
valuable data
The best way to structure the data depends on
how one wants to exploit it
Manual data organization is very laborious and
expensive
There is a need to automate this process
Some Application Areas
Customer analysis (what impacts customer
behavior?)
Medical research (what are the effects of
lifestyle and drugs?)
Insurance (risk assessment)
Stock investment (which factors impact stock
performance?)
Fraud detection (when is a transaction likely to
be fraudulent?)
The Need for Automated Analysis
Much of the available data is never analyzed!
What is and isn’t Data Mining
Looking up John Doe’s phone number and
address in an electronic phone book (not Data
Mining, but database management)
Inferring John Doe’s phone number from a
number of web pages, although none of them
states it explicitly (Data Mining)
Situating Data Mining
Data Mining lies at the intersection of a
number of research areas
Data Mining Tasks
Prediction
• Use some variables to predict unknown or future values
of other variables
Description
• Find human-interpretable patterns that describe the data
Some Data Mining Tasks
Classification (Predictive)
Clustering (Descriptive)
Association Rule Discovery (Descriptive)
Sequential Pattern Discovery (Descriptive)
Regression (Predictive)
Deviation Detection (Predictive)
Classification
Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
Find a model for the class attribute as a function of
the values of the other attributes
Goal: previously unseen records should be
assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model
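As a minimal sketch of this setup: the attribute names (income, owns_phone, buys) and all records are hypothetical, and the learner is a deliberately simple one that is not taken from the slides. It predicts the most common class seen for each value of a single attribute, and the test set is then used to measure accuracy.

```python
from collections import Counter, defaultdict


def train_one_attribute_rule(records, attribute, class_attribute):
    """Learn a model that maps each value of `attribute` to its most common class."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[class_attribute]] += 1
    # Fallback for attribute values never seen in training: overall majority class.
    default = Counter(r[class_attribute] for r in records).most_common(1)[0][0]
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda record: rule.get(record[attribute], default)


def accuracy(model, records, class_attribute):
    """Fraction of records whose predicted class equals the known class."""
    correct = sum(model(r) == r[class_attribute] for r in records)
    return correct / len(records)


# Hypothetical training set: the class attribute is "buys".
training_set = [
    {"income": "high", "owns_phone": "yes", "buys": "yes"},
    {"income": "high", "owns_phone": "no", "buys": "yes"},
    {"income": "low", "owns_phone": "yes", "buys": "no"},
    {"income": "low", "owns_phone": "no", "buys": "no"},
]
# Hypothetical test set: records the model has not seen during training.
test_set = [
    {"income": "high", "owns_phone": "yes", "buys": "yes"},
    {"income": "low", "owns_phone": "no", "buys": "yes"},
]

model = train_one_attribute_rule(training_set, "income", "buys")
test_accuracy = accuracy(model, test_set, "buys")
print(test_accuracy)  # 0.5: one of the two test records is classified correctly
```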
Example: Direct Marketing
Goal: Reduce cost of mailing by targeting a set
of consumers likely to buy a new cell-phone
product
Approach:
• Use the data for a similar product introduced before
• We know which customers decided to buy and which
decided otherwise. This buy/don’t buy decision forms
the class attribute
• Collect various demographic, lifestyle, and
company-interaction related information about all such
customers (where they live, how much they earn, . . . )
• Use this information as input attributes to learn a
classifier model
Classify This!
Some Observations
Training data (examples for which the class is
known)
Feature extraction (what are the ’things’ that
are relevant to predict a class?)
Feature weight (how important is a feature?)
Feature combination (sometimes features act
together)
Over-fitting (some features don’t generalize
well)
Evaluation (how accurate is the prediction?)
Machine Learning
The research area of machine learning
investigates and formalizes the challenge of
prediction and description by computer
Machine learning plays a central role in data
mining
It is used for:
• Building new models
• Adapting existing models to new situations
• Comparing the performance of competing models
Machine Learning is . . .
. . . the principles, methods, and algorithms for
learning and prediction on the basis of past
experience
. . . already everywhere: speech recognition,
hand-written character recognition, computer
vision, information retrieval, operating systems,
compilers, fraud detection, security, defense
applications, . . .
Learning
Steps
• entertain a (biased) set of possibilities
• adjust predictions based on feedback
• rethink the set of possibilities
Principles of learning are ‘universal’
• society (e.g., scientific community)
• animal (e.g., human)
• machine
Learning and Prediction
We make predictions all the time but rarely
investigate the processes underlying our
predictions
In carrying out scientific research we are also
governed by how theories are evaluated
To automate the process of making predictions
we also need to understand how we search for
and refine ‘theories’
Learning: Key Steps
Data and assumptions
• What data is available for the learning task?
• What can we assume about the problem?
Representation
• How should we represent the examples to be classified?
Evaluation and Estimation
• How well are we doing?
• How do we adjust our predictions based on the
feedback?
• Can we rethink the approach to do even better?
Example
A classification problem: predict the grades for
students taking this course
Key Steps:
1. data
2. assumptions
3. representation
4. estimation
5. evaluation
6. model selection
Example
Key Steps:
1. data: what ‘past experience’ can we rely on?
2. assumptions: what can we assume about the students or
the course?
3. representation: how do we ‘summarize’ a student?
4. estimation: how do we construct a map from students to
grades?
5. evaluation: how well are we predicting?
6. model selection: perhaps we can do even better?
Example: Data
The data we have available (in principle):
• Names and grades of students in past years’ ML courses
• Academic record of past and current students
Training data:
Student   ML   course 1   course 2   . . .
Peter     A    B          A          . . .
David     B    A          A          . . .
Test data:
Student   ML   course 1   course 2   . . .
Jack      ?    C          A          . . .
Kate      ?    A          A          . . .
Assumptions
There are many assumptions we can make to
facilitate predictions:
• The course has remained roughly the same over the years
• Each student performs independently from others
Example: Representation
Academic records are rather diverse so we might
limit the summaries to a select few courses
For example, we can summarize the i-th student
(say David) with a vector: xi = [B A A]
The available data in this representation:
Training                 Testing
Student   ML grade       Student   ML grade
x1        A              x1        ?
x2        B              x2        ?
. . .     . . .          . . .     . . .
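One way to make such a summary concrete is sketched below. The numeric grade encoding (A = 4 down to F = 0), the helper name `represent`, and the choice of exactly two courses are assumptions made for illustration; the slides only state that a student is summarized by a vector of grades in a few selected courses.

```python
# Assumed numeric encoding of letter grades (not specified in the slides).
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def represent(record, courses):
    """Summarize a student's academic record as a vector over the selected courses."""
    return [GRADE_POINTS[record[course]] for course in courses]


# Limit the summary to a select few courses.
selected_courses = ["course 1", "course 2"]

# Records of the two training students, using the course names from the data table.
peter = {"course 1": "B", "course 2": "A"}
david = {"course 1": "A", "course 2": "A"}

x_peter = represent(peter, selected_courses)
x_david = represent(david, selected_courses)
print(x_peter, x_david)  # [3, 4] [4, 4]
```

A numeric vector is only one option; it is convenient here because it allows a distance between students to be computed later.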
Example: Estimation
Given the training data
Student   ML grade
x1        A
x2        B
. . .     . . .
find a mapping from input vectors x to ‘labels’
y encoding the grades for the ML course.
Possible solution (nearest neighbor classifier):
1. For any student x in the test set find the ‘closest’
student xi in the training set
2. Predict yi, the grade of that closest student, as the
grade of x
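The two steps above can be sketched as follows. The slides do not say how ‘closest’ is measured, so the distance used here (sum of absolute grade-point differences, with A = 4, B = 3, C = 2) is an assumption, as is representing each student by the grades in course 1 and course 2 from the data table.

```python
def distance(x, x_i):
    """Assumed notion of closeness: sum of absolute grade-point differences."""
    return sum(abs(a - b) for a, b in zip(x, x_i))


def predict_nearest_neighbor(x, training_set):
    """Step 1: find the closest training student. Step 2: return that student's label."""
    closest_x, closest_y = min(training_set, key=lambda pair: distance(x, pair[0]))
    return closest_y


# Training students as (vector, ML grade), with vectors [course 1, course 2].
training_set = [
    ([3, 4], "A"),  # Peter: course 1 = B, course 2 = A, ML grade A
    ([4, 4], "B"),  # David: course 1 = A, course 2 = A, ML grade B
]
# Test students, whose ML grade is unknown.
test_set = {
    "Jack": [2, 4],  # course 1 = C, course 2 = A
    "Kate": [4, 4],  # course 1 = A, course 2 = A
}

predictions = {name: predict_nearest_neighbor(x, training_set)
               for name, x in test_set.items()}
print(predictions)  # {'Jack': 'A', 'Kate': 'B'}
```

With only two training students the predictions simply copy Peter's or David's grade; the point of the sketch is the mechanism, not the quality of the result.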
Example: Evaluation
How can we tell how good our predictions are?
• We can wait till the end of this course
• We can try to assess the accuracy based on the data we
already have (part of the training data)
Possible solution:
• Divide the training set further into training and test sets
• Construct the classifier from the smaller training set
only and evaluate it on the new test set
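A minimal sketch of this hold-out idea, reusing a nearest neighbor classifier: the labeled examples, the distance function, and the fixed split point are all hypothetical choices for illustration. In practice the labeled data would usually be shuffled before splitting.

```python
def distance(x, x_i):
    """Assumed notion of closeness: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, x_i))


def predict_nearest_neighbor(x, training_set):
    """Return the label of the training example closest to x."""
    closest_x, closest_y = min(training_set, key=lambda pair: distance(x, pair[0]))
    return closest_y


def holdout_accuracy(labeled, n_train):
    """Train on the first n_train examples, evaluate on the remaining ones."""
    smaller_training_set = labeled[:n_train]
    new_test_set = labeled[n_train:]
    correct = sum(predict_nearest_neighbor(x, smaller_training_set) == y
                  for x, y in new_test_set)
    return correct / len(new_test_set)


# Hypothetical labeled data: (grade vector, known ML grade).
labeled = [
    ([4, 4], "A"),
    ([2, 2], "C"),
    ([4, 3], "A"),
    ([2, 3], "C"),
    ([3, 4], "C"),
]

accuracy = holdout_accuracy(labeled, n_train=2)
print(round(accuracy, 2))  # 0.67: two of the three held-out students are predicted correctly
```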
Example: Model Selection
We can refine
• the estimation algorithm (e.g., using a classifier other
than the nearest neighbor classifier)
• the representation (e.g., base the summaries on a
different set of courses)
• the assumptions (e.g., perhaps students work in groups)
• etc.
We have to rely on the method of evaluating
the accuracy of our predictions to select among
the possible refinements
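As an illustration of selecting among refinements by evaluated accuracy, the sketch below compares two candidate representations (summaries based on different courses) with the same nearest neighbor classifier. The data, the candidate names, and the helper functions are hypothetical.

```python
def distance(x, x_i):
    """Assumed notion of closeness: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, x_i))


def predict_nearest_neighbor(x, training_set):
    """Return the label of the training example closest to x."""
    closest_x, closest_y = min(training_set, key=lambda pair: distance(x, pair[0]))
    return closest_y


def project(x, indices):
    """Keep only the selected courses of a grade vector."""
    return [x[i] for i in indices]


def holdout_accuracy(train, held_out, indices):
    """Accuracy on held-out data when only the courses in `indices` are used."""
    projected_train = [(project(x, indices), y) for x, y in train]
    correct = sum(predict_nearest_neighbor(project(x, indices), projected_train) == y
                  for x, y in held_out)
    return correct / len(held_out)


# Hypothetical data: vectors are [course 1, course 2] grade points.
train = [([4, 2], "A"), ([2, 4], "C")]
held_out = [([4, 4], "A"), ([2, 2], "C")]

# Candidate refinements of the representation.
candidates = {"course 1 only": [0], "course 2 only": [1]}

scores = {name: holdout_accuracy(train, held_out, indices)
          for name, indices in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)  # {'course 1 only': 1.0, 'course 2 only': 0.0} course 1 only
```

The same loop could range over different classifiers or assumptions instead of representations; what matters is that every candidate is scored with the same evaluation method.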
Types of Learning Approaches
Supervised learning: where we get a set of
training inputs and outputs
• E.g., classification, regression
Unsupervised learning: where we are
interested in capturing inherent organization in
the data
• E.g., clustering, density estimation
Reinforcement learning: where we only get
feedback in the form of how well we are doing
(not what we should be doing)
• E.g., planning
Challenges of Data Mining
Scalability
Dimensionality/Complexity
Data quality
Data ownership
Privacy considerations
Continually updated data
Recap
Difference between data mining and other
research areas
Applications of data mining
Need for automation and the use of machine
learning
Key steps in machine learning
About This Course
This course does not:
• give a comprehensive introduction to data mining
• cover how to adapt data mining to specific applications
• cover feature extraction
• cover evaluation issues in detail
This course does:
• focus on the predominant approach in data mining:
machine learning
• sketch some of the example applications
• introduce a representative selection of machine learning
techniques used in data mining
• focus on the algorithmic fundamentals of machine
learning
Approaches Covered
Linear regression (regression)
Decision Trees (classification)
Neural Networks (classification)
k-Nearest-Neighbors (classification)
Naive Bayes (classification)
K-Means (clustering)
Hierarchical Clustering (clustering)
What to get out of this Course
At the end of this course you will have learned:
• what type of problems can be addressed by data mining
techniques
• what the most common machine learning approaches in
data mining are
• which machine learning approaches are appropriate for a
given type of data mining application
• the algorithmic fundamentals of a number of relevant
machine learning approaches
Course Administrivia
Exam counts for 40%, homework counts for
20%, practical assignments count for 40%
Lectures are on Tuesday 9-11am (D1.116)
Tutorials (werkcolleges) are on Thursday
9-11am (G0.05) and Friday 9-11am (G5.29)
Labs are on Thursday 11am-1pm (G0.18)
or Friday 11am-1pm (G0.18)
Course Administrivia
Teaching assistants:
Yijin He (email: jiyinhe@gmail.com) (English only!)
Spyros Martzoukos (email: S.Martzoukos@uva.nl) (English only!)
Course web page: on Blackboard
Check course web page regularly for
announcements, slides, . . .