Data Mining Technology
II Sem M.Tech
Need for Dimensionality Reduction
Data Mining elementary concepts
Hands On Problem-Q3
Potter’s Wheel-Data Cleaning Tool
Need for Dimensionality Reduction
It is easy to collect data, but it accumulates
at an unprecedented speed.
Data is not collected only for data mining
Data preprocessing is an important part
of effective machine learning and data mining
Dimensionality reduction is an effective
approach to downsizing data
Learning and data mining techniques may not
be effective for high-dimensional data due to
the curse of dimensionality.
Query accuracy and efficiency degrade rapidly
as the dimension increases.
Visualization: projection of high-dimensional
data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Noise removal: positive effect on query accuracy.
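The degradation with dimension can be made concrete with a short experiment (a NumPy sketch added for illustration, not part of the original slides): as the dimension grows, the contrast between the nearest and farthest random point collapses, which is one reason distance-based queries degrade.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=1000):
    """(max - min) / min of distances from the origin to random unit-cube points."""
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast shrinks as dimensionality grows: in high dimensions all points
# look roughly equidistant, so nearest-neighbour queries lose meaning.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```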
Principal Component Analysis
PCA is a statistical technique used in face recognition
and image compression; it is an unsupervised linear
transformation method.
A common technique for finding patterns in data of
high dimension by mining for its principal components.
Reduces the dimensionality of a data set by finding a
new set of variables, smaller than the original set,
that retains most of the sample's information.
Ex: a high-resolution image transformed to
a low-resolution image.
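As a concrete sketch of this reduction (a NumPy implementation via eigendecomposition of the covariance matrix; an illustrative example, not the slides' own code):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the n x d data matrix X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)           # PCA operates on centred data
    cov = np.cov(X_centered, rowvar=False)    # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return X_centered @ top_k                 # n x k reduced representation

# Example: 100 samples in 5 dimensions reduced to 2,
# keeping the directions of greatest variance.
X = np.random.default_rng(1).normal(size=(100, 5))
Z = pca_reduce(X, 2)
print(Z.shape)            # (100, 2)
```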
Geometric Picture of Principal Components (PCs)
Knowledge Discovery (KDD) Process
Data mining—core of Pattern Evaluation
Data Mining: Confluence of Multiple Disciplines
Learning Data Mining
Question 3 (Exercise 1.3 of Chap. 1 of Han & Kamber)
Suppose your task as a software engineer at Big
University is to design a data mining system to
examine the university course database, which
contains the following information: the name,
address, status, courses taken, and cumulative
grade point average (GPA) of each student.
Describe the architecture you would choose.
What is the purpose of each component of this
architecture?
Data Mining System
[Architecture diagram: university exam DB, response attribution and evaluation, back-office systems interface]
Problems with conventional approaches
Time consuming (many iterations), long waiting times
Users have to write complex transformation scripts
Separate tools for auditing and transformation
Potter’s Wheel approach:
Interactive system, instant feedback
Integration of both data auditing and transformation
Intuitive, spreadsheet-like user interface
Potter’s Wheel - Features
Instead of complex transform specifications with regular expressions or
custom programs, the user specifies transforms by example (e.g. splitting)
Data auditing extensible with user defined domains
Parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]*
<Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]{3}
to [A-Z]{3} on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*"
Allows easier detection of logical errors, e.g. invalid airport codes
Problem: tradeoff between overfitting and underfitting the structure
Potter’s Wheel uses the Minimum Description Length (MDL) principle to
balance this tradeoff and choose an appropriate structure
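The domain idea above can be sketched with Python's `re` module (the domain patterns and the airport list below are illustrative assumptions, not Potter's Wheel's actual implementation):

```python
import re

# Illustrative domain definitions (assumed for this sketch):
DOMAINS = {
    "Airport": r"(?P<{}>[A-Z]{{3}})",
    "Date":    r"(?P<{}>[A-Za-z]+ [0-9]{{1,2}}, [0-9]{{4}})",
    "Class":   r"(?P<{}>Coach|Business|First)",
}

# Structure: "<name>, <Airport> to <Airport> on <Date> <Class>"
pattern = re.compile(
    r"(?P<name>[A-Za-z, ]*?), "
    + DOMAINS["Airport"].format("origin") + " to "
    + DOMAINS["Airport"].format("dest") + " on "
    + DOMAINS["Date"].format("date") + " "
    + DOMAINS["Class"].format("cls")
)

m = pattern.match("Tayler, Jane, JFK to ORD on April 23, 2000 Coach")
print(m.group("origin"), m.group("dest"))   # JFK ORD

# Domains enable semantic checks a plain character-class regex cannot:
VALID_AIRPORTS = {"JFK", "ORD", "LAX"}      # illustrative subset
assert m.group("origin") in VALID_AIRPORTS and m.group("dest") in VALID_AIRPORTS
```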
Data auditing runs in the background on the fly (data streaming also possible)
Reorderer allows sorting on the fly
User only works on a view – the real data isn’t changed until the user exports
the set of transforms, e.g. as a C program, and runs it on the real data
Undo without problems: just delete the unwanted transform from the sequence
and redo everything else
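The view/transform-sequence model behind this painless undo can be sketched as follows (function names and structure are assumptions for illustration, not Potter's Wheel's API):

```python
def apply_transforms(rows, transforms):
    """Apply each transform (a function: value -> value) in order to every row."""
    for t in transforms:
        rows = [t(row) for row in rows]
    return rows

data = ["  Alice  ", "  bob  "]       # the real data: never modified in place
transforms = [str.strip, str.upper]   # the recorded transform sequence

print(apply_transforms(data, transforms))   # ['ALICE', 'BOB'] -- the user's view

# Undo = delete the unwanted transform and replay the rest on pristine data:
transforms.remove(str.upper)
print(apply_transforms(data, transforms))   # ['Alice', 'bob']
print(data)                                 # ['  Alice  ', '  bob  '] -- unchanged
```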
Potter’s Wheel - Conclusion
Usability of User Interface
How does duplicate elimination work?
Kind of a black box system
General Open Problems of Data Cleaning:
(Automatic) correction of wrong values
Mask wrong values but keep them
Keep several possible values at the same time (e.g. 2*age)
Leads to problems if other values depend on a certain alternative
and this turns out to be wrong
Maintenance of cleaned data, especially if the sources
can’t be cleaned
A data cleaning framework is desirable