Venkatesh Kumar Rathod
II Sem M.Tech CSE
05/22/09 MaxQDPro: Potter's Wheel 1
Automated Vs Conventional approaches
Data Cleaning and Transformation Revisited
Requirements to Run
How to install?
Conventional Vs Automated approaches
Problems of conventional approaches:
Time consuming (many iterations), long waiting periods
Users have to write complex transformation scripts
Separate Tools for auditing and transformation
Potter's Wheel approach:
Interactive system, instant feedback
Integration of both data auditing and transformation
Intuitive user interface – spreadsheet-like application
Data Cleaning Revisited
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Data discrepancy detection
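The first two cleaning tasks above can be illustrated with a minimal Python sketch. It is only an illustration, not something Potter's Wheel itself provides: filling with the median and flagging values by their distance from the median are assumptions chosen for brevity, and the function names fill_missing and flag_outliers are hypothetical.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values (assumed policy)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=3):
    """Return indices of values further than k median absolute deviations from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # all values (nearly) identical: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - center) > k * mad]

ages = [21, 23, 22, None, 24, 95]
cleaned = fill_missing(ages)
suspicious = flag_outliers(cleaned)
```

Here the missing age is filled with 23 and the value 95 is flagged for review rather than silently changed.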
Data Transformation Revisited
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
normalization by decimal scaling
New attributes constructed from the given ones
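The two normalization variants named above can be sketched in a few lines of Python. This is a minimal sketch under the usual textbook definitions: decimal scaling divides by the smallest power of ten that brings every absolute value below 1, and min-max scaling maps the observed range onto a target range. The function names are hypothetical.

```python
def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant column")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

scaled, j = decimal_scaling([-986, 217, 45])   # j is 3, values divided by 1000
unit = min_max([10, 20, 30])                   # mapped onto [0, 1]
```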
Potter's Wheel: features
Instead of writing complex transform specifications with regular expressions or
custom programs, the user specifies transforms by example (e.g. splitting)
Data auditing is extensible with user-defined domains
Parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]*
<Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]{3}
to [A-Z]{3} on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*"
Allows easier detection of e.g. logical errors like false airport codes
Potter's Wheel uses the Minimum Description Length (MDL) method to balance this
tradeoff and choose an appropriate structure
Data auditing in background on the fly (data streaming also possible)
Reorderer allows sorting on the fly
The user only works on a view: the real data isn't changed until the user exports
the set of transforms (e.g. as a C program) and runs it on the real data
Undo without problems: just delete unwanted transform from sequence
and redo everything else
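The view and undo behaviour described above can be sketched as follows. This is a minimal Python sketch of the idea only, not the tool's implementation: the View class and the split_on and upper helpers are hypothetical names. The source rows are never modified; the view is recomputed by replaying the transform sequence, so undo is just deleting one entry and replaying the rest.

```python
def split_on(sep, col):
    """Build a transform that splits column `col` at the first occurrence of `sep`."""
    def transform(row):
        left, _, right = row[col].partition(sep)
        return row[:col] + [left.strip(), right.strip()] + row[col + 1:]
    return transform

def upper(col):
    """Build a transform that upper-cases column `col`."""
    def transform(row):
        return row[:col] + [row[col].upper()] + row[col + 1:]
    return transform

class View:
    def __init__(self, source_rows):
        self._source = source_rows   # real data, never modified by the view
        self.transforms = []         # ordered sequence of (name, function)

    def apply(self, name, fn):
        self.transforms.append((name, fn))

    def undo(self, name):
        # Delete the unwanted transform; the others are replayed by rows().
        self.transforms = [(n, f) for n, f in self.transforms if n != name]

    def rows(self):
        out = []
        for row in self._source:
            row = list(row)          # work on a copy
            for _, fn in self.transforms:
                row = fn(row)
            out.append(row)
        return out

source = [["Tayler, Jane", "jfk"]]
view = View(source)
view.apply("upper-airport", upper(1))
view.apply("split-name", split_on(",", 0))
with_both = view.rows()
view.undo("upper-airport")
after_undo = view.rows()
```

One caveat of this simple replay scheme: a later transform that refers to a column created by an earlier one can fail once the earlier one is removed, so a real system has to check the remaining sequence.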
Data Transformation & Cleaning
It is an interactive tool that tightly integrates data transformation,
discrepancy detection, and data analysis.
Spreadsheet-like interface.
Checks for discrepancies in the current transformed version of the data,
and flags them to the user as they are found.
Users can specify transforms through graphical operations
Discrepancy detection is done in a highly customizable manner.
Users can define custom domains with specific constraints that
values must satisfy.
The tool automatically infers the structure of the data in terms of these
user-defined domains, and checks the data against the appropriate constraints.
These transformations can subsequently be stored as C++ or Perl
programs, or A-B-C macros, for later application.
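Domain-based discrepancy detection can be illustrated with a short Python sketch. It is an illustration only and does not reproduce the tool's API: the airport and class lists, the field names, and the audit function are hypothetical, and the structure is written by hand here rather than inferred. Each domain is a predicate that a parsed value must satisfy, so logically wrong values are flagged even when they look syntactically valid.

```python
import re
from datetime import datetime

KNOWN_AIRPORTS = {"JFK", "ORD", "SFO", "LAX"}    # hypothetical sample domain
KNOWN_CLASSES = {"Coach", "Business", "First"}   # hypothetical sample domain

def is_date(text):
    """Date domain: the value must be a real calendar date like 'April 23, 2000'."""
    try:
        datetime.strptime(text, "%B %d, %Y")
        return True
    except ValueError:
        return False

# One constraint (predicate) per field of the structure.
DOMAINS = {
    "src": lambda s: s in KNOWN_AIRPORTS,
    "dst": lambda s: s in KNOWN_AIRPORTS,
    "date": is_date,
    "cls": lambda s: s in KNOWN_CLASSES,
}

# Hand-written structure: <name>, <Airport> to <Airport> on <Date> <Class>
STRUCTURE = re.compile(
    r"^(?P<name>[A-Za-z, ]*?), (?P<src>[A-Z]{3}) to (?P<dst>[A-Z]{3})"
    r" on (?P<date>[A-Za-z]+ [0-9]+, [0-9]+) (?P<cls>[A-Za-z]+)$"
)

def audit(record):
    """Return the names of the fields that violate their domain constraint."""
    match = STRUCTURE.match(record)
    if match is None:
        return ["structure"]
    return [field for field, ok in DOMAINS.items() if not ok(match.group(field))]

ok = audit("Tayler, Jane, JFK to ORD on April 23, 2000 Coach")
bad_airport = audit("Tayler, Jane, JFK to XYZ on April 23, 2000 Coach")
bad_date = audit("Tayler, Jane, JFK to ORD on April 31, 2000 Coach")
```

The second and third records match the plain regular expression but fail a domain check, which is the kind of logical error a pure pattern match would miss.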
Analysis and transformation go hand in hand
Analysis can help to identify patterns in the data that can be
used to transform it, which in turn can convert it into a format
suitable for analysis.
Tool allows users to interactively transform and analyze large
data sets or even infinite data streams, without any waits.
Tool allows users to analyze data by partitioning it and
computing aggregates on the partitions.
The partitioning can be performed recursively along any
column of the data
Tool also allows users to see example data values from any
partition using a spreadsheet-like interface.
User can sort and scroll interactively along any dimension to
explore the data in detail.
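Recursive partitioning with aggregates can be sketched in Python as follows. This is a minimal sketch of the general idea, assuming rows are dictionaries and that count and sum are the aggregates of interest; the partition function and the sample flight data are hypothetical.

```python
from collections import defaultdict

def partition(rows, columns, value_col):
    """Partition rows by each column in turn; aggregate `value_col` at the leaves."""
    if not columns:
        values = [r[value_col] for r in rows]
        return {"count": len(values), "sum": sum(values)}
    first, rest = columns[0], columns[1:]
    groups = defaultdict(list)
    for r in rows:
        groups[r[first]].append(r)
    # Recurse into each partition along the remaining columns.
    return {key: partition(group, rest, value_col) for key, group in groups.items()}

flights = [
    {"src": "JFK", "cls": "Coach", "fare": 200},
    {"src": "JFK", "cls": "First", "fare": 900},
    {"src": "ORD", "cls": "Coach", "fare": 150},
    {"src": "JFK", "cls": "Coach", "fare": 250},
]
summary = partition(flights, ["src", "cls"], "fare")
```

The result is a nested dictionary keyed first by source airport and then by class, with the count and total fare of each partition at the leaves.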
Requirements to Run
Runs on any platform since code is platform independent.
For Running Binaries
JRE 1.1.8 or later and Swing 1.1.x or later
For Modifying and Compiling
Requires two components: Java and C++.
The Java component needs to be compiled using the JDK;
C++ component is organized into Visual C++ projects, so will
compile easily using Visual C++ (Visual C++ 6.0).
GNU tools like gcc/make can also be used on Linux platform.
JDK 1.3 or later is strongly recommended.
How to install?
Extract abc13.zip or abc13.tar.gz into a convenient
directory using WinRAR or any other extraction tool.
Add X\1.3\jfe to your CLASSPATH and PATH
PATH: set path=%path%;<path to the JDK or JRE bin directory>
CLASSPATH: set classpath=%classpath%;X\1.3\jfe
Make sure that a C:\temp directory exists. Potter's Wheel
will write log files *.jgl to this directory.
Type "java Jfe" at the command prompt to run the tool.
The data source can be either a file or an ODBC source,
specified with the -file or -odbc option
c:\abc13\1.3\jfe>java Jfe -odbc <databasename>
Choose the correct source and click OK.
Start analyzing, cleaning, and transforming data.
Usability of User Interface
How does duplicate elimination work?
Kind of a black box system
General Open Problems of Data Cleaning:
(Automatic) correction of wrong values
Mask wrong values but keep them
Keep several possible values at the same time (2*age).
Leads to problems if other values depend on a certain
alternative and this turns out to be wrong
Maintenance of cleaned data, especially if sources
can't be cleaned
Data cleaning framework desirable
Elementary Concepts Revisited:
Transformation and Cleaning
Hands On the tools