Potter's Wheel


A demonstration of the Potter's Wheel tool

Published in: Technology

  1. Potter's Wheel
     MaxQDPro Team: Anjan.K Harish.R Kiran.H.T Vaishak.P Venkatesh Kumar Rathod (II Sem M.Tech CSE)
     05/22/09 MaxQDPro: Potter's Wheel
  2. Outline
     • Automated vs. conventional approaches
     • Data cleaning and transformation revisited
     • Architecture
     • Features
     • Requirements to run
     • How to install?
     • Usage instructions
     • Conclusion
  3. Conventional vs. automated approaches
     • Problems with conventional approaches:
       • Time consuming: many iterations and long waiting periods
       • Users have to write complex transformation scripts
       • Separate tools for auditing and transformation
     • Potter's Wheel approach:
       • Interactive system with instant feedback
       • Integrates both data auditing and transformation
       • Intuitive, spreadsheet-like user interface
  4. Data cleaning revisited
     • Tasks:
       • Fill in missing values
       • Identify outliers and smooth out noisy data
       • Correct inconsistent data
       • Resolve redundancy caused by data integration
     • Data discrepancy detection
     • Data scrubbing
     • Data auditing
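The first two cleaning tasks on slide 4 can be illustrated with a minimal sketch. It is not taken from Potter's Wheel: filling with the mean and flagging values outside the interquartile-range fences are common textbook choices assumed here, and the function names `fill_missing` and `flag_outliers` are hypothetical.

```python
from statistics import mean, quantiles


def fill_missing(values, fill=None):
    """Replace None entries with `fill`, or with the mean of the known values."""
    known = [v for v in values if v is not None]
    if fill is None:
        fill = mean(known)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    ages = [10, 12, None, 13, 12, 95]      # illustrative data only
    cleaned = fill_missing(ages, fill=11)  # explicit fill value for the gap
    print(cleaned)
    print(flag_outliers(cleaned))          # index of the suspicious 95
```

Flagging rather than deleting the outlier matches the interactive spirit of the tool: the user decides what to do with a suspicious value.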
  5. Data transformation revisited
     • Smoothing: remove noise from data
     • Aggregation: summarization, data cube construction
     • Generalization: concept hierarchy climbing
     • Normalization: values scaled to fall within a small, specified range
       • min-max normalization
       • z-score normalization
       • normalization by decimal scaling
     • Attribute/feature construction: new attributes constructed from the given ones
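The three normalization methods named on slide 5 can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not code from the tool; the function names are hypothetical, and the z-score variant assumes the population standard deviation.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to scale
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


if __name__ == "__main__":
    fares = [200, 300, 400, 600, 1000]  # illustrative data only
    print(min_max(fares))
    print(z_score(fares))
    print(decimal_scaling([-986, 917]))
```

Min-max preserves the relative spacing of values, z-score is less sensitive to a single extreme maximum, and decimal scaling only moves the decimal point.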
  6. Architecture
  7. Potter's Wheel: features
     • Instead of writing complex transform specifications with regular expressions or custom programs, the user specifies transforms by example (e.g. splitting)
     • Data auditing is extensible with user-defined domains
       • Parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*"
       • This allows easier detection of logical errors such as false airport codes
       • Potter's Wheel uses the minimum description length (MDL) method to balance the tradeoff between overly general and overly specific structures and to choose an appropriate one
     • Data auditing runs in the background, on the fly (data streaming is also possible)
     • The reorderer allows sorting on the fly
     • The user works only on a view: the real data isn't changed until the user exports the set of transforms (e.g. as a C program) and runs it on the real data
     • Undo is unproblematic: just delete the unwanted transform from the sequence and redo everything else
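The benefit of parsing with domains rather than bare character classes, as in the flight-record example on slide 7, can be sketched as below. This is an illustration only, not the tool's API: the sets `KNOWN_AIRPORTS` and `CLASSES`, the regular expression, and the `audit` function are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical domain definitions; a real deployment would load full lists.
KNOWN_AIRPORTS = {"JFK", "ORD", "SFO", "LAX"}
CLASSES = {"Coach", "Business", "First"}

# Purely syntactic structure: any three capital letters pass as an airport.
RECORD = re.compile(
    r"^(?P<name>[A-Za-z, ]+?) (?P<src>[A-Z]{3}) to (?P<dst>[A-Z]{3}) "
    r"on (?P<date>[A-Za-z]+ [0-9]+, [0-9]+) (?P<cls>[A-Za-z]+)$"
)


def audit(record):
    """Return a list of discrepancies found by the domain checks."""
    m = RECORD.match(record)
    if m is None:
        return ["record does not fit the expected structure"]
    problems = []
    for field in ("src", "dst"):
        if m[field] not in KNOWN_AIRPORTS:
            problems.append(f"unknown airport code: {m[field]}")
    try:
        datetime.strptime(m["date"], "%B %d, %Y")  # Date domain check
    except ValueError:
        problems.append(f"invalid date: {m['date']}")
    if m["cls"] not in CLASSES:
        problems.append(f"unknown class: {m['cls']}")
    return problems


if __name__ == "__main__":
    print(audit("Tayler, Jane, JFK to ORD on April 23, 2000 Coach"))
    print(audit("Tayler, Jane, JFK to XQZ on April 23, 2000 Coach"))
```

The second record satisfies the character-class pattern but is flagged by the airport domain, which is the kind of logical error the slide refers to.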
  8. Data transformation & cleaning
     • It is an interactive tool that tightly integrates data transformation, discrepancy detection, and data analysis
     • Spreadsheet-like interface
     • Checks for discrepancies in the current transformed version of the data and flags them to the user as they are found
     • Users can specify transforms through graphical operations
     • Discrepancy detection is highly customizable: users can define custom domains with specific constraints that values must satisfy
     • The tool automatically infers the structure of the data in terms of these user-defined domains and checks the data for the corresponding constraint violations
     • The transformations can subsequently be stored as C++ or Perl programs, or as A-B-C macros, for later application
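The idea that transforms are recorded as a sequence and applied to a view, so that they can be undone or stored for later application, can be sketched roughly as follows. The class `TransformLog` and the helper transforms are hypothetical names for illustration; the actual tool exports the sequence as a program rather than keeping Python closures.

```python
class TransformLog:
    """Keeps the source rows untouched and replays transforms to build a view."""

    def __init__(self, source_rows):
        self._source = [list(row) for row in source_rows]  # never mutated
        self._transforms = []  # ordered list of (name, row_function)

    def add(self, name, fn):
        self._transforms.append((name, fn))

    def undo(self, name):
        # Undo = drop the transform from the sequence; view() redoes the rest.
        self._transforms = [(n, f) for n, f in self._transforms if n != name]

    def view(self):
        rows = [list(row) for row in self._source]
        for _, fn in self._transforms:
            rows = [fn(row) for row in rows]
        return rows


def split_column(index, sep):
    """Split one column into two at the first occurrence of `sep`."""
    def fn(row):
        left, _, right = row[index].partition(sep)
        return row[:index] + [left.strip(), right.strip()] + row[index + 1:]
    return fn


def upper_column(index):
    """Upper-case the value in one column."""
    def fn(row):
        return row[:index] + [row[index].upper()] + row[index + 1:]
    return fn


if __name__ == "__main__":
    log = TransformLog([["Tayler, Jane", "jfk"]])  # illustrative data only
    log.add("split name", split_column(0, ","))
    log.add("upper airport", upper_column(2))
    print(log.view())
    log.undo("upper airport")
    print(log.view())
```

This sketch does not handle the case where a later transform depends on columns created by the one being removed; a fuller version would have to detect that.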
  9. Data analysis
     • Analysis and transformation go hand in hand
     • Analysis can help identify patterns in the data that can be used to transform it; transformation in turn can convert the data into a format suitable for analysis
     • The tool lets users interactively transform and analyze large data sets, or even infinite data streams, without any waits
     • The tool lets users analyze data by partitioning it and computing aggregates on the partitions
     • The partitioning can be performed recursively along any column of the data
     • The tool also lets users see example data values from any partition using a spreadsheet-like interface
     • Users can sort and scroll interactively along any dimension to explore the data in detail
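Recursive partitioning along columns, with an aggregate computed on each partition as described on slide 9, can be sketched in a few lines. The column names and fares are made-up sample data, and `partition` and `aggregate` are hypothetical helper names rather than anything exposed by the tool.

```python
from collections import defaultdict


def partition(rows, columns):
    """Recursively group rows by each column in turn.

    Returns a nested dict keyed by column values; the leaves are row lists.
    """
    if not columns:
        return rows
    first, rest = columns[0], columns[1:]
    groups = defaultdict(list)
    for row in rows:
        groups[row[first]].append(row)
    return {key: partition(group, rest) for key, group in groups.items()}


def aggregate(tree, fn):
    """Apply `fn` to every leaf partition, keeping the nesting."""
    if isinstance(tree, dict):
        return {key: aggregate(sub, fn) for key, sub in tree.items()}
    return fn(tree)


if __name__ == "__main__":
    flights = [  # illustrative data only
        {"origin": "JFK", "class": "Coach", "fare": 300},
        {"origin": "JFK", "class": "Coach", "fare": 200},
        {"origin": "JFK", "class": "First", "fare": 900},
        {"origin": "ORD", "class": "Coach", "fare": 250},
    ]
    tree = partition(flights, ["origin", "class"])
    print(aggregate(tree, lambda rows: sum(r["fare"] for r in rows)))
    print(aggregate(tree, len))
```

Because the leaves keep the original rows, example values from any partition remain available for inspection, as the slide describes.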
  10. Requirements to run
      • Runs on any platform, since the code is platform independent
      • For running the binaries:
        • JRE 1.1.8 and Swing 1.1.x, or later versions
      • For modifying and compiling:
        • Two components must be built: Java and C++
        • The Java component needs to be compiled using the JDK; JDK 1.3 or later is strongly recommended
        • The C++ component is organized into Visual C++ projects, so it compiles easily with Visual C++ (Visual C++ 6.0)
        • GNU tools such as gcc/make can also be used on Linux
  11. How to install?
      • Installation steps:
        • Extract the distribution archive (e.g. abc13.tar.gz) into a convenient directory, say X, using WinRAR or any other extraction software
        • Add X\1.3\jfe to your CLASSPATH and PATH environment variables:
            PATH: set path=%path%;<JDK or JRE path up to bin>
            CLASSPATH: set classpath=%classpath%;X\1.3\jfe
        • Make sure that a C:\temp directory exists; Potter's Wheel writes its log files (*.jgl) to this directory
  12. Usage instructions
      • Type "java Jfe" at the command prompt to run the application:
          c:\abc13\1.3\jfe>java Jfe
      • The data source can be either a file or an ODBC source, selected with the -file or -odbc option respectively:
          c:\abc13\1.3\jfe>java Jfe -odbc <databasename>
      • Choose the correct source and click OK
      • Start analyzing, cleaning, and transforming the data
  13. Screenshots
  14. Screenshots
  17. Conclusion
      • Problems:
        • Usability of the user interface
        • How does duplicate elimination work?
        • Something of a black-box system
      • General open problems of data cleaning:
        • (Automatic) correction of wrong values
        • Masking wrong values while still keeping them
        • Keeping several possible values at the same time (2*age, 2*birthday)
          • Leads to problems if other values depend on a certain alternative and this turns out to be wrong
        • Maintenance of cleaned data, especially if the sources can't be cleaned
        • A data cleaning framework is desirable
  18. Summary
      • Elementary concepts revisited:
        • Transformation and cleaning
      • Potter's Wheel:
        • Architecture
        • Features
      • Deep insight:
        • Usage
        • Hands-on with the tool