Potter's Wheel: An Interactive Data Cleaning System
(Raman and Hellerstein, Proc. VLDB, 2001)
Outline
• Problem
• Potter’s Wheel
• Architecture
• Discrepancy detection
• Structures
• MDL metric
• Interactive transformations
• Splitting
• Evaluation
• Conclusion
• Discussion
Problem
• Data cleaning is an important process.
• Cleaning involves data auditing and data transformation.
• Current solutions (ETL and reengineering tools) are:
  • Iterative.
  • Not interactive.
  • Subject to long wait times.
Potter’s Wheel
• Integrates transformation and discrepancy detection.
• Interactive transformations.
• Reduced wait times.
Architecture
• Four parts:
  • Data source (tabular, not nested)
  • Online reorderer (spreadsheet, sorting, dynamic display)
  • Automatic discrepancy detector (runs in the background)
  • Transformation engine (applies transforms immediately and in the background)
Discrepancy detection
• Performed automatically in the background.
• Done by finding suitable structures:
  • A structure is a string of domains.
  • Custom domains can be defined.
  • Records that do not fit the structure are reported as discrepancies.
• Structures can be parameterized:
  • Statistics can then be used to detect anomalies.
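
The idea on this slide, a structure as a string of domains with non-fitting records flagged, can be sketched in Python. This is a minimal illustration and not the system's implementation: the domain names `Word` and `Int`, the `DOMAINS` table, and both helper functions are assumptions made for the example, and each domain is approximated by a regular expression.

```python
import re

# Hypothetical domains for illustration; each is approximated by a regex fragment.
DOMAINS = {
    "Word": r"[A-Za-z]+",
    "Int": r"[0-9]+",
}


def structure_to_regex(structure):
    """Compile a structure (a list of domain names and literal strings) into one regex.

    Tokens that are not domain names are treated as literal text.
    """
    parts = [DOMAINS[tok] if tok in DOMAINS else re.escape(tok) for tok in structure]
    return re.compile("".join(parts))


def find_discrepancies(values, structure):
    """Return the values that do not fit the structure as a whole."""
    pattern = structure_to_regex(structure)
    return [v for v in values if pattern.fullmatch(v) is None]


dates = ["Mar 12", "Apr 3", "19 May", "Jun 21"]
structure = ["Word", " ", "Int"]
print(find_discrepancies(dates, structure))  # prints ['19 May']
```

Here `19 May` is flagged because it begins with digits where the structure expects a word.
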
Structures
• What makes a good structure:
  • Recall (structure matches as many column values as possible)
  • Precision (structure matches as few other values as possible; avoid overly broad structures)
  • Conciseness (structure should have minimum length; avoid overfitting)
• How a structure is inferred:
  • Minimum Description Length (MDL) metric.
MDL metric
• Description length (DL):
  • Measure of the space needed to describe a set of column values, given a structure.
  • $DL(v, S) = (1 - f)\,\log|\xi^{\mathrm{len}(v)}| + p \log m + f \cdot (\text{space to express } v \text{ with } S)$
    • $(1 - f)\,\log|\xi^{\mathrm{len}(v)}|$: recall
    • $p \log m$: conciseness
    • $f \cdot (\text{space to express } v \text{ with } S)$: precision
• Structure inference algorithm:
  • Enumerate a fixed number of structures recursively.
  • Use each structure to compute the description length (DL) for all values of a particular column.
  • Select the structure with the lowest DL.
• Structure found, thus discrepancies found. What’s next?
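
A rough Python sketch of the DL measure and the "select the lowest DL" step follows. It rests on several assumptions that the slide does not state: `f` is read as the fraction of column values that match the structure, `p` as the number of tokens in the structure, `m` as the number of known domains, and the alphabet `ξ` as printable ASCII. The domain table, the per-character encoding cost, and the fixed candidate list (standing in for recursive enumeration) are all simplifications for illustration.

```python
import math
import re

ALPHABET_SIZE = 95  # assumption: printable ASCII stands in for the alphabet

# Hypothetical domains: name -> (regex for one character, number of distinct characters)
DOMAINS = {
    "Int": ("[0-9]", 10),
    "Word": ("[A-Za-z]", 52),
    "Any": (".", ALPHABET_SIZE),
}


def compile_structure(structure):
    """Build a regex with one capture group per domain token; other tokens are literals."""
    parts = []
    for tok in structure:
        if tok in DOMAINS:
            parts.append("(" + DOMAINS[tok][0] + "+)")
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts))


def description_length(values, structure):
    """Average DL in bits per value, plus the cost of stating the structure."""
    pattern = compile_structure(structure)
    sizes = [DOMAINS[tok][1] for tok in structure if tok in DOMAINS]
    total_bits = 0.0
    for v in values:
        match = pattern.fullmatch(v)
        if match:
            # Precision term: each matched piece costs len * log2(domain size) bits.
            total_bits += sum(len(g) * math.log2(n) for g, n in zip(match.groups(), sizes))
        else:
            # Recall term: a non-matching value is spelled out character by character.
            total_bits += len(v) * math.log2(ALPHABET_SIZE)
    # Averaging over all values weights the two cases by f and (1 - f).
    conciseness = len(structure) * math.log2(len(DOMAINS))  # p log m
    return conciseness + total_bits / len(values)


values = ["Mar 12", "Apr 3", "Jun 21", "19 May"]
candidates = [["Any"], ["Int"], ["Word", " ", "Int"], ["Word", " ", "Any"]]
best = min(candidates, key=lambda s: description_length(values, s))
print(best)  # prints ['Word', ' ', 'Int']
```

The overly broad `["Any"]` matches everything but encodes each value expensively, while `["Int"]` matches nothing, so the more specific `Word`, space, `Int` structure wins even though one value does not fit it.
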
Interactive transformations
• GUI provided for simple transformations:
  • Add, drop, copy, fold, etc.
  • Undo supported.
• GUI not possible for complicated transforms:
  • Splitting.
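
As a small illustration of simple transforms with undo, the Python sketch below keeps the source rows untouched and records transforms in a list, so undo just drops the last entry. The `Session` class, the function names, and this design are assumptions for the example. In particular, `fold_columns` encodes one reading of fold (each row becomes several rows, one per folded column, with the other columns replicated), which the slide itself does not define.

```python
def add_column(value):
    """Transform that appends a constant column."""
    return lambda rows: [r + [value] for r in rows]


def drop_column(col):
    """Transform that removes column `col` from every row."""
    return lambda rows: [r[:col] + r[col + 1:] for r in rows]


def copy_column(col):
    """Transform that appends a copy of column `col`."""
    return lambda rows: [r + [r[col]] for r in rows]


def fold_columns(cols):
    """Transform that turns each row into len(cols) rows (assumed semantics of fold)."""
    def apply(rows):
        out = []
        for r in rows:
            rest = [x for i, x in enumerate(r) if i not in cols]
            for c in cols:
                out.append(rest + [r[c]])
        return out
    return apply


class Session:
    """Holds the source rows and a list of pending transforms."""

    def __init__(self, rows):
        self.source = rows
        self.transforms = []

    def apply(self, transform):
        self.transforms.append(transform)

    def undo(self):
        # Undo discards the most recent transform; the source is never modified.
        if self.transforms:
            self.transforms.pop()

    def view(self):
        rows = [list(r) for r in self.source]
        for transform in self.transforms:
            rows = transform(rows)
        return rows


session = Session([["Ann", 90, 85], ["Bob", 70, 60]])
session.apply(fold_columns([1, 2]))
session.apply(drop_column(0))
session.undo()  # back to the folded table
print(session.view())
```
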
Splitting
• Done by example.
• MDL metric used to infer structures.
• Once a structure is inferred, splitting follows one of three strategies:
  • Left-Right
  • Decreasing Specificity
  • Increasing Specificity
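
The left-to-right strategy can be sketched as a backtracking match in Python. This is an illustration under assumptions, not the tool's algorithm: the inferred structure of each output piece is stood in for by a hand-written regular expression, and `split_left_right` is a hypothetical helper name. The specificity-ordered strategies are not shown.

```python
import re


def split_left_right(value, piece_patterns):
    """Split `value` into consecutive pieces so that piece i fully matches pattern i.

    Patterns are tried left to right with backtracking.
    Returns the list of pieces, or None if no such split exists.
    """
    if not piece_patterns:
        return [] if value == "" else None
    head, rest = piece_patterns[0], piece_patterns[1:]
    for end in range(len(value) + 1):
        if re.fullmatch(head, value[:end]):
            tail = split_left_right(value[end:], rest)
            if tail is not None:
                return [value[:end]] + tail
    return None


# Assumed structures for the pieces of a "Last, First" name column.
pieces = [r"[A-Za-z]+", r", ", r"[A-Za-z]+"]
print(split_left_right("Taylor, Jane", pieces))  # prints ['Taylor', ', ', 'Jane']
print(split_left_right("Madonna", pieces))       # prints None
```

A value that cannot be split, such as `Madonna` here, is itself a discrepancy that could be shown to the user.
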
Evaluation
• Structure inference algorithm works:
  • Based on examples.
  • Based on algorithm’s definition.
• Decreasing specificity was found to be the fastest splitter:
  • Specificity = sum(DL of example values, given S)
  • Works best for splits involving many structures.
• Inferring structures is superior to inferring regular expressions:
  • Works on custom user-defined domains in a way that is robust to structural data errors.
Conclusion
• Potter’s Wheel tool:
  • Interactive
  • Integrated
• Future work:
  • Transforming nested data
  • Complex transforms (e.g., format via examples)
Thank you