# Potter's Wheel


1. Potter's Wheel: An Interactive Data Cleaning System (Raman and Hellerstein, Proc. VLDB, 2001)
2. Outline
    - Problem
    - Potter's Wheel
    - Architecture
    - Discrepancy detection
    - Structures
    - MDL metric
    - Interactive transformations
    - Splitting
    - Evaluation
    - Conclusion
    - Discussion
3. Problem
    - Data cleaning is an important process.
    - Cleaning involves data auditing and data transformations.
    - Current solutions (ETL and reengineering tools) are:
        - Iterative.
        - Not interactive.
        - Slow, with long wait times.
4. Potter's Wheel
    - Integrates transformation and discrepancy detection.
    - Interactive transformations.
    - Reduced wait times.
5. Architecture
    - 4 parts:
        - Data source (tabular, not nested)
        - Online reorderer (spreadsheet, sorting, dynamic display)
        - Automatic discrepancy detector (runs in the background)
        - Transformation engine (applies transforms immediately and in the background)
6. Discrepancy detection
    - Performed automatically in the background.
    - Done by finding suitable structures:
        - A structure is a string of domains.
        - Custom domains can be defined.
        - Find records that do not fit the structure.
    - Structures can be parameterized:
        - Can use statistics to compute anomalies.
7. Structures
    - What makes a good structure:
        - Recall (the structure matches as many of the column values as possible)
        - Precision (the structure matches as few other possible values as possible; avoid overly broad structures)
        - Conciseness (the structure should have minimum length; avoid overfitting)
    - How a structure is inferred:
        - Minimum Description Length (MDL) metric.
8. MDL metric
    - Description length (DL):
        - Measure of the space needed to describe a set of column values, given a structure.
        - $DL(v, S) = \underbrace{(1 - f)\,\log\left(|\xi|^{\mathrm{len}(v)}\right)}_{\text{recall}} + \underbrace{p \log m}_{\text{conciseness}} + \underbrace{f \cdot (\text{space to express } v \text{ with } S)}_{\text{precision}}$
    - Structure inference algorithm:
        - Enumerate a fixed number of structures recursively.
        - Use each structure to compute the description length (DL) for all values of a particular column.
        - Select the structure with the lowest DL.
    - Once a structure is found, the discrepancies are found too. What's next?
9. Interactive transformations
    - GUI provided for simple transformations:
        - Add, drop, copy, fold, etc.
        - Undo supported.
    - GUI not possible for complicated transforms:
        - Splitting.
10. Splitting
    - Done by example.
    - MDL metric used to infer structures.
    - Once a structure is inferred, splitting follows one of three strategies:
        - Left to right
        - Decreasing specificity
        - Increasing specificity
11. Evaluation
    - Structure inference algorithm works:
        - Based on examples.
        - Based on the algorithm's definition.
    - Decreasing specificity was found to be the faster splitter:
        - Specificity = sum (DL of example values, given S)
        - Works best for splits involving many structures.
    - Inferring structures is superior to inferring regular expressions:
        - Works on custom user-defined domains in a way that is robust to structural data errors.
12. Conclusion
    - Potter's Wheel tool:
        - Interactive
        - Integrated
    - Future work:
        - Transforming nested data
        - Complex transforms (e.g., format via examples)
13. Thank you
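
The structure inference and discrepancy detection steps from slides 6 to 8 can be illustrated with a minimal Python sketch. It is not the paper's implementation. The three domains, the 7-bit alphabet size, the per-character bit costs, and the fixed candidate list (standing in for the recursive enumeration) are all assumptions made for illustration, as are the names `Domain`, `description_length`, `infer_structure`, and `find_discrepancies`. The score follows the three-term shape of the DL formula: a cost for values the structure misses, a cost for the structure's own length, and a cost for encoding the values it matches.

```python
import math
import re
from dataclasses import dataclass

ALPHABET_SIZE = 128  # assumed |xi|: 7-bit ASCII


@dataclass(frozen=True)
class Domain:
    name: str
    pattern: str       # regex matching one or more characters of the domain
    charset_size: int  # number of distinct characters the domain allows


# Hypothetical domains; the real system also supports user-defined ones.
DIGITS = Domain("digits", r"[0-9]+", 10)
LETTERS = Domain("letters", r"[A-Za-z]+", 52)
ANY = Domain("any", r".+", ALPHABET_SIZE)
KNOWN_DOMAINS = (DIGITS, LETTERS, ANY)


def compile_structure(structure):
    """A structure is a tuple of Domain objects and literal separator strings."""
    parts = [f"({p.pattern})" if isinstance(p, Domain) else re.escape(p)
             for p in structure]
    return re.compile("".join(parts))


def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0


def description_length(values, structure):
    """Simplified DL: average bits per value, given the structure."""
    regex = compile_structure(structure)
    domains = [p for p in structure if isinstance(p, Domain)]
    matched_bits, unmatched_bits = [], []
    for v in values:
        m = regex.fullmatch(v)
        if m:
            # Each matched piece costs len * log2(size of its domain's charset).
            matched_bits.append(sum(len(g) * math.log2(d.charset_size)
                                    for g, d in zip(m.groups(), domains)))
        else:
            # Values that do not fit are spelled out character by character.
            unmatched_bits.append(len(v) * math.log2(ALPHABET_SIZE))
    f = len(matched_bits) / len(values)                 # fraction matched
    unmatched_cost = (1 - f) * mean(unmatched_bits)     # recall term
    structure_cost = len(structure) * math.log2(len(KNOWN_DOMAINS))  # p log m
    matched_cost = f * mean(matched_bits)               # precision term
    return unmatched_cost + structure_cost + matched_cost


def infer_structure(values, candidates):
    """Pick the candidate structure with the lowest description length."""
    return min(candidates, key=lambda s: description_length(values, s))


def find_discrepancies(values, structure):
    """Values that do not fit the inferred structure are flagged."""
    regex = compile_structure(structure)
    return [v for v in values if not regex.fullmatch(v)]


values = ["12/03/2001", "01/11/1999", "7/4/2000", "March 4"]
DATE = (DIGITS, "/", DIGITS, "/", DIGITS)
candidates = [(ANY,), (DIGITS,), (LETTERS, " ", DIGITS), DATE]

best = infer_structure(values, candidates)
print([p.name if isinstance(p, Domain) else p for p in best])
print(find_discrepancies(values, best))
```

On this sample column the digits-slash-digits-slash-digits structure scores lowest, and `March 4` is flagged as the discrepancy. The catch-all `any` structure matches everything but pays a high per-character cost, which is how the precision term penalizes overly broad structures.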