3. Problem
• Data cleaning is an important process.
• Cleaning involves data auditing and data
transformations.
• Current solutions (ETL and reengineering tools) :
• Iterative.
• Not interactive.
• Long wait times.
5. Architecture
• 4 parts :
• Data source (tabular, not nested)
• Online reorderer (spreadsheet, sorting, dynamic
display)
• Automatic discrepancy detector (runs in background)
• Transformation engine (applies transforms
immediately and in the background)
6. Discrepancy detection
• Performed in the background automatically.
• Done by finding suitable structures :
• Structure is a string of domains.
• Custom domains can be defined.
• Find records that do not fit the structure.
• Structures can be parameterized :
• Can use statistics to compute anomalies.
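The detection step above can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: the domain names (Int, Word), their regular expressions, and the helper names are all assumptions made for the example. A structure is modelled as a list of domain names and literal strings, and a discrepancy is any value that does not fully match it.

```python
import re

# Hypothetical domains (assumed for this sketch): name -> regex for that domain.
DOMAINS = {
    "Int": r"[0-9]+",
    "Word": r"[A-Za-z]+",
}

def structure_to_regex(structure):
    """Compile a structure (domain names and literal strings) into one regex."""
    parts = []
    for item in structure:
        if item in DOMAINS:
            parts.append(DOMAINS[item])
        else:
            parts.append(re.escape(item))  # anything else is treated as a literal
    return re.compile("".join(parts))

def find_discrepancies(values, structure):
    """Return the values that do not fully match the structure."""
    pattern = structure_to_regex(structure)
    return [v for v in values if pattern.fullmatch(v) is None]

dates = ["12/25/2000", "01/02/1999", "Jan 5 2001", "3/4/98"]
date_structure = ["Int", "/", "Int", "/", "Int"]
discrepancies = find_discrepancies(dates, date_structure)
print(discrepancies)
```

Note that "3/4/98" still fits this structure. Catching it would need a parameterized domain, for example an Int constrained to a fixed number of digits.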
7. Structures
• What makes a good structure :
• Recall (structure matches as many of the column values as
possible)
• Precision (structure matches as few other values as
possible; avoid overly broad structures)
• Conciseness (structure should have minimum length;
avoid overfitting)
• How is a structure inferred :
• Minimum Description Length (MDL) metric.
8. MDL metric
• Description length (DL) :
• Measure used to describe a set of column values, given a
structure.
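• $DL(v, S) = (1 - f)\,\log\left(|\xi|^{\mathrm{len}(v)}\right) + p \log m + f \cdot (\text{space to express } v \text{ w.r.t. } S)$
• The three terms capture recall, conciseness, and precision, respectively.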
• Structure inference algorithm :
• Enumerate fixed number of structures recursively.
• Use each structure to compute the description length (DL) of all
values of a particular column.
• Select structure with the lowest DL.
• Structure found, thus discrepancies found. What’s next?
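As a rough illustration of the selection step, the sketch below scores a few candidate structures and keeps the cheapest one. It rests on several assumptions that are not stated on the slides: f is read as the fraction of values that match the structure, p as the number of items in the structure, and m as the number of known domains. The domain table, the alphabet size of 128, and the function names are hypothetical. The recursive enumeration is replaced by a hand-written candidate list to keep the example short.

```python
import math
import re

ALPHABET_SIZE = 128  # assumed size of the raw character alphabet

# Hypothetical domains: name -> (character class, number of distinct characters).
DOMAINS = {
    "Digits": ("[0-9]", 10),
    "Letters": ("[A-Za-z]", 52),
    "Any": (r"[\x00-\x7f]", 128),
}

def compile_structure(structure):
    """Build a regex with one capture group per domain item in the structure."""
    parts = []
    for item in structure:
        if item in DOMAINS:
            parts.append("(" + DOMAINS[item][0] + "+)")
        else:
            parts.append(re.escape(item))  # literal separator
    return re.compile("".join(parts))

def description_length(values, structure):
    """Average bits per value under the structure, plus the structure's own cost."""
    pattern = compile_structure(structure)
    domain_items = [s for s in structure if s in DOMAINS]
    structure_bits = len(structure) * math.log2(len(DOMAINS))  # the p log m term
    total = 0.0
    for v in values:
        match = pattern.fullmatch(v)
        if match is None:
            # Unmatched value: encode every character over the raw alphabet.
            total += len(v) * math.log2(ALPHABET_SIZE)
        else:
            # Matched value: encode each piece within its (smaller) domain.
            for name, text in zip(domain_items, match.groups()):
                total += len(text) * math.log2(DOMAINS[name][1])
    return total / len(values) + structure_bits

def select_structure(values, candidates):
    """Pick the candidate structure with the lowest description length."""
    return min(candidates, key=lambda s: description_length(values, s))

column = ["12/25/2000", "01/02/1999", "03/04/1998"]
candidates = [["Any"], ["Letters"], ["Digits", "/", "Digits", "/", "Digits"]]
best = select_structure(column, candidates)
print(best)
```

The overly broad structure (Any) matches everything but is expensive per character, and the structure that matches nothing (Letters) pays the full raw cost. The specific date-like structure wins because its per-value cost is low enough to offset its longer description.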
9. Interactive transformations
• GUI provided for simple transformations :
• Add, drop, copy, fold, etc.
• Undo supported.
• GUI not possible for complicated transforms :
• Splitting.
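One plausible way to support simple transforms with undo is to leave the source rows untouched and keep an ordered log of transforms, so that undo only removes the last entry. The sketch below assumes this design. The class and function names are hypothetical, and only drop and copy are shown.

```python
def drop(col):
    """Transform that removes column `col` from every row."""
    return lambda rows: [r[:col] + r[col + 1:] for r in rows]

def copy(col):
    """Transform that appends a copy of column `col` to every row."""
    return lambda rows: [r + [r[col]] for r in rows]

class TransformLog:
    """Keeps the source rows unchanged and replays transforms on demand."""

    def __init__(self, rows):
        self.source = rows
        self.transforms = []

    def apply(self, transform):
        self.transforms.append(transform)

    def undo(self):
        # Undo discards the most recent transform; the data is never mutated.
        if self.transforms:
            self.transforms.pop()

    def view(self):
        rows = self.source
        for transform in self.transforms:
            rows = transform(rows)
        return rows

log = TransformLog([["Ann", "Smith", "NY"], ["Bob", "Lee", "LA"]])
log.apply(copy(0))
log.apply(drop(1))
after_both = log.view()
log.undo()
after_undo = log.view()
```

Because each transform returns new rows instead of editing in place, the same log could in principle be replayed over the full data set in the background while the user works on the visible portion.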
10. Splitting
• Done by example.
• MDL metric used to infer structures.
• Once structure is inferred, splitting follows :
• Left to right
• Decreasing Specificity
• Increasing Specificity
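The left-to-right strategy can be illustrated with a short sketch: once a structure is known, each item is matched in order starting from the left of the value. The domain table and function name are assumptions for the example, and the matching is greedy with no backtracking, which is a simplification. The two specificity-based orderings are not shown.

```python
import re

# Hypothetical domains (assumed for this sketch): name -> regex for that domain.
DOMAINS = {"Word": r"[A-Za-z]+", "Int": r"[0-9]+"}

def split_left_to_right(value, structure):
    """Match each structure item in order; return the pieces, or None if the value does not fit."""
    pieces = []
    pos = 0
    for item in structure:
        # Domain names use their regex; anything else is matched as a literal.
        pattern = re.compile(DOMAINS.get(item, re.escape(item)))
        match = pattern.match(value, pos)
        if match is None:
            return None
        pieces.append(match.group())
        pos = match.end()
    # The whole value must be consumed for the split to be valid.
    return pieces if pos == len(value) else None

name_structure = ["Word", ", ", "Word", " ", "Int"]
pieces = split_left_to_right("Taylor, Jane 42", name_structure)
print(pieces)
```

A value that does not fit the structure returns None, so the same routine doubles as a check for rows that cannot be split.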
11. Evaluation
• Structure inference algorithm works :
• Based on examples.
• Based on algorithm’s definition.
• Decreasing specificity was found to be the fastest
splitter :
• Specificity = sum (DL of example values, given S)
• Works best for splits involving many structures.
• Inferring structures superior to inferring regular
expressions :
• Works on custom user-defined domains in a way that is
robust to structural data errors.
12. Conclusion
• Potter’s Wheel tool :
• Interactive
• Integrated
• Future work :
• Transforming nested data
• Complex transforms (e.g., Format via examples)