Declarative Data Cleaning:
Language, Model, and Algorithms
(Galhardas et al., Proc. VLDB, 2001)
• Motivating example
• AJAX framework
• Logical layer
• Physical layer
• Neighborhood join algorithm
• Multi-pass neighborhood algorithm
• Related work
• Data cleaning is a difficult problem.
• Current solutions (ETL and reengineering tools) :
• Not sophisticated enough to design data flow graphs
efficiently and effectively.
• Hinder the stepwise refinement process crucial to data cleaning.
• Logical layer :
• Declarative language to express data cleaning using
logical operators (extension of SQL).
• Physical layer :
• Specifies the algorithm used to implement each logical operator.
• Exceptions as a mechanism to solicit user interaction.
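The exception idea can be illustrated with a minimal Python sketch. It does not use AJAX syntax; the names apply_with_exceptions and parse_year and the sample rows are hypothetical. It assumes a transformation signals a record it cannot handle by raising ValueError, and that such records are set aside for the user instead of aborting the run.

```python
def apply_with_exceptions(records, transform):
    """Apply transform to each record; collect failures instead of stopping."""
    outputs, exceptions = [], []
    for record in records:
        try:
            outputs.append(transform(record))
        except ValueError as err:
            # Keep the offending record and the reason for later user review.
            exceptions.append((record, str(err)))
    return outputs, exceptions


def parse_year(record):
    """Hypothetical transformation: convert the 'year' field to an int."""
    year = record["year"].strip()
    if not (year.isdigit() and len(year) == 4):
        raise ValueError(f"unparseable year: {year!r}")
    return {**record, "year": int(year)}


rows = [{"title": "A", "year": "2001"}, {"title": "B", "year": "20o1"}]
cleaned, rejected = apply_with_exceptions(rows, parse_year)
```

Here cleaned holds the records that were transformed, and rejected holds the records a user would inspect and correct by hand before re-running the step.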
• 5 Operations :
• Matching (important)
• Duplicate elimination is handled by a sequence of
match, cluster, and merge.
• Implementations written in 3GL and registered
within the AJAX library.
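The match, cluster, merge sequence can be sketched in Python as follows. This is an illustration under assumptions, not AJAX code: the function names, the same_name predicate, and the rule of keeping the longest string as the merged value are all hypothetical choices. Clustering here is the transitive closure of the matched pairs, computed with union-find.

```python
from itertools import combinations


def match(records, similar):
    """Return index pairs (i, j), i < j, whose records are judged similar."""
    return [(i, j) for i, j in combinations(range(len(records)), 2)
            if similar(records[i], records[j])]


def cluster(n, pairs):
    """Group indices 0..n-1 by the transitive closure of pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def merge(records, groups, pick):
    """Collapse each cluster into one representative record."""
    return [pick([records[i] for i in group]) for group in groups]


def same_name(a, b):
    """Hypothetical similarity test: equal after dropping case and periods."""
    return a.lower().replace(".", "").split() == b.lower().replace(".", "").split()


names = ["J. Smith", "j smith", "J Smith", "A. Jones"]
pairs = match(names, same_name)
groups = cluster(len(names), pairs)
deduped = merge(names, groups, pick=lambda g: max(g, key=len))
```

Keeping the three steps separate means the similarity test or the merge rule can be changed without touching the rest of the pipeline.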
• Matching algorithms :
• Neighborhood Join optimization (NJ).
• Multi-pass Neighborhood optimization (MPN).
• Apply distance filters to the naïve algorithm.
• Devise a function over input tuples whose similarity is cheaper
to compute than the actual similarity.
• E.g., use prefixes of strings.
• Actual similarity is only computed for pairs that pass the filter.
• Damerau-Levenshtein distance as the similarity measure.
• Transitive closure.
• NJ does not allow false dismissals.
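A minimal Python sketch of filtered matching, under stated assumptions. The filter used here is the difference in string lengths rather than the prefix example above, because the length difference is a lower bound on the edit distance and therefore cannot cause a false dismissal. The distance is the optimal string alignment variant of Damerau-Levenshtein. The function names and sample strings are hypothetical, and this is not the paper's implementation.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def neighborhood_join(left, right, max_dist):
    """Return pairs within max_dist and the number of full comparisons made."""
    pairs, comparisons = [], 0
    for s in left:
        for t in right:
            # Cheap filter: every edit changes the length by at most one,
            # so a larger length gap rules the pair out without any loss.
            if abs(len(s) - len(t)) > max_dist:
                continue
            comparisons += 1
            if osa_distance(s, t) <= max_dist:
                pairs.append((s, t))
    return pairs, comparisons


left = ["VLDB", "SIGMOD"]
right = ["VLBD", "SIGMOD Record", "VLDB Journal"]
matches, comparisons = neighborhood_join(left, right, max_dist=1)
```

In this example the filter discards five of the six candidate pairs, so the expensive distance is computed only once.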
• MPN relaxes this requirement.
• Algorithm :
• Outer join on relations.
• Select key for each record.
• Sort all keys.
• Compare only records that fall within a fixed-size window of each other in sorted order.
• Multiple passes allowed.
• MPN is faster but less accurate than NJ.
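The key, sort, and window steps can be sketched in Python as below. This is an illustration only: it works on a single list of records, the function names, the sample people, and the two key functions are hypothetical, and it produces candidate pairs rather than final matches. A window of size w compares each record with the w - 1 records that follow it in sorted order.

```python
def sorted_neighborhood_pass(records, key, window):
    """One pass: sort by key, pair each record with its window neighbours."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    candidates = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            candidates.add((min(i, j), max(i, j)))
    return candidates


def multi_pass(records, keys, window):
    """Union the candidate pairs found by one pass per key function."""
    candidates = set()
    for key in keys:
        candidates |= sorted_neighborhood_pass(records, key, window)
    return candidates


# (last name, first name); record 1 has a typo in the first letter.
people = [("smith", "john"), ("zmith", "john"), ("jones", "amy"),
          ("smith", "jon"), ("brown", "bob")]


def by_last(person):
    return person[0]


def by_first(person):
    return person[1]


pass_one = sorted_neighborhood_pass(people, by_last, window=2)
both = multi_pass(people, [by_last, by_first], window=2)
```

Sorting by last name alone puts records 0 and 1 far apart, so the pair (0, 1) is missed; the second pass on first name recovers it. Pairs that no pass places inside a window are never compared, which is where false dismissals come from.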
• NJ algorithm is able to achieve a recall of 1 much
faster than the MPN method for more unstructured
• E.g., event name vs author name
• AJAX has more operations than related languages
• SQL doesn’t have merging and clustering operations
or exception support.
• WHIRL doesn’t have merging and clustering.
• AJAX and Potter’s Wheel are both interactive.
• Potter’s Wheel’s automatic discrepancy detection
algorithm can be integrated into AJAX.
• AJAX framework :
• Logical and physical separation.
• Declarative language to specify transformations.
• Exceptions as a way to solicit interactions.