Ajax


  1. 1. Declarative Data Cleaning: Language, Model, and Algorithms (Galhardas et al., Proc. VLDB, 2001)
  2. 2. Outline • Problem • Motivating example • AJAX framework • Logical layer • Physical layer • Matching • Neighborhood join algorithm • Multi-pass neighborhood algorithm • Evaluation • Related work • Conclusion • Discussion
  3. 3. Problem • Data cleaning is a difficult problem. • Current solutions (ETL and reengineering tools): • Not sophisticated enough to design data flow graphs efficiently and effectively. • Non-interactive. • Hinder the stepwise refinement process that is crucial to data cleaning.
  4. 4. Motivating Example
  5. 5. AJAX framework • Logical layer : • Declarative language to express data cleaning using logical operators (extension of SQL). • Physical layer : • Specify algorithm. • Optimization. • Exceptions as a mechanism to solicit user interaction.
  6. 6. Logical layer • 5 Operations : • Mapping • View • Matching (important) • Clustering • Merging • Duplicate elimination is handled by a sequence of match, cluster, and merge.
  7. 7. Physical layer • Implementations written in 3GL and registered within the AJAX library. • Matching algorithms : • Naïve. • Neighborhood Join optimization (NJ). • Multi-pass Neighborhood optimization (MPN).
  8. 8. NJ optimization • Apply distance filters to the naïve algorithm. • Devise a function over input tuples whose similarity is cheaper to compute than the actual similarity. • E.g., use prefixes of strings. • Actual similarity is computed only for pairs that pass the filter. • Damerau-Levenshtein distance for similarity. • Transitive closure.
  9. 9. NJ optimization
  10. 10. MPN optimization • NJ does not allow false dismissals. • MPN relaxes this requirement. • Algorithm: • Outer join on relations. • Select a key for each record. • Sort all keys. • Compare only records that are close, i.e., within a fixed window. • Multiple passes allowed.
  11. 11. Evaluation • MPN faster but less accurate than NJ. • NJ algorithm is able to achieve a recall of 1 much faster than the MPN method for more unstructured domains : • E.g., event name vs author name
  12. 12. Related work • AJAX has more operations than related languages: • SQL doesn’t have merging and clustering operations or exception support. • WHIRL doesn’t have merging and clustering. • AJAX and Potter’s Wheel are both interactive. • Potter’s Wheel’s automatic discrepancy detection algorithm can be integrated into AJAX.
  13. 13. Conclusion • AJAX framework : • Logical and physical separation. • Declarative language to specify transformations. • Exceptions as a way to solicit interactions.
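
The two matching strategies from slides 8 to 10 can be illustrated with a small Python sketch. It is not the AJAX implementation: all function names are hypothetical, the records are plain strings, and the cheap filter is a string-length difference (a lower bound on edit distance, so it causes no false dismissals) rather than the prefix-based filter the slides mention. The filtered variant still enumerates all pairs for simplicity, whereas a real neighborhood join would use the filter to avoid that. The windowed variant sorts by a key and compares only neighbors, so it may miss matches. Matched pairs are then grouped by transitive closure.

```python
from itertools import combinations


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def match_with_filter(records, max_dist):
    """NJ-style matching: a cheap filter first, the real distance only for survivors."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        # Assumed filter: the length difference never exceeds the edit distance,
        # so skipping these pairs cannot dismiss a true match.
        if abs(len(a) - len(b)) > max_dist:
            continue
        if damerau_levenshtein(a, b) <= max_dist:
            pairs.append((i, j))
    return pairs


def match_with_window(records, key, window, max_dist):
    """MPN-style single pass: sort by key, compare records within a fixed window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if damerau_levenshtein(records[i], records[j]) <= max_dist:
                pairs.append((min(i, j), max(i, j)))
    return pairs


def cluster(n, pairs):
    """Group record indices by transitive closure of the matched pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


names = ["Galhardas", "Galahrdas", "Florescu", "Florescue", "Shasha"]
exact = match_with_filter(names, max_dist=1)
print(cluster(len(names), exact))
# A poorly chosen key (here the last character) separates a true pair in sort order.
print(match_with_window(names, key=lambda s: s[-1], window=2, max_dist=1))
```

With the last-character key, the windowed pass finds only one of the two duplicate pairs, which illustrates why MPN allows multiple passes with different keys.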
