
SampleClean

by Sanjay Krishnan

Published in: Data & Analytics

  1. 1. SampleClean: Fast and Reliable Analytics on Dirty Data. Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska, Tova Milo, Eugene Wu.
  2. 2. “These bits will only work if you close your eyes and pick one uniformly at random”
  3. 3. “Let {X1, X2, …, Xn} be a set of i.i.d. random variables”
  4. 4. What Real Data Looks Like
  5. 5. Diagram labels: Problem?, Data, Model; Acquisition, ETL, Cleaning, Imputation, Features
  6. 6. Diagram labels: Problem?, Data, Model; Acquisition, ETL, Cleaning, Imputation, Features; Constraints, Transformation, Budget, Freshness, User Interaction
  7. 7. SampleClean Project • One stack to rule them all: Berkeley Data Analytics Stack • Bring together database and statistical theory to analyze more realistic data processing pipelines. • Data cleaning library on Spark: sampleclean.org
  8. 8. Our Work • SampleClean: Fast and Reliable Analytics on Dirty Data. IEEE Data Engineering Bul. 2015 • Data Cleaning + Privacy: PrivateClean: Cleaning and Querying Differentially Private Tables. SIGMOD 2016 • Data Cleaning + Machine Learning: ActiveClean: Progressive Data Cleaning For Convex Data Analytics. • Reliable analytics on stale views: Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views. VLDB 2015
  9. 9. Outline • Research Overview • SampleClean Data Cleaning Library • Summary
  10. 10. Data Cleaning is Expensive
  11. 11. Basic Idea: Data Scientists Often Work With Samples Dirty Data Dirty Sample Clean Sample
  12. 12. Basic Idea: Data Scientists Often Work With Samples Dirty Data Dirty Sample Clean Sample Query
  13. 13. Challenges (Bug #1). Example records:
      the palm | 9001 santa monica blvd. | los angeles | steakhouses
      patina | 5955 melrose ave. | los angeles | californian
      philippe's the original | 1001 n. alameda st. | chinatown (la) | american
      pinot bistro | 12969 ventura blvd. | los angeles | french
      rex il ristorante | 617 s. olive st. | los angeles | italian
      rex il ristorante | 617 s. olive st. | los angeles | nuova cucina italian
  14. 14. Transform Dirty Sample to Simulate Clean Sample. Diagram labels: Dirty Database, Sample, Φ(.), Clean Sample, Cleaned Database
  15. 15. Probabilistic Interpretation • SUM, COUNT, AVG, VAR can be expressed as a mean. • Probabilistic Interpretation: Expected Values
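As a sketch of the claim on slide 15, assume a relation R with N rows, a numeric attribute a, and a predicate p. These symbols are introduced here for illustration and are not taken from the slides. Under those assumptions each aggregate can be written in terms of means over R, which is why a uniform sample can be used to estimate it:

```latex
\begin{align*}
\mathrm{SUM}(a \mid p)   &= N \cdot \frac{1}{N} \sum_{r \in R} a(r)\,\mathbf{1}[p(r)] \\
\mathrm{COUNT}(p)        &= N \cdot \frac{1}{N} \sum_{r \in R} \mathbf{1}[p(r)] \\
\mathrm{AVG}(a)          &= \frac{1}{N} \sum_{r \in R} a(r) \\
\mathrm{VAR}(a)          &= \frac{1}{N} \sum_{r \in R} a(r)^2 - \left( \frac{1}{N} \sum_{r \in R} a(r) \right)^2
\end{align*}
```

Replacing each mean over R with the corresponding mean over a uniform sample gives an estimate whose expected value matches the mean it replaces. VAR is a function of two such means, so its plug-in estimate is only approximately unbiased for small samples.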
  16. 16. SampleClean Analysis. Let R be a dirty relation and C(.) be a row-by-row cleaning function, and suppose a user can call C(.) k << |R| times. For SUM, COUNT, AVG, and VAR queries with predicates, SampleClean provides a conditionally unbiased estimate of the result with asymptotic error equal to: [equation not captured in the transcript]. The finite-sample error for the query is given by: [equation not captured in the transcript]. Source: Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska, Tova Milo, Eugene Wu. SampleClean: Fast and Reliable Analytics on Dirty Data. IEEE Data Engineering Bul. 2015
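The idea of cleaning only a sample and reporting an estimate with an error bar can be sketched as follows. This is a minimal illustration and not the SampleClean implementation. It assumes the input is already a uniform random sample of a numeric column, that `clean` is a hypothetical stand-in for the row-by-row function C(.), and that a normal-approximation interval (z = 1.96 by default) is acceptable. The names `SampleMeanEstimate`, `Estimate`, and `estimateAvg` are invented for this sketch.

```scala
object SampleMeanEstimate {
  // Point estimate plus the half-width of a normal-approximation interval.
  final case class Estimate(mean: Double, halfWidth: Double)

  // dirtySample: a uniform random sample of the dirty column (assumed).
  // clean: hypothetical stand-in for the row-by-row cleaning function C(.).
  // z: normal quantile for the interval (1.96 is roughly 95%).
  def estimateAvg(dirtySample: Seq[Double],
                  clean: Double => Double,
                  z: Double = 1.96): Estimate = {
    require(dirtySample.size >= 2, "need at least two sampled rows")
    val cleaned = dirtySample.map(clean)
    val k = cleaned.size
    val mean = cleaned.sum / k
    // Unbiased sample variance of the cleaned values.
    val variance = cleaned.map(x => (x - mean) * (x - mean)).sum / (k - 1)
    Estimate(mean, z * math.sqrt(variance / k))
  }
}
```

Only the k sampled rows pass through `clean`, so the cleaning cost depends on the sample size rather than on |R|, and the interval narrows as k grows.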
  17. 17. Interactive Loop Data Cleaning Model Training Test Error
  18. 18. Progressive Cleaning
  19. 19. Progressive Data Cleaning
  20. 20. Machine Learning And Data Cleaning Bug #2
  21. 21. Stochastic Gradient Descent • We know how to get unbiased estimates from cleaned samples. • Extend to iteration (some clean/some dirty), non-uniform sampling, couple with error detection.
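To make the link between cleaned samples and gradient descent concrete, here is a minimal sketch of one gradient step computed from a cleaned mini-batch. It is plain mini-batch SGD for least-squares loss and not ActiveClean's estimator, which, per the slide, also has to handle mixed clean/dirty data and non-uniform sampling. The names `CleanBatchSgd`, `step`, and `stepSize` are invented for this sketch.

```scala
object CleanBatchSgd {
  // One gradient step for the loss 0.5 * (w.x - y)^2, averaged over a
  // mini-batch of already-cleaned (features, label) examples.
  // stepSize is a hypothetical learning rate chosen by the caller.
  def step(w: Vector[Double],
           batch: Seq[(Vector[Double], Double)],
           stepSize: Double): Vector[Double] = {
    require(batch.nonEmpty, "batch must not be empty")
    val zero = Vector.fill(w.size)(0.0)
    // Accumulate the average gradient over the batch.
    val grad = batch.foldLeft(zero) { case (acc, (x, y)) =>
      val residual = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
      acc.zip(x).map { case (gi, xi) => gi + residual * xi / batch.size }
    }
    // Move against the gradient.
    w.zip(grad).map { case (wi, gi) => wi - stepSize * gi }
  }
}
```

If the batch is a uniform sample of the clean data, the averaged gradient is an unbiased estimate of the full clean-data gradient, which is the property the first bullet relies on.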
  22. 22. ActiveClean: Architecture. Diagram labels: Dirty Data, Dirty Model, Sampler, Detector, Cleaner, Estimator, Update, Current Best Model, Clean Model
  23. 23. Results. Plot (a) Dollars For Docs: model error (%) vs. number of records cleaned (log scale); series: Dirty, AC, AC+O, SC, AL
  24. 24. Outline • Research Overview • SampleClean Data Cleaning Library • Summary
  25. 25. Systems Challenges. Diagram labels: Crowd, Machine Learning, Aggregate Query, Time • Data cleaning happens at different time scales. • Complex set of operations: a lot of black boxes
  26. 26. SampleClean Framework
  27. 27. Asynchronous Updates. Diagram labels: Cleaning, Working Set, Analysis 1, Analysis 2
  28. 28. Persistence on Apache Spark • Iteration 1: Slow • Iteration 2: Not enough semantics • Current: IndexedRDD
  29. 29. Anatomy of a Data Cleaning Workload. Diagram labels: Base Data; De-duplication; Block A, Block B, Block C, Block D; Singletons; Candidates; Easy; Hard
  30. 30. Benefits of IndexedRDD. Plot: Entity Resolution-Restaurant; time (ms) vs. threshold; series: SC-Naïve, SC+Filtering, SC+Filtering+Caching
  31. 31. Benefits of IndexedRDD. Plot: Extraction-Restaurant; time (ms, log scale) vs. sampling ratio; series: SC-Naïve, SC+HiveQL, SC+ColumnUpdate
  32. 32. Example De-duplication. Pipeline: Scraped webpages → Extract address and category → Deduplicate categories → Cleaned restaurant dataset
      val data_cleaning_algo = EntityResolution(sim_func, threshold)
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo)
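The slide does not show what `sim_func` and `threshold` do internally. As a rough sketch, assuming a token-set Jaccard similarity and a simple all-pairs comparison (both are assumptions, not the SampleClean library's API), threshold-based matching of category values could look like this. The names `JaccardDedup`, `jaccard`, and `candidatePairs` are invented for the example.

```scala
object JaccardDedup {
  // Token-set Jaccard similarity; a hypothetical stand-in for sim_func.
  def jaccard(a: String, b: String): Double = {
    val ta = a.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet
    val tb = b.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet
    if (ta.isEmpty && tb.isEmpty) 1.0
    else ta.intersect(tb).size.toDouble / ta.union(tb).size
  }

  // All pairs of distinct values whose similarity meets the threshold.
  // Quadratic in the number of distinct values, so only suitable for
  // small inputs or for use within a single block.
  def candidatePairs(values: Seq[String], threshold: Double): Seq[(String, String)] = {
    val distinct = values.distinct
    for {
      i <- distinct.indices
      j <- (i + 1) until distinct.size
      if jaccard(distinct(i), distinct(j)) >= threshold
    } yield (distinct(i), distinct(j))
  }
}
```

With the two categories from the Challenges slide, "italian" and "nuova cucina italian" share one of three distinct tokens, so their Jaccard similarity is 1/3. They are reported as a candidate pair only when the threshold is at or below that value.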
  33. 33. Add Crowd Sourcing
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo
          .addCrowdMatcher())
  34. 34. Tuning and Optimization
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo
          .tuneSimilarity(ground_truth)
          .addCrowdMatcher())
  35. 35. Add Other Operations
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo
          .tuneSimilarity(ground_truth)
          .addCrowdMatcher())
        .clean(data_cleaning_algo2)
  36. 36. Leverages Research Results
      • Sample-and-clean support
          val my_pipeline = restaurant.load().sample(0.1).clean(…).query("SELECT COUNT(1) FROM $t")
      • Sample-view-and-clean
          val my_pipeline = restaurant.load().sample(0.1).clean(…).view("SELECT * FROM $t WHERE city='la'")
      • Machine Learning
          val my_pipeline = restaurant.load().sample(0.1).clean(…).featureView(List("cat", "label"), ("name", "bag_of_words"))
  37. 37. Outline • Research Overview • SampleClean Data Cleaning Library • Summary
  38. 38. Summary • Statistical validity IS a data management problem. • The design of systems and frameworks can encourage violation of technical assumptions. • In many cases…relatively straightforward ways to correct for these issues. SIGMOD 14, VLDB 15a, VLDB 15b, IEEE Data Eng. Bul. 2015, SIGMOD 16*
  39. 39. Take Our Survey sampleclean.org/survey
