
SampleClean


by Sanjay Krishnan

Published in: Data & Analytics

  1. SampleClean: Fast and Reliable Analytics on Dirty Data. Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska, Tova Milo, Eugene Wu.
  2. “These bits will only work if you close your eyes and pick one uniformly at random”
  3. “Let {X1, X2, …, Xn} be a set of i.i.d. random variables”
  4. What Real Data Looks Like
  5. Data ETL Cleaning Acquisition Imputation Features Problem? Data Model
  6. ETL Cleaning Acquisition Imputation Features Problem? Data Model Constraints Transformation Budget Freshness User Interaction
  7. SampleClean Project • One stack to rule them all: Berkeley Data Analytics Stack • Bring together database and statistical theory to analyze more realistic data processing pipelines. • Data Cleaning library on Spark: sampleclean.org
  8. Our Work • SampleClean: Fast and Reliable Analytics on Dirty Data. IEEE Data Engineering Bul. 2015 • Data Cleaning + Privacy: PrivateClean: Cleaning and Querying Differentially Private Tables. SIGMOD 2016 • Data Cleaning + Machine Learning: ActiveClean: Progressive Data Cleaning For Convex Data Analytics. • Reliable analytics on stale views: Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views. VLDB 2015
  9. Outline • Research Overview • SampleClean Data Cleaning Library • Summary
  10. Data Cleaning is Expensive
  11. Basic Idea: Data Scientists Often Work With Samples. Dirty Data, Dirty Sample, Clean Sample
  12. Basic Idea: Data Scientists Often Work With Samples. Dirty Data, Dirty Sample, Clean Sample, Query
  13. Challenges
      the palm | 9001 santa monica blvd. | los angeles | steakhouses
      patina | 5955 melrose ave. | los angeles | californian
      philippe's the original | 1001 n. alameda st. | chinatown (la) | american
      pinot bistro | 12969 ventura blvd. | los angeles | french
      rex il ristorante | 617 s. olive st. | los angeles | italian
      rex il ristorante | 617 s. olive st. | los angeles | nuova cucina italian
      Bug #1
  14. Transform Dirty Sample to Simulate Clean Sample: Cleaned Database, Φ(.), Dirty Database, Sample, Clean Sample
  15. Probabilistic Interpretation • SUM, COUNT, AVG, VAR can be expressed as a mean. • Probabilistic Interpretation: Expected Values
  16. SampleClean Analysis. Let R be a dirty relation and C(.) be a row-by-row cleaning function, and suppose a user can call C(.) k ≪ |R| times. For SUM, COUNT, AVG, VAR queries with predicates, SampleClean provides a conditionally unbiased estimate of the result with asymptotic error equal to: [formula not preserved] The finite sample error for the query is given by: [formula not preserved] Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska, Tova Milo, Eugene Wu. SampleClean: Fast and Reliable Analytics on Dirty Data. IEEE Data Engineering Bul. 2015
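
The statement on slides 15 and 16 that SUM, COUNT, and AVG queries reduce to a mean can be illustrated with a small Scala sketch. This is not the SampleClean implementation and does not reproduce the error formulas missing from slide 16. It assumes a uniform random sample that has already been cleaned. The object name `MeanEstimates`, its method names, and the 1.96 multiplier (a textbook CLT-based 95% interval) are assumptions made for illustration.

```scala
object MeanEstimates {
  // Sample mean and a CLT-based 95% half-width for a sequence of per-row values.
  def meanWithError(xs: Seq[Double]): (Double, Double) = {
    require(xs.size > 1, "need at least two sampled rows")
    val k = xs.size
    val mean = xs.sum / k
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (k - 1)
    (mean, 1.96 * math.sqrt(variance / k))
  }

  // COUNT(*) WHERE pred: relation size n times the mean of a 0/1 indicator.
  def count(sample: Seq[Double], n: Long)(pred: Double => Boolean): (Double, Double) = {
    val (m, e) = meanWithError(sample.map(x => if (pred(x)) 1.0 else 0.0))
    (n * m, n * e)
  }

  // SUM(x) WHERE pred: relation size n times the mean of x, zeroed where pred fails.
  def sum(sample: Seq[Double], n: Long)(pred: Double => Boolean): (Double, Double) = {
    val (m, e) = meanWithError(sample.map(x => if (pred(x)) x else 0.0))
    (n * m, n * e)
  }

  // AVG(x) WHERE pred: plain mean over the sampled rows that satisfy pred.
  def avg(sample: Seq[Double])(pred: Double => Boolean): (Double, Double) =
    meanWithError(sample.filter(pred))
}
```

For example, `MeanEstimates.count(cleanedSample, 100L)(_ > 2.0)` returns an estimate and an interval half-width for a relation assumed to hold 100 rows.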
  17. Interactive Loop: Data Cleaning, Model Training, Test Error
  18. Progressive Cleaning
  19. Progressive Data Cleaning
  20. Machine Learning And Data Cleaning. Bug #2
  21. Stochastic Gradient Descent • We know how to get unbiased estimates from cleaned samples. • Extend to iteration (some clean/some dirty), non-uniform sampling, couple with error detection.
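
The idea on slide 21, taking gradient steps from cleaned samples while part of the data is still dirty, can be sketched as below. This is a simplified illustration under stated assumptions, not the ActiveClean algorithm. It assumes a squared loss, a uniform sample of the dirty rows that has been cleaned, and a `combine` step that weights the two gradient estimates by the share of rows each represents. All names (`SgdSketch`, `Example`, `gradient`, `combine`, `step`) are hypothetical.

```scala
object SgdSketch {
  type Vec = Vector[Double]
  final case class Example(x: Vec, y: Double)

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum

  // Average gradient of the squared loss 0.5 * (theta . x - y)^2 over a batch.
  def gradient(theta: Vec, batch: Seq[Example]): Vec = {
    require(batch.nonEmpty, "batch must not be empty")
    val total = batch.foldLeft(Vector.fill(theta.size)(0.0)) { (acc, ex) =>
      val residual = dot(theta, ex.x) - ex.y
      acc.zip(ex.x).map { case (g, xi) => g + residual * xi }
    }
    total.map(_ / batch.size)
  }

  // Assumed mixing rule: weight the gradient over already-clean rows and the
  // gradient from the freshly cleaned sample by the fraction of rows each covers.
  def combine(gClean: Vec, gSample: Vec, nClean: Int, nDirty: Int): Vec = {
    val n = (nClean + nDirty).toDouble
    gClean.zip(gSample).map { case (c, s) => (nClean / n) * c + (nDirty / n) * s }
  }

  // One gradient-descent update of the model parameters.
  def step(theta: Vec, g: Vec, learningRate: Double): Vec =
    theta.zip(g).map { case (t, gi) => t - learningRate * gi }
}
```

Because the cleaned batch is a uniform sample, its average gradient is an unbiased estimate of the gradient over the rows it was drawn from, which is what makes the update usable before all rows are cleaned.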
  22. ActiveClean: Architecture. Components: Dirty Data, Dirty Model, Sampler, Detector, Cleaner, Estimator, Update, Current Best Model, Clean Model
  23. Results. [Figure: (a) Dollars For Docs. Model Error (%) vs. # Records Cleaned (log scale). Series: Dirty, AC, AC+O, SC, AL]
  24. Outline • Research Overview • SampleClean Data Cleaning Library • Summary
  25. Systems Challenges. Crowd Machine Learning Aggregate Query Time • Data cleaning happens at different time scales. • Complex set of operations, with a lot of black boxes
  26. SampleClean Framework
  27. Asynchronous Updates: Cleaning, Analysis 2, Working Set, Analysis 1
  28. Persistence on Apache Spark: Slow Iteration 1 Not enough semantics Iteration 2 Current IndexedRDD
  29. Anatomy of a Data Cleaning Workload: Base Data De-duplication Block A Block B Block C Block D Block A Block B Block C Block D X X X X Singletons Candidates Easy Singletons Hard
  30. Benefits of IndexedRDD. [Figure: Entity Resolution-Restaurant. Time (ms) vs. Threshold. Series: SC-Naïve, SC+Filtering, SC+Filtering+Caching]
  31. Benefits of IndexedRDD. [Figure: Extraction-Restaurant. Time (log ms) vs. Sampling Ratio. Series: SC-Naïve, SC+HiveQL, SC+ColumnUpdate]
  32. Example De-duplication. Extract address and category, Scraped Webpages, Deduplicate categories, Cleaned restaurant dataset
      val data_cleaning_algo = EntityResolution(sim_func, threshold)
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo)
  33. Add Crowd Sourcing. Extract address and category, Scraped Webpages, Deduplicate categories, Cleaned restaurant dataset
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo
          .addCrowdMatcher())
  34. Tuning and Optimization. Extract address and category, Scraped Webpages, Deduplicate categories, Cleaned restaurant dataset
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo
          .tuneSimilarity(ground_truth)
          .addCrowdMatcher())
  35. Add Other Operations. Extract address and category, Scraped Webpages, Deduplicate categories, Cleaned restaurant dataset
      val my_data = restaurant.load(schema)
        .clean(data_cleaning_algo
          .tuneSimilarity(ground_truth)
          .addCrowdMatcher())
        .clean(data_cleaning_algo2)
  36. Leverages Research Results
      • Sample-and-clean support
      val my_pipeline = restaurant.load()
        .sample(0.1)
        .clean(…)
        .query("SELECT COUNT(1) FROM $t")
      • Sample-view-and-clean
      val my_pipeline = restaurant.load()
        .sample(0.1)
        .clean(…)
        .view("SELECT * FROM $t WHERE city='la'")
      • Machine Learning
      val my_pipeline = restaurant.load()
        .sample(0.1)
        .clean(…)
        .featureView(List(("cat", "label"), ("name", "bag_of_words")))
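
To make the load, sample, clean, query chain on slides 32 to 36 concrete, here is a small in-memory mock in plain Scala. It is a hypothetical stand-in, not the SampleClean library API. `PipelineSketch`, `Pipeline`, `load`, `estimateCount`, and the `Restaurant` fields are invented for illustration. The count is scaled from the sample to the full dataset under the assumption of uniform sampling.

```scala
import scala.util.Random

object PipelineSketch {
  final case class Restaurant(name: String, city: String, category: String)

  // Hypothetical in-memory stand-in for a sample -> clean -> query chain.
  final class Pipeline[A](val rows: Vector[A], val totalRows: Int) {
    // Keep each row independently with probability `ratio` (seeded for repeatability).
    def sample(ratio: Double, seed: Long = 42L): Pipeline[A] = {
      val rng = new Random(seed)
      new Pipeline(rows.filter(_ => rng.nextDouble() < ratio), totalRows)
    }

    // Apply a row-by-row cleaning function to the rows currently held.
    def clean(fix: A => A): Pipeline[A] = new Pipeline(rows.map(fix), totalRows)

    // Scale the fraction of matching rows in the sample up to the full dataset size.
    def estimateCount(pred: A => Boolean): Double =
      if (rows.isEmpty) 0.0 else totalRows.toDouble * rows.count(pred) / rows.size
  }

  def load[A](rows: Seq[A]): Pipeline[A] = new Pipeline(rows.toVector, rows.size)
}
```

A call such as `PipelineSketch.load(data).sample(0.1).clean(r => r.copy(city = r.city.toLowerCase)).estimateCount(_.city == "la")` mirrors the shape of the slide's first example, with only the sampled rows being cleaned before the query runs.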
  37. Outline • Research Overview • SampleClean Data Cleaning Library • Summary
  38. Summary • Statistical validity IS a data management problem. • The design of systems and frameworks can encourage violation of technical assumptions. • In many cases…relatively straightforward ways to correct for these issues. SIGMOD 14, VLDB 15a, VLDB 15b, IEEE Data Eng. Bul. 2015, SIGMOD 16*
  39. Take Our Survey: sampleclean.org/survey
