Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

SampleClean: Bringing Data Cleaning into the BDAS Stack

3,453 views

Published on

"SampleClean: Bringing Data Cleaning into the BDAS Stack" presentation at AMPCamp 5 by Sanjay Krishnan and Daniel Haas

Published in: Software
  • Be the first to comment

SampleClean: Bringing Data Cleaning into the BDAS Stack

  1. 1. SampleClean: Bringing Data Cleaning into the BDAS Stack! Sanjay Krishnan and Daniel Haas! In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg !
  2. 2. Who publishes more? ! ! ! 2
  3. 3. Microsoft Academic Search! ! ! Paper Id! Affiliation! 16! Computer Science Division--University of California Berkeley CA! 101! University of California at Berkeley! 102! Department of Physics Stanford ! University California! 116! Lawrence Berkeley National Labs! <ref>California</ref>! 3
  4. 4. Microsoft Academic Search! ! ! Paper Id! Affiliation! 16! Computer Science Division--University of California Berkeley CA! 101! University of California at Berkeley! 102! Department of Physics Stanford ! University California! 116! Lawrence Berkeley National Labs! <ref>California</ref>! X 4
  5. 5. Microsoft Academic Search! ! ! University of California at Berkeley! Computer Science Division! University of California at Berkeley! Department of Physics Stanford ! University California! 5
  6. 6. • Data cleaning in BDAS.! – Problem 1. Scale! – Problem 2. Latency! ! • Sampling to cope with scale.! • Asynchrony to cope with latency.! ! Enter SampleClean! 6
  7. 7. Now it’s your turn!! Be the crowd and help us decide! ! ! 7
  8. 8. Dirty Data is Ubiquitous! 8! Example: Missing, incomplete, inconsistent data!
  9. 9. Data Cleaning is Hard! 9 Time consuming!
  10. 10. Data Cleaning is Hard! 10 Time consuming! Costly!
  11. 11. Data Cleaning is Hard! 11 Time consuming! Costly! Domain-specific!
  12. 12. Data Cleaning is Hard! 12 Time consuming! Costly! Domain-specific!
  13. 13. A New Data Cleaning Architecture! Analy0cs 13 Data Data Cleaning
  14. 14. A New Data Cleaning Architecture! Analy0cs 14 Data Cleaning Data
  15. 15. Can it Scale?! People are slow and expensive! Crowd Machine Learning Regex Time 15
  16. 16. Insight 1: Asynchrony Hides Latency! 16
  17. 17. Insight 2: Sampling Hides Scale! Query ! Error! BlinkDB! Time! 17
  18. 18. Insight 2: Sampling Hides Scale! Query ! Error! Time! Data Error BlinkDB! 18
  19. 19. Insight 2: Sampling Hides Scale! Query ! Error! Time! Data Error SampleClean! BlinkDB! 19
  20. 20. SampleClean Data Flow! Dirty Data Dirty Sample Query Clean Sample Data Cleaning 20 Sampling Asynchrony
  21. 21. SampleClean Data Flow! Query Clean Sample Data Cleaning Asynchrony 21
  22. 22. The SampleClean Architecture! Data Cleaning Library Issue Queries, ! Get Results! Approximate Asynchronous Query Processing Pipelines Clean Sample Declare Cleaning ! Operations! Dirty Sample 22
  23. 23. The SampleClean Architecture! Data Cleaning Library Issue Queries, ! Get Results! Approximate Asynchronous Query Processing Pipelines Clean Sample Declare Cleaning ! Operations! Dirty Sample 23
  24. 24. Approximate Query Processing! • Estimate early results and bound with error bars! Query ! Error! Time! SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014! ! BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013! 24
  25. 25. The SampleClean Architecture! 25 Issue Queries, ! Get Results! Approximate Asynchronous Query Processing Pipelines Clean Sample Declare Cleaning ! Operations! Dirty Sample Data Cleaning Library
  26. 26. Crowds and Machines Work Together! • Extensible library of data cleaning tools! • Tools are:! – Automated! – Human-powered! – Hybrid! ! Crowd Machine Learning Regex Time 26
  27. 27. Active Learning and Crowds! • Choose informative training points! Not ! Informative! Are these the same?! Stanford Department of IEOR! ! UC Berkeley Stats! ! ¢ Yes ! ¤ No! Informative! Are these the same?! Department of Mathematics Stanford University! ! University of California Berkeley Department of Mathematics! ! ¢ Yes ! ¤ No! 27
  28. 28. Active Learning and Crowds! • Choose informative training points! Not ! Informative! Are these the same?! Stanford Department of IEOR! ! UC Berkeley Stats! ! ¢ Yes ! ¤ No! Informative! Are these the same?! Department of Mathematics Stanford University! ! University of California Berkeley Department of Mathematics! ! ¢ Yes ! ¤ No! 28
  29. 29. The SampleClean Architecture! 29 Data Cleaning Library Issue Queries, ! Get Results! Clean Sample Declare Cleaning ! Operations! Dirty Sample Approximate Asynchronous Query Processing Pipelines
  30. 30. Putting it all together: Asynchronous Pipelines! • Users group data cleaning operations into pipelines! 30
  31. 31. The SampleClean Architecture! Data Cleaning Library Issue Queries, ! Get Results! Approximate Asynchronous Query Processing Pipelines Clean Sample Declare Cleaning ! Operations! Dirty Sample 31
  32. 32. Great, Now What?! • Prototype implementation complete!! • Significant research challenges remain:! • Crowd worker performance and quality! • Pipeline semantics and optimization! • Programming model and interface! ! • Open source release targeted for next year! 32
  33. 33. Summary! • Data Cleaning is slow, costly, and domain-specific! • SampleClean brings data cleaning into the BDAS stack ! • SampleClean uses asynchrony to hide latency, and sampling to hide scale! • SampleClean combines Algorithms, Machines, and People, all in one system! 33
  34. 34. Asynchrony in Spark! • The Spark abstraction: blocking BSP! • So how do we achieve asynchrony?! • Multithreaded master! • Intermediate results materialized in Hive! • Standalone Finagle HTTP server for crowd work! ! 34

×