Fountain of Youth or Polluted Swamp: Is your data lake revitalizing your business or eroding the foundation?

by Vincent Yates
Director of Analytic Engineering at Zillow Group

We’ve all been promised the Shangri-La that is the data lake: more data means more insights. Synergy! But has it really panned out? The trouble is that data lakes are more like the early days of the internet than a panacea of pristine, useful information. Anyone can publish data, and even with the best of intentions, priorities shift, people leave, and ultimately the priceless data become worthless. Those data may have been reliable when they were first published but are now wrong. Yet, as with many stale webpages, there is no way to tell, and the business continues to rely on those wrong data to make decisions. We at Zillow faced the same problem and decided to change it. I will describe the tools we’ve built and the tenets behind our team to help you ensure your lake rejuvenates your organization. Einstein said it best: “Whoever is careless with the truth in small matters cannot be trusted with important matters.”
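One way to make staleness visible is to record, for each published dataset, when it was last updated and how often it is expected to refresh, then flag anything that has fallen behind. The following is a minimal sketch under those assumptions, not a description of the tools Zillow built; the catalog layout, the dataset names, and the `find_stale` helper are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical catalog: dataset name -> (last update time, expected refresh interval).
def find_stale(catalog, now, grace=1.5):
    """Return the names of datasets whose last update is older than
    grace * expected refresh interval, sorted alphabetically."""
    stale = []
    for name, (last_updated, interval) in catalog.items():
        if now - last_updated > interval * grace:
            stale.append(name)
    return sorted(stale)

# Illustrative data only; none of these datasets come from the talk.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
catalog = {
    "listings_daily": (datetime(2024, 1, 9, 12, tzinfo=timezone.utc), timedelta(days=1)),
    "tax_records": (datetime(2023, 6, 1, tzinfo=timezone.utc), timedelta(days=30)),
}
print(find_stale(catalog, now))
```

The `grace` multiplier is a design choice: it tolerates a refresh that runs somewhat late without raising an alert, while still catching data whose owners have moved on.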

Published in: Technology

  1. ZILLOW | TRULIA | STREETEASY | HOTPADS | NAKED APARTMENTS. Vincent Yates, Director of Analytics Engineering, @VincentYates8. FOUNTAIN OF YOUTH OR POLLUTED SWAMP: IS YOUR DATA LAKE REVITALIZING YOUR BUSINESS OR ERODING THE FOUNDATION?
  2. One of these is worth $42,000 more. Finished sq-ft: 2,602 vs. 2,602. Lot size: 4,400 vs. 5,342. Bathrooms: 3 vs. 3. Bedrooms: 4 vs. 4. Year built: 2004 vs. 2005. Sale price: 861,000 vs. 819,000.
  3. One of these is worth $164,000 more. Finished sq-ft: 1,620 vs. 1,620. Lot size: 1,620 vs. 1,620. Bathrooms: 2.5 vs. 3. Bedrooms: 3 vs. 3. Year built: 2007 vs. 2007. Sale price: 499,000 vs. 663,000.
  4. One of these is worth >$10M annually. http://www.exp-platform.com/Pages/SevenRulesofThumbforWebSiteExperimenters.aspx
  5. DATA SCIENCE’S DIRTY LITTLE SECRET
  6. $3.1 TRILLION (IBM Big Data Hub)
  7. Unknowns ≠ Seasonality
  8. Seriously, DATA SCIENCE IS HARD
  9. Product & Communication, Programming, Statistics
  10. 24% of data scientists: UNSURE OF HOW MUCH OF THEIR DATA ARE INACCURATE (IBM Big Data Hub)
  11. Errors Propagate in Dynamic Ways
  12. (no text on slide)
  13. 66% of data scientists: CLEANING DATA IS THE MOST TIME-CONSUMING TASK (CrowdFlower 2015 Data Science Report)
  14. My data is pretty good. DOES IT REALLY MATTER?
  15. 52.3% of data scientists: POOR DATA QUALITY IS THEIR BIGGEST HURDLE (CrowdFlower 2015 Data Science Report)
  16. The cost of poor data quality: 15-25% OF OPERATING PROFIT (Data Quality: The Accuracy Dimension, Morgan Kaufmann)
  17. Someone would have noticed and fixed it. HOW DID WE GET HERE?
  18. Cracks start to show under pressure. Operational, Integration, Replication. (Data Quality: The Accuracy Dimension, The Morgan Kaufmann Series in Data Management Systems)
  19. Complexity/Agility is the scapegoat. Transaction applications, APIs, third-party data producers; Transaction databases; Data Marts; Data Lake
  20. Complexity/Agility is the scapegoat. Transaction applications, APIs, third-party data producers; Transaction databases; Data Marts; Data Lake
  21. Complexity/Agility is the scapegoat. Transaction applications, APIs, third-party data producers; Transaction databases; Data Marts; Data Lake
  22. Complexity/Agility is the scapegoat
  23. Complexity/Agility is the scapegoat
  24. Complexity/Agility is the scapegoat. Data Marts; Data Lake
  25. Moral Hazard is the culprit
  26. HOW DO WE GET OUT? A few simple tricks to head in the right direction
  27. PROACTIVE NOT REACTIVE. A data scientist is not great under duress
  28. Get Back to Raw Data
  29. Centralize Definitions
  30. Model Where Possible
  31. MODELING IS HARD. Build tools to make reactive easier
  32. (no text on slide)
  33. (no text on slide)
  34. Data Problems are as Old as Data
  35. Many mistakes are required for catastrophe
      • Climate caused more icebergs – Ignored forecasts
      • Tides sent icebergs southward – Poor/wrong measurement
      • The ship was going too fast – Business needs over best data
      • Iceberg warnings went unheeded – Data was disregarded for intuition
      • The binoculars were locked up – Tools were behind lock and key
      • The steersman took a wrong turn – Reactive action under stress led to wrong decisions
      • The iron rivets were too weak – Cost savings over best data
      • There were too few lifeboats – Marketing owned the message
      http://cosmiclog.nbcnews.com/_news/2012/04/01/10970732-10-causes-of-the-titanic-tragedy
  36. VincentYa@zillowgroup.com @VincentYates8 THANK YOU!
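
The advice on slide 29, "Centralize Definitions", can be illustrated with a single shared registry that every consumer computes a metric through, so that two teams cannot silently diverge on what a number means. This is a minimal sketch under assumed names, not the approach presented in the talk: the `METRICS` registry, the `metric` decorator, and the `price_per_sqft` metric are hypothetical, and the input values are taken from the first home on slide 3.

```python
# Hypothetical shared registry: each metric is defined exactly once, in one place.
METRICS = {}

def metric(name):
    """Register a metric function under a unique name; reject duplicate definitions."""
    def register(fn):
        if name in METRICS:
            raise ValueError(f"metric {name!r} is already defined")
        METRICS[name] = fn
        return fn
    return register

@metric("price_per_sqft")
def price_per_sqft(home):
    # Assumed definition for illustration: sale price divided by finished square feet.
    return home["sale_price"] / home["finished_sqft"]

def compute(name, home):
    """Look up a metric by name so every caller uses the same definition."""
    return METRICS[name](home)

# First home from slide 3: 1,620 finished sq-ft, sale price 499,000.
home = {"sale_price": 499_000, "finished_sqft": 1_620}
print(round(compute("price_per_sqft", home), 2))
```

Raising on a duplicate name is deliberate: a second, conflicting definition fails loudly at import time instead of producing two different answers downstream.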