Anton Lytunenko "Data Lake. Make data pleasant to swim in"

BigData & Data Engineering


  1. Data Lake: Make data pleasant to swim in
  2. Speaker
     Anton Lytunenko, Chief Architect at GreenM
     Contacts: alytunenko@greenm.org
     https://www.instagram.com/greenm.rocks/
     https://greenm.io
  3. Ideal world (diagram): operational storages (App/Module 1, 2, 3) feed a Data Warehouse ("one version of the truth", reporting oriented), which serves Data Mart 1, 2, 3.
  4. Real world
  5. From an engineering point of view
     • Multiple independent apps/modules (data stores)
     • Data is not consistent across apps
     • Heterogeneous storage types
     • Different historical data storage policies
     • Each module is owned by a separate team
  6. From a business point of view
     • Same domain
     • Same customer base (at least partially)
     • We have all the data necessary to provide more sophisticated reporting
  7. That results in requests like
     • Can I get a couple of custom reports for a sales demo?
     • We need more visibility across the board
     • Hospital X is our major customer, let's give them an integral insight
  8. Oh, that is simple! We need MORE ETLs
  9. I like it, but…
  10. Major issues with the ETL-based approach
      • Long feedback loop for customer-facing development (demos, custom reports, on-demand requests)
      • Consumes a lot of engineering resources
      • Code-base growth
      • Unnecessary load on production data storages
      • High cost
  11. It seems we need a generic data source
      • Should contain all of the data
      • SQL-like interface
      • Fast enough for on-demand analysis
      • Cheap
  12. Solution options – Data Warehouse
      Pros:
      • Storage with a predefined structure
      • One version of the truth
      • Contains all the data
      • SQL-based interface
      Cons:
      • Long time to implement
      • High cost
      • Requires new development to support new data
  13. Solution options – Data lake (diagram): a BLOB store with a SQL-based engine on top.
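
As a minimal sketch of that pairing (the deck names neither the store nor the engine; Parquet files in S3 queried through Spark SQL is one common combination, and the bucket, path, and column names below are invented for illustration):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

      # The BLOB-store half: raw Parquet files dumped into object storage.
      feedback = spark.read.parquet("s3a://example-lake/raw/feedback/")

      # The SQL-engine half: expose the files as a table and query them ad hoc.
      feedback.createOrReplaceTempView("feedback")
      spark.sql("""
          SELECT hospital_id, COUNT(*) AS responses
          FROM feedback
          GROUP BY hospital_id
          ORDER BY responses DESC
      """).show()

The point of the architecture is that the storage layer is just files, so anything that can write a file can feed the lake, and any SQL-speaking engine can sit on top.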
  14. Solution options – Data lake
      Pros:
      • Fast to implement
      • Cheap
      • Contains all the data and more
      • SQL-based interface
      • Easy to add new data sources
      Cons:
      • Semi-structured data
      • Can’t be used as a source of truth
      • Data consistency is poor
  15. Product
      Domain: Healthcare
      Product: Provide deep insight into patient experience
      Project goals:
      1. Gather patient feedback
      2. Provide deep insight into that feedback by means of reporting
  16. Product pipeline (diagram): Data ingestion (DI v1-v3) → Sampling → Outreach (v1-v3) → Feedback → Reporting (Data Mart v1-v3).
  17. Processing pipeline
      Pipeline stage  Active versions  Storage stack                           Throughput / DB size
      Data Ingestion  3                MySQL, MS SQL 2012 R2                   ~2000K/day; ~100 GB
      Sampling        1                MS SQL 2005, MS SQL 2008                ~2000K/day
      Outreach        3                MS SQL 2008, MS SQL 2012 R2, DynamoDB   ~300K/day; ~600 GB
      Feedback        1                MS SQL 2012 R2, DynamoDB                ~60K/day; ~100 GB
      Reporting       3                MS SQL 2012 R2, Vertica 7.x             Full refresh daily; ~300 GB MS SQL, ~300 GB Vertica
  18. Here comes The Data Lake
  19. Stage 1 – Fill the lake
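
The transcript keeps only this slide's title. Under the constraints the deck states elsewhere (daily full refresh from the operational SQL stores into BLOB storage), "filling the lake" can look roughly like the sketch below; the connection string, table names, and bucket are hypothetical, and writing to s3:// paths assumes s3fs is installed:

      import pandas as pd
      from sqlalchemy import create_engine

      # Hypothetical source: one of the operational MS SQL stores from slide 17.
      engine = create_engine("mssql+pyodbc://user:password@outreach-dsn")

      for table in ["campaigns", "messages", "responses"]:  # illustrative names
          df = pd.read_sql_table(table, engine)
          # One folder per table in the raw zone; Parquet keeps the schema
          # alongside the data, which helps with the lake's loose structure.
          df.to_parquet(f"s3://example-lake/raw/outreach/{table}.parquet", index=False)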
  20. Stage 1 – Config
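
The config slide itself is a screenshot that does not survive in the transcript. Given the deck's later point that new data sources should not require engineering work, a plausible shape for it is one declarative entry per source store; every name, connection string, and path below is invented:

      # Hypothetical Stage 1 config: adding a new data source to the lake
      # means adding an entry here, not writing new pipeline code.
      SOURCES = [
          {
              "name": "data_ingestion_v3",
              "type": "mysql",
              "connection": "mysql+pymysql://user:password@di-host/di_db",
              "tables": ["events", "patients"],
              "target": "s3://example-lake/raw/data_ingestion/",
              "schedule": "daily_full_refresh",
          },
          {
              "name": "outreach_v3",
              "type": "mssql",
              "connection": "mssql+pyodbc://user:password@outreach-dsn",
              "tables": ["campaigns", "messages"],
              "target": "s3://example-lake/raw/outreach/",
              "schedule": "daily_full_refresh",
          },
      ]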
  21. Stage 1 – Results
      Pros:
      • Implemented in about a month by a team of 2
      • The data is there
      • Cheap
      • Fast enough
      • The analysts team can start playing with it
      Cons:
      • Data quality
      • Data is only up to yesterday (daily batch)
      • Overall pipeline performance
  22. Stage 2 – Make it pleasant to swim in (diagram: operational storages → one version of the truth → reporting oriented)
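
Stage 2 is where the raw dumps become "pleasant to swim in": cleaned, consistently named datasets derived with plain SQL, which is what let the analysts team own this stage. A sketch, assuming the same Spark-over-S3 setup as the earlier example and using invented table and column names:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("stage2-curation").getOrCreate()

      spark.read.parquet("s3a://example-lake/raw/feedback/") \
           .createOrReplaceTempView("raw_feedback")

      curated = spark.sql("""
          SELECT
              patient_id,
              hospital_id,
              CAST(submitted_at AS DATE) AS feedback_date,
              TRIM(LOWER(channel))       AS channel   -- normalize inconsistent values
          FROM raw_feedback
          WHERE patient_id IS NOT NULL                -- drop rows that cannot be joined
      """)

      # Publish into a curated zone that downstream reports read from.
      curated.write.mode("overwrite").parquet("s3a://example-lake/curated/feedback/")

Because the whole step is SQL plus a read/write path, producing a new dataset is an analyst task rather than an engineering one, which is exactly the result the next slide reports.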
  23. Stage 2 – Results
      Pros:
      • Implemented by the analysts team with the support of 1 engineer
      • Data quality is there
      • Doesn't require engineering involvement to produce new data sets
      Cons:
      • Data is still only up to yesterday
      • Overall pipeline performance
  24. Applications
      • Integral view of the platform
      • Exploration playground for the analytics team
      • Great basis for Data Mart implementation
      • Debugging and troubleshooting
  25. What next?
      • Integration with NoSQL storages
      • Incremental load (a sketch follows below)
      • Streaming directly to the DL
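
Of these, incremental load is the most mechanical change: replace the full daily refresh with a watermark-driven delta pull. A sketch under stated assumptions; the watermarks.json state file, the updated_at column, the feedback table, and all paths are invented for illustration:

      import json
      import pandas as pd
      from sqlalchemy import create_engine, text

      engine = create_engine("mssql+pyodbc://user:password@outreach-dsn")

      # Last successfully loaded timestamp per table, persisted between runs.
      with open("watermarks.json") as f:
          watermarks = json.load(f)

      since = watermarks.get("feedback", "1970-01-01")
      df = pd.read_sql(
          text("SELECT * FROM feedback WHERE updated_at > :since"),
          engine,
          params={"since": since},
      )

      if not df.empty:
          stamp = pd.Timestamp.now(tz="UTC").strftime("%Y%m%dT%H%M%S")
          df.to_parquet(f"s3://example-lake/raw/feedback/delta_{stamp}.parquet", index=False)
          # Advance the watermark only after the delta is safely in the lake.
          watermarks["feedback"] = str(df["updated_at"].max())
          with open("watermarks.json", "w") as f:
              json.dump(watermarks, f)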
  26. Summary
      • The DL core is a BLOB storage plus a SQL-based engine
      • A DL will not resolve data quality issues automatically
      • A DL is not a replacement for a Data Warehouse / Data Mart
      • There should be a strong use case
      • It provides a lot of possibilities for exploratory use cases
      • It gives you a cheap and performant-enough playground for building new reports
      • It is easily extendable with new data
  27. QUESTIONS?
