
TechEvent Building a Data Lake



Building a Data Lake - Lessons learned

Published in: Technology


  2. Markus Heinisch (Build a Data Lake, 01.10.2018)
     Principal Consultant and Discipline Manager
     Focus:
     – Agile methods
     – Project management
     – Architecture and technology (current topics: Serverless, Angular, NodeJS, Java)
     – Coaching
  3. Florian Feicht
     Team Leader, Senior Consultant and Trainer at Trivadis GmbH
     Focus:
     – Agile projects
     – Cloud technologies
     – Big Data administration
     – Oracle DBA / Linux administration
  4. Agenda
     1. Project overview
     2. Data Lake definition
     3. Start of the project
     4. Challenges
     5. Risks
     6. Conclusion
  5. About the project
  6. Project (I)
     Technical
     – Build a data lake
     – Around 720 data sources
     – Confidential and secret data
     – Public cloud as target platform
     Organizational
     – Scrum-oriented organization
     – Very limited time from the Product Owner
  7. Project (II)
     Two projects in one
     – Build the data lake
     – Ensure that the first consumer is happy
     – Connect the data sources in a POC state if necessary
  8. Data Lake definition
  9. Data Lake definition
     Wikipedia: "A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning."
     A second definition from Microsoft Azure was also cited.
  10. What do you store in a Data Lake? (I)
     – Structured data
     – Semi-structured data
  11. What do you store in a Data Lake? (II)
     – Unstructured data
     – Binary data
  12. Trivadis Data Lake Reference Architecture (diagram)
  13. Start of the project
  14. Start the project – Requirements
     – Start as soon as possible
     – Build a "cloud native" application on premises
     – Lift and shift to the public cloud should be possible
  15. Start the project – Used solution
     – Use technologies that are also available in the cloud
     – Use only as much orchestration as absolutely necessary
     – Containerize all your components
  16. Challenges – the "easy" part
  17. Raw data storage – Requirements
     – Scalable storage
     – "Cheap" storage
     – Cloud-ready
     – Performance depends on your design and needs
  18. Raw data storage (diagram)
  19. Raw data storage – Used solution: object store
     – Scalable and cheap storage
     – S3-compliant object stores are available on most public clouds
     – An on-premises object store was already up and running
     – Be careful, because an object store
       • is not a file system
       • has no POSIX attributes
       • has no good file browser available
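Because an object store is not a file system, "directories" exist only as key prefixes. A minimal, stdlib-only sketch of a partitioned key layout for raw data; the `raw/` prefix scheme and the source-system name are illustrative assumptions, not the project's actual layout:

```python
from datetime import date

def raw_object_key(source_system: str, ingest_date: date, filename: str) -> str:
    """Build a flat object-store key. The '/' separators are only a naming
    convention: S3-style stores have no real directories or POSIX attributes."""
    return (
        f"raw/{source_system}"
        f"/year={ingest_date.year:04d}/month={ingest_date.month:02d}"
        f"/day={ingest_date.day:02d}/{filename}"
    )

key = raw_object_key("crm", date(2018, 10, 1), "customers.csv")
print(key)  # raw/crm/year=2018/month=10/day=01/customers.csv
```

With an S3-compliant client such as boto3, the key would then be used as-is, e.g. `put_object(Bucket="datalake", Key=key, Body=data)`; the bucket name here is hypothetical.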
  20. Data ingest – Requirements
     – Tool-based dataflow management
     – Flexible enough to get data from many source systems
     – Queue data before the ingest process
     – Take care of the history of the data
  21. Data ingest – Used solution: StreamSets
     – Also used in other Trivadis projects
     – Quite stable
     – Complex to control multiple pipelines without a license; workarounds necessary
     Alternatives: Talend, Apache NiFi, …
  22. Data access – Requirements (access layer)
     – Explorative access
     – High-performance access (Web UI)
     – An SQL-like query language, if possible
     – Reachable via a REST API
     – Analytics (e.g. Spark) not necessary in the first step
     – Security (authentication and authorization)
  23. Data access – Used solution: Data Lake
     Explorative access via Apache Drill
     • Schema-less SQL engine
     • Structured and semi-structured data
     • REST API
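Drill's REST API accepts a SQL query as a small JSON document POSTed to its web port. A stdlib-only sketch of that call; the host, the `s3` storage-plugin name, and the file path are assumptions that depend on the cluster configuration:

```python
import json
from urllib import request

DRILL_URL = "http://localhost:8047/query.json"  # Drill's default web port; adjust for your cluster

def drill_payload(sql: str) -> bytes:
    """Drill's REST API expects a JSON body with 'queryType' and 'query'."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

def run_query(sql: str) -> dict:
    """POST the query to a running Drillbit and return the parsed JSON result."""
    req = request.Request(
        DRILL_URL,
        data=drill_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Query a JSON file in an (assumed) S3-backed workspace; needs a running Drillbit.
    print(run_query("SELECT * FROM s3.`raw/crm/customers.json` LIMIT 10"))
```

Because Drill reads the files schema-free, the same endpoint covers structured and semi-structured data without an upfront ETL step.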
  24. Challenges – the "a bit complex" part
  25. Data access – Used solution: first consumer
     High performance, low latency; the tools used are very use-case specific
     Kafka + MongoDB + RestHeart
     • Mainly JSON documents
     • Very fast access to the data
     • RestHeart provides a REST API for MongoDB
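Since this path mostly moves JSON documents, the consumer's core work is shaping a Kafka message value into a MongoDB document that RestHeart can then serve over HTTP. A stdlib-only sketch of that shaping step; the `source`/`event_id` field names and the `_id` scheme are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def to_mongo_document(kafka_value: bytes) -> dict:
    """Parse a Kafka message value (JSON bytes) into a MongoDB document.

    Deriving _id from a stable business key makes the write idempotent:
    re-reading the same Kafka offsets upserts instead of duplicating.
    """
    event = json.loads(kafka_value)
    doc = dict(event)
    doc["_id"] = f"{event['source']}:{event['event_id']}"  # assumed fields
    doc["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return doc

msg = b'{"source": "crm", "event_id": "42", "payload": {"name": "Alice"}}'
doc = to_mongo_document(msg)
print(doc["_id"])  # crm:42
```

With pymongo, the consumer would then write `collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)`, and RestHeart exposes that collection as a REST resource without extra application code.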
  26. Detailed architecture (diagram): a data governance layer (master data application, data catalog, business glossary) synced via APIs with the big data storage and processing layer (S3 object storage, HDFS, Apache Kudu, Spark, Drill, Kafka, Zeppelin or Jupyter, StreamSets, cronjob scheduling, running on IaaS/containers with Docker Compose); data flows in from sources such as RDBMS, files, ERP systems, APIs, wikis and connected cars, and out to a consumer-facing data lake with an access tier (Tomcat, HA-Proxy, Tivoli Access Manager) and a storage tier (MongoDB, MySQL, Elasticsearch, S3 storage, RestHeart) serving consumer applications (API, dashboard, mobile, microservices).
  27. Open source tools – checklist
     – Is the chosen tool not only cool, but does it also fit your environment?
     – Is good documentation maintained?
     – Is the project still active (frequent releases)?
     – Is there an active community?
     – Is enterprise support available?
     – What about security?
  28. Challenges – the "complex" part
  29. Source systems – Requirements
     – Connect hundreds of systems
     – Mostly legacy systems with poor interfaces
     – Legal restrictions on consuming sensitive data
  30. Source systems: never underestimate their organizational and technical complexity
  31. Source systems – Proposed solution
     – Very good organization and support from your customer is key
     – Request access and accounts as early as possible
     – Ask whether an interface already used for other use cases exists
     – Business knowledge helps a lot
  32. Conclusion
  33. Conclusions
     – Ask your customer what their interpretation of a "Data Lake" is
     – The biggest problem is integrating the source systems
     – (Free) open source tools are nice, but verify that they are enterprise-ready
     – There is no one-size-fits-all data lake implementation
  34. Markus Heinisch, Florian Feicht