In Data Engineer’s Lunch #5: What is a Data Lake?, we discuss what data lakes are, why we need them, how we get data in and out, and different implementations of data lakes. Additional resources can be found in the accompanying blog and SlideShare linked below!
Accompanying Blog: https://blog.anant.us/data-engineers-lunch-5-what-is-a-data-lake/
Accompanying Recording: https://youtu.be/1z3qZVY9aWU
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
3. What is a data lake?
● Data forever in one place
● Raw data stored in objects or files.
○ Structured from relational databases
■ csv
■ tsv
○ Semi-Structured (csv, logs, xml, json)
○ Unstructured data (emails, documents, PDFs)
○ Binary data (images, video, audio)
● On Premise or Cloud
○ Distributed file / object storage (S3/HDFS/Min.io/DSEFS)
○ Min.io
○ CEPH
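As a minimal sketch of "raw data stored in objects or files", the snippet below lands a payload unchanged under a date-partitioned path. A local directory stands in for the object store, and the `raw/<source>/<YYYY>/<MM>/<DD>/` layout, the `land_raw` helper, and the sample file are illustrative assumptions, not something prescribed in the talk.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(root: Path, source: str, name: str, payload: bytes, day: date) -> Path:
    """Write payload unchanged under raw/<source>/<YYYY>/<MM>/<DD>/<name>.

    Hypothetical helper: the key layout is an assumption for illustration.
    """
    target = root / "raw" / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing or cleaning at landing time
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = land_raw(
            Path(tmp), "crm", "contacts.csv", b"id,name\n1,Ada\n", date(2020, 1, 6)
        )
        print(path.relative_to(tmp).as_posix())
```

The same key scheme would carry over to a bucket in S3, Min.io, or CEPH, with the directory write swapped for that store's client.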
4. Why do we need a data lake?
● Can finally do cool stuff with data science
○ Get data into a Data lake
○ Data engineering / wrangling to clean the data
○ Save it back to the data lake
● From: Will Angel
○ Executive memory problem: Many people don't understand that a data lake can just be BigQuery these days. Data lake / data warehouse triggers a lot of PTSD in executives who have lived through bad data lake/warehouse projects and don't understand that the cost and complexity have come down a lot.
● Question from Will Angel
○ Garbage in Garbage Out: How do we avoid our data lakes turning into data swamps?
○ Answer from Nirmal
■ Stream data in via Kafka (requires some filtration)
■ Leverage a data catalog (metadata, schema, name)
○ Other ideas
■ Different data lakes for ingestion, cleaner data, not quite a warehouse
■ Dataset identification / governance
■ Use Databricks bronze/silver/gold terminology
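The loop described above (get data into the lake, clean it, save it back) together with the catalog and bronze/silver ideas can be sketched roughly as follows. This is only an illustration: a dict stands in for the lake, and `CatalogEntry`, `clean_contacts`, the zone names as keys, and the sample rows are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record holding name, zone, schema, and metadata."""
    name: str
    zone: str  # "bronze" (raw, as ingested) or "silver" (cleaned)
    schema: dict
    metadata: dict = field(default_factory=dict)


def clean_contacts(raw_csv: str) -> list:
    """Drop rows without a numeric id, normalize emails, de-duplicate on id."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        row_id = (row.get("id") or "").strip()
        if not row_id.isdigit() or row_id in seen:
            continue  # garbage or duplicate: keep it out of the cleaned zone
        seen.add(row_id)
        email = (row.get("email") or "").strip().lower()
        cleaned.append({"id": int(row_id), "email": email})
    return cleaned


if __name__ == "__main__":
    raw = (
        "id,email\n"
        "1, Ada@Example.com \n"
        "1,ada@example.com\n"
        ",missing@example.com\n"
        "2,bob@example.com\n"
    )
    lake = {"bronze/contacts": raw}  # dict standing in for object storage
    lake["silver/contacts"] = clean_contacts(lake["bronze/contacts"])
    catalog = [
        CatalogEntry("contacts", "bronze", {"id": "str", "email": "str"}),
        CatalogEntry("contacts", "silver", {"id": "int", "email": "str"},
                     {"derived_from": "bronze/contacts"}),
    ]
    print(lake["silver/contacts"])
    print([f"{entry.zone}/{entry.name}" for entry in catalog])
```

Keeping the raw copy in the bronze zone means the cleaning rules can be changed and rerun later, and the catalog entry records where each cleaned dataset came from.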
5. How do we get data into and out of a data lake?
● Ingress
○ Extract Load Transform (ELT)
○ Extract Transform Load (ETL)
○ Stream into it (Kafka, Spark streaming, Flink, Alpakka)
○ Batch into it (*, Spark, MapReduce, etc.)
● Egress
○ Integration to query engines out of the box
■ Cloud:
■ Snowflake (Storage: S3 / Azure Storage; Query: Snowflake Query Language)
■ Google BigQuery (Storage: Google Storage; Query: BigQuery)
■ Azure Data Analytics (Storage: Azure Storage; Query: Azure Data Analytics)
■ Amazon Redshift Spectrum (Storage: S3; Query: SQL)
■ Amazon Athena
■ Amazon Glue
■ Open Source:
■ Presto
■ Hive
■ SparkSQL / Spark
○ Stream out of it (Spark streaming, Flink, Kafka, Alpakka)
○ Batch out of it (*, Spark, MapReduce, etc.)
○ Extract Load Transform (ELT)
○ Extract Transform Load (ETL)
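The difference between ETL and ELT listed above comes down to where the transform step runs relative to the load. A minimal sketch, assuming a dict as the lake and hypothetical `extract`, `transform`, `etl`, and `elt` helpers with made-up sample rows:

```python
def extract() -> list:
    """Stand-in for reading records from a source system (sample data)."""
    return [
        {"user": " Ada ", "amount": "10.50"},
        {"user": "Bob", "amount": "n/a"},
    ]


def transform(rows: list) -> list:
    """Keep rows whose amount parses as a number and normalize the fields."""
    out = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that cannot be parsed
        out.append({"user": row["user"].strip(), "amount": amount})
    return out


def etl(lake: dict) -> None:
    """ETL: transform before loading, so only curated data reaches the lake."""
    lake["curated/orders"] = transform(extract())


def elt(lake: dict) -> None:
    """ELT: load the raw data first, then transform it inside the lake."""
    lake["raw/orders"] = extract()
    lake["curated/orders"] = transform(lake["raw/orders"])


if __name__ == "__main__":
    etl_lake, elt_lake = {}, {}
    etl(etl_lake)
    elt(elt_lake)
    print(sorted(etl_lake))  # keys present after ETL
    print(sorted(elt_lake))  # keys present after ELT
```

Both paths end with the same curated dataset; ELT additionally keeps the raw copy in the lake, which fits the "raw data stored in objects or files" definition and lets the transform be rerun later.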
6. Implementations
● Original (On Premise)
○ HDFS
○ SAN/NAS
● Open Source
○ Object Storage
■ Min.io
■ CEPH
○ Structured / Formatted Files
■ Parquet
■ JSON
■ CSV
■ XML
■ Delta Lake (Parquet)
○ Structured / Databases
■ BigTable
■ Cassandra
● Cloud
○ S3 / Amazon Athena
○ Azure Data Lake
○ Google Storage / BigQuery
○ Snowflake
○ Databricks
7. Resources
● Data lake - Wikipedia
https://en.wikipedia.org/wiki/Data_lake
● Three Reasons to Build a Security Data Lake | by Omer Singer | Medium
https://medium.com/@osinger/three-reasons-to-build-a-security-data-lake-75d74ff10c6a
● Introduction to Azure Data Lake - DZone Big Data
https://dzone.com/articles/introduction-to-azure-data-lake
● What Is a Data Lake and Why Is It Essential for Big Data?
https://learn.g2.com/what-is-a-data-lake
● What is a data lake?
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
● Cloud Storage as a data lake | Architectures | Google Cloud
https://cloud.google.com/solutions/build-a-data-lake-on-gcp
● Netflix/metacat
https://github.com/Netflix/metacat
8. Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037