Version 1.0
What is a Data Lake?
An Anant Corporation Story.
Topics
● Core Concepts
● Implementations
● Resources
What is a data lake?
● Data forever in one place
● Raw data stored as objects or files
○ Structured data exported from relational databases
■ CSV
■ TSV
○ Semi-structured data (CSV, logs, XML, JSON)
○ Unstructured data (emails, documents, PDFs)
○ Binary data (images, video, audio)
● On premises or in the cloud
○ HDFS-compatible storage (S3 / HDFS / MinIO / DSEFS)
○ MinIO
○ Ceph
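To make "raw data stored as objects or files" concrete, here is a minimal sketch that lands files unchanged under a date-partitioned key. A local directory stands in for the object store, and the `raw/<source>/<dataset>/<yyyy>/<mm>/<dd>/` layout, the function names, and the sample sources are assumptions for illustration, not something the slides prescribe.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def object_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a date-partitioned key (hypothetical layout chosen for this sketch)."""
    return f"raw/{source}/{dataset}/{day:%Y/%m/%d}/{filename}"


def put_object(root: Path, key: str, payload: bytes) -> Path:
    """Write the bytes unchanged under root/key, creating prefix directories."""
    target = root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # local stand-in for a bucket
        day = date(2024, 1, 15)
        # Structured export and semi-structured event, both stored as-is.
        put_object(lake, object_key("crm", "contacts", day, "contacts.csv"),
                   b"id,name\n1,Ada\n")
        put_object(lake, object_key("web", "events", day, "events.json"),
                   json.dumps({"event": "login"}).encode("utf-8"))
        for path in sorted(lake.rglob("*")):
            if path.is_file():
                print(path.relative_to(lake).as_posix())
```

The point of the sketch is that nothing is parsed or cleaned on the way in: the lake keeps the original bytes, and the key alone records where the data came from and when it arrived.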
Why do we need a data lake?
● Can finally do cool stuff with data science
○ Get data into a data lake
○ Data engineering / wrangling to clean the data
○ Save it back to the data lake
● From: Will Angel
○ Executive memory problem: Many people don't understand that a data lake can just be BigQuery these days. "Data lake" and "data warehouse" trigger a lot of PTSD in executives who have lived through bad data lake/warehouse projects and don't understand that the cost and complexity have come down a lot.
● Question from Will Angel
○ Garbage in, garbage out: How do we avoid our data lakes turning into data swamps?
○ Answer from Nirmal
■ Stream data in via Kafka (requires some filtration)
■ Leverage a data catalog (metadata, schema, name)
○ Other ideas
■ Separate data lakes for ingestion, for cleaner data, and for a not-quite-a-warehouse tier
■ Dataset identification / governance
■ Use Databricks bronze/silver/gold terminology
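The catalog and bronze/silver/gold ideas above can be sketched as a tiny in-memory registry. This is only an illustration under stated assumptions: the class names, the `missing_fields` check, and the example location are hypothetical, and a real deployment would use an actual catalog service rather than a Python dict.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Layer names follow the bronze/silver/gold terminology mentioned above:
# bronze = raw ingestion, silver = cleaned, gold = curated for consumers.
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class DatasetEntry:
    name: str                # dataset name
    layer: str               # one of LAYERS
    location: str            # where the files live (hypothetical URI)
    schema: Dict[str, str]   # column name -> declared type


class Catalog:
    """Minimal catalog: nothing is discoverable unless it is registered."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        if entry.layer not in LAYERS:
            raise ValueError(f"unknown layer: {entry.layer!r}")
        if not entry.schema:
            raise ValueError("a dataset needs a declared schema to be registered")
        self._entries[(entry.layer, entry.name)] = entry

    def missing_fields(self, layer: str, name: str, record: dict) -> List[str]:
        """Return the declared columns that are absent from a record."""
        entry = self._entries[(layer, name)]
        return [column for column in entry.schema if column not in record]


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(DatasetEntry(
        name="orders",
        layer="bronze",
        location="s3://example-lake/bronze/orders/",  # placeholder, not a real bucket
        schema={"order_id": "string", "amount": "double"},
    ))
    print(catalog.missing_fields("bronze", "orders", {"order_id": "A1"}))
```

Requiring a name, a layer, and a schema at registration time is one simple guard against a swamp: data that nobody described never becomes a dataset that others depend on.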
How do we get data into and out of a data lake?
● Ingress
○ Extract Load Transform (ELT)
○ Extract Transform Load (ETL)
○ Stream into it (Kafka, Spark Streaming, Flink, Alpakka)
○ Batch into it (*, Spark, MapReduce, etc.)
● Egress
○ Integration with query engines out of the box
■ Cloud
  ■ Snowflake
    ■ Storage: S3 / Azure Storage
    ■ Query: Snowflake SQL
  ■ Google BigQuery
    ■ Storage: Google Cloud Storage
    ■ Query: BigQuery
  ■ Azure Data Lake Analytics
    ■ Storage: Azure Storage
    ■ Query: Azure Data Lake Analytics
  ■ Amazon Redshift Spectrum
    ■ Storage: S3
    ■ Query: SQL
  ■ Amazon Athena
  ■ AWS Glue
■ Open Source
  ■ Presto
  ■ Hive
  ■ Spark SQL / Spark
○ Stream out of it (Spark Streaming, Flink, Kafka, Alpakka)
○ Batch out of it (*, Spark, MapReduce, etc.)
○ Extract Load Transform (ELT)
○ Extract Transform Load (ETL)
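The difference between ETL and ELT listed above is only where the transform runs. The sketch below shows both on the same rows, with an in-memory SQLite database standing in for the target; the table names, sample rows, and cleaning rules are assumptions made up for the example.

```python
import sqlite3

# Extracted rows arrive as text, including one bad amount ("n/a").
RAW_ROWS = [("1", " Ada ", "10.50"), ("2", "Grace", "n/a"), ("3", "Linus ", "7")]


def etl(conn: sqlite3.Connection, rows) -> None:
    """Extract -> Transform in Python -> Load only the cleaned rows."""
    conn.execute("CREATE TABLE clean_etl (id INTEGER, name TEXT, amount REAL)")
    cleaned = []
    for raw_id, name, amount in rows:
        try:
            cleaned.append((int(raw_id), name.strip(), float(amount)))
        except ValueError:
            continue  # rejected rows never reach the target
    conn.executemany("INSERT INTO clean_etl VALUES (?, ?, ?)", cleaned)


def elt(conn: sqlite3.Connection, rows) -> None:
    """Extract -> Load the raw rows as-is -> Transform later with SQL in the target."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE clean_elt AS
        SELECT CAST(id AS INTEGER) AS id,
               TRIM(name)          AS name,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'  -- crude filter: keep amounts that start with a digit
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    etl(conn, RAW_ROWS)
    elt(conn, RAW_ROWS)
    print(conn.execute("SELECT * FROM clean_etl ORDER BY id").fetchall())
    print(conn.execute("SELECT * FROM clean_elt ORDER BY id").fetchall())
    print(conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0])
```

Both paths end with the same cleaned rows, but only the ELT path still holds the rejected raw row afterwards. That retained raw copy is why ELT tends to pair naturally with a data lake.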
Implementations
● Original (on premises)
○ HDFS
○ SAN/NAS
● Open Source
○ Object Storage
■ MinIO
■ Ceph
○ Structured / Formatted Files
■ Parquet
■ JSON
■ CSV
■ XML
■ Delta Lake (Parquet)
○ Structured / Databases
■ Bigtable
■ Cassandra
● Cloud
○ S3 / Amazon Athena
○ Azure Data Lake
○ Google Cloud Storage / BigQuery
○ Snowflake
○ Databricks
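The choice among the file formats listed above affects what survives a write and read cycle. As a small illustration using only the Python standard library, the sketch below round-trips the same records through CSV and through JSON Lines; the sample records and function names are made up for the example. Parquet and Delta Lake are left out only because they need third-party libraries.

```python
import csv
import io
import json

RECORDS = [
    {"id": 1, "active": True, "tags": ["a", "b"]},
    {"id": 2, "active": False, "tags": []},
]


def roundtrip_csv(records):
    """Write records as CSV text and read them back; every value returns as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return [dict(row) for row in csv.DictReader(buffer)]


def roundtrip_jsonl(records):
    """Write one JSON document per line and read them back; JSON types are preserved."""
    text = "\n".join(json.dumps(record) for record in records)
    return [json.loads(line) for line in text.splitlines()]


if __name__ == "__main__":
    print(roundtrip_csv(RECORDS)[0])
    print(roundtrip_jsonl(RECORDS)[0])
```

CSV hands back `"1"` and `"True"` as plain strings, so the schema has to live somewhere else, such as a data catalog. JSON keeps numbers, booleans, and lists, and columnar formats like Parquet go further by storing the schema alongside the data.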
Resources
● Data lake - Wikipedia
https://en.wikipedia.org/wiki/Data_lake
● Three Reasons to Build a Security Data Lake | by Omer Singer | Medium
https://medium.com/@osinger/three-reasons-to-build-a-security-data-lake-75d74ff10c6a
● Introduction to Azure Data Lake - DZone Big Data
https://dzone.com/articles/introduction-to-azure-data-lake
● What Is a Data Lake and Why Is It Essential for Big Data?
https://learn.g2.com/what-is-a-data-lake
● What is a data lake?
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
● Cloud Storage as a data lake | Architectures | Google Cloud
https://cloud.google.com/solutions/build-a-data-lake-on-gcp
● Netflix/metacat
https://github.com/Netflix/metacat
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Data Engineer's Lunch #5: What is a Data Lake?
