SlideShare a Scribd company logo
1 of 13
Download to read offline
Build an Open Source Data Lake
For Data Scientists
Ke Zhu, Matthew Neal, Emma Dickson, Liwei Wang, Justin Eyster
Disclaimer
The postings on this document are
my own and don't necessarily
represent IBM's positions,
strategies or opinions.
Agenda
●
Problem
●
Solution
●
Features
●
Lessons Learned
●
Future work
Problem – Use Case
●
4TB+ fast growing data
●
Data comes from 20+ data sources in variety of formats
●
Data scientists consume data, not building databases
●
Small team (to manage everything)
●
Supports different tasks - dashboards, reports, machine
learning, indexes, etc
Problem – System challenges
●
Store and query lots of small or big data files
●
No control of data representation
●
Variety of workloads – reporting/visualization/analytics/ML
●
Security/privacy requirements
●
Manage all data stack by small team (3~4 people)
●
Friendly UX for data scientists (and engineers)
Solution - Design
●
Uses data lake to separate data model, storage engine and query engine
●
Cloud-Native (self-service APIs, scale by $), Open API infrastructure
provider
●
Open source data stack to maximize transparency
●
Support multiple query engines (Pandas, SQL, Apache Spark and ChartIO)
Solution – Architecture
Solution – Software & Operation
●
Infrastructure – Kubernetes and S3-compatible object storage
●
data catalog - Apache Hive
●
SQL query engine - PrestoSQL
●
Continuous delivery by Helm over Travis CI
●
UX: JupyterHub and Docker
Highlights – User Experiences
●
Data scientists complete tasks
within a web browser (via
JupyterLab)
●
Virtual data views via domain
oriented data catalog
●
Cross data store query via single
SQL (via PrestoSQL)
●
Be able to data lake from ChartIO
(a dashboard service)
Highlights – DevOps friendly
●
GitOps style operation
– Git repo as single source
– configuration by Helm charts
– software distribution via docker container image
●
All k8s service and data store access within private network
●
Embrace Kubernetes RBAC
Lessons Learned
●
Organize source domains by buckets, store data via Hive style partitions
●
No secrets in Jupyter Notebooks – inject via environment variables
●
Use data compression (gzip, bzip2) w/o data archive formats (.zip, .tar, .rar,
etc)
●
Restore/shutdown database as needed to avoid maintaining uptime
Future work
●
k8s operator for PrestoSQL (github.com/prestosql/presto#396)
●
Metacat as data catalog
●
Minio as object storage software
●
Plug-in more query engines (e.g., Cloud SQL Query)
Reference
●
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
●
How to Layout Big Data in IBM Cloud Object Storage for Spark SQL
●
Maximize observability of your DevOps pipeline with GitOps

More Related Content

What's hot

Metadata Management in Islandora
Metadata Management in IslandoraMetadata Management in Islandora
Metadata Management in IslandoraDavid Wilcox
 
BigQuery for the Big Data win
BigQuery for the Big Data winBigQuery for the Big Data win
BigQuery for the Big Data winKen Taylor
 
Ruby onrails overview
Ruby onrails overviewRuby onrails overview
Ruby onrails overviewPiyush Chand
 
Cloud computing course introduction
Cloud computing course introductionCloud computing course introduction
Cloud computing course introductionHaddy El-Haggan
 
OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...
OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...
OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...OpenNebula Project
 
Open source data_warehousing_overview
Open source data_warehousing_overviewOpen source data_warehousing_overview
Open source data_warehousing_overviewAlex Meadows
 
Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...
Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...
Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...Amazon Web Services
 
Physical design relational_db_devanshu
Physical design relational_db_devanshuPhysical design relational_db_devanshu
Physical design relational_db_devanshuDevanshu Shrivastava
 
Machine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackMachine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackLynn Langit
 
Introduction to Redis Data Structures
Introduction to Redis Data Structures Introduction to Redis Data Structures
Introduction to Redis Data Structures ScaleGrid.io
 
Open source web-gis packages, geoserver-rest and pySLD
Open source web-gis packages, geoserver-rest and pySLDOpen source web-gis packages, geoserver-rest and pySLD
Open source web-gis packages, geoserver-rest and pySLDTek Kshetri
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Not only SQL - Database Choices
Not only SQL - Database ChoicesNot only SQL - Database Choices
Not only SQL - Database ChoicesLynn Langit
 
Expert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive MetastoreExpert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive MetastorelakeFS
 
Visualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and KibanaVisualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and KibanaObjectRocket
 
Db presentation google_megastore
Db presentation google_megastoreDb presentation google_megastore
Db presentation google_megastoreAlanoud Alqoufi
 

What's hot (20)

Metadata Management in Islandora
Metadata Management in IslandoraMetadata Management in Islandora
Metadata Management in Islandora
 
BigQuery for the Big Data win
BigQuery for the Big Data winBigQuery for the Big Data win
BigQuery for the Big Data win
 
Ruby onrails overview
Ruby onrails overviewRuby onrails overview
Ruby onrails overview
 
Cloud computing course introduction
Cloud computing course introductionCloud computing course introduction
Cloud computing course introduction
 
OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...
OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...
OpenNebulaConf2017EU: Providing cloud and Managed Hosting Environment by Mich...
 
Open source data_warehousing_overview
Open source data_warehousing_overviewOpen source data_warehousing_overview
Open source data_warehousing_overview
 
Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...
Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...
Visualise Your Cloud Data Strategy: MongoDB Atlas and Charts (Sponsored by Mo...
 
Physical design relational_db_devanshu
Physical design relational_db_devanshuPhysical design relational_db_devanshu
Physical design relational_db_devanshu
 
Machine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackMachine Learning on the Microsoft Stack
Machine Learning on the Microsoft Stack
 
TYPO3 and CMIS
TYPO3 and CMISTYPO3 and CMIS
TYPO3 and CMIS
 
Introduction to Redis Data Structures
Introduction to Redis Data Structures Introduction to Redis Data Structures
Introduction to Redis Data Structures
 
Open source web-gis packages, geoserver-rest and pySLD
Open source web-gis packages, geoserver-rest and pySLDOpen source web-gis packages, geoserver-rest and pySLD
Open source web-gis packages, geoserver-rest and pySLD
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
SortaSQL
SortaSQLSortaSQL
SortaSQL
 
Not only SQL - Database Choices
Not only SQL - Database ChoicesNot only SQL - Database Choices
Not only SQL - Database Choices
 
Expert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive MetastoreExpert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive Metastore
 
Visualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and KibanaVisualizing Austin's data with Elasticsearch and Kibana
Visualizing Austin's data with Elasticsearch and Kibana
 
Blockchain at internet scale
Blockchain at internet scaleBlockchain at internet scale
Blockchain at internet scale
 
Shaun-Ellis-feb25
Shaun-Ellis-feb25Shaun-Ellis-feb25
Shaun-Ellis-feb25
 
Db presentation google_megastore
Db presentation google_megastoreDb presentation google_megastore
Db presentation google_megastore
 

Similar to Build an Open Source Data Lake for Data Scientists

IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureDenodo
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Big data berlin
Big data berlinBig data berlin
Big data berlinkammeyer
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 

Similar to Build an Open Source Data Lake for Data Scientists (20)

IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Py tables
Py tablesPy tables
Py tables
 
PyTables
PyTablesPyTables
PyTables
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
DA_01_Intro.pptx
DA_01_Intro.pptxDA_01_Intro.pptx
DA_01_Intro.pptx
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
PyTables
PyTablesPyTables
PyTables
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
PostgreSQL and MySQL
PostgreSQL and MySQLPostgreSQL and MySQL
PostgreSQL and MySQL
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric ArchitectureShaping the Role of a Data Lake in a Modern Data Fabric Architecture
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 

Recently uploaded

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 

Recently uploaded (20)

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

Build an Open Source Data Lake for Data Scientists

  • 1. Build an Open Source Data Lake For Data Scientists Ke Zhu, Matthew Neal, Emma Dickson, Liwei Wang, Justin Eyster
  • 2. Disclaimer The postings on this document are my own and don't necessarily represent IBM's positions, strategies or opinions.
  • 4. Problem – Use Case ● 4TB+ fast growing data ● Data comes from 20+ data sources in variety of formats ● Data scientists consume data, not building databases ● Small team (to manage everything) ● Supports different tasks - dashboards, reports, machine learning, indexes, etc
  • 5. Problem – System challenges ● Store and query lots of small or big data files ● No control of data representation ● Variety of workloads – reporting/visualization/analytics/ML ● Security/privacy requirements ● Manage all data stack by small team (3~4 people) ● Friendly UX for data scientists (and engineers)
  • 6. Solution - Design ● Uses data lake to separate data model, storage engine and query engine ● Cloud-Native (self-service APIs, scale by $), Open API infrastructure provider ● Open source data stack to maximize transparency ● Support multiple query engines (Pandas, SQL, Apache Spark and ChartIO)
  • 8. Solution – Software & Operation ● Infrastructure – Kubernetes and S3-compatible object storage ● data catalog - Apache Hive ● SQL query engine - PrestoSQL ● Continuous delivery by Helm over Travis CI ● UX: JupyterHub and Docker
  • 9. Highlights – User Experiences ● Data scientists complete tasks within a web browser (via JupyterLab) ● Virtual data views via domain oriented data catalog ● Cross data store query via single SQL (via PrestoSQL) ● Be able to data lake from ChartIO (a dashboard service)
  • 10. Highlights – DevOps friendly ● GitOps style operation – Git repo as single source – configuration by Helm charts – software distribution via docker container image ● All k8s service and data store access within private network ● Embrace Kubernetes RBAC
  • 11. Lessons Learned ● Organize source domains by buckets, store data via Hive style partitions ● No secrets in Jupyter Notebooks – inject via environment variables ● Use data compression (gzip, bzip2) w/o data archive formats (.zip, .tar, .rar, etc) ● Restore/shutdown database as needed to avoid maintaining uptime
  • 12. Future work ● k8s operator for PrestoSQL (github.com/prestosql/presto#396) ● Metacat as data catalog ● Minio as object storage software ● Plug-in more query engines (e.g., Cloud SQL Query)
  • 13. Reference ● How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh ● How to Layout Big Data in IBM Cloud Object Storage for Spark SQL ● Maximize observability of your DevOps pipeline with GitOps