This is a talk I presented at the 2019 ICSA (International Chinese Statistical Association) Applied Statistics Symposium, in the session "How Data Science Drives Success in Enterprises".
4. Problem – Use Case
● 4+ TB of fast-growing data
● Data comes from 20+ data sources in a variety of formats
● Data scientists consume data; they don't build databases
● Small team (to manage everything)
● Supports different tasks: dashboards, reports, machine learning, indexes, etc.
5. Problem – System Challenges
● Store and query lots of small or big data files
● No control over data representation
● Variety of workloads: reporting, visualization, analytics, ML
● Security/privacy requirements
● The whole data stack is managed by a small team (3–4 people)
● Friendly UX for data scientists (and engineers)
6. Solution – Design
● Uses a data lake to separate the data model, storage engine, and query engine
● Cloud-native (self-service APIs, scale by $) on an open-API infrastructure provider
● Open-source data stack to maximize transparency
● Supports multiple query engines (Pandas, SQL, Apache Spark, and ChartIO)
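Because the storage layer is just files in object storage, the same data a SQL engine queries can also be loaded straight into Pandas. A minimal sketch of that engine-independence, using a local gzip-compressed part file as a stand-in for an object such as `s3://bucket/table/dt=.../part-000.csv.gz` (file and column names are hypothetical; reading directly from S3 would additionally need a filesystem adapter like s3fs):

```python
import gzip

import pandas as pd

# Write a tiny gzip-compressed CSV part file, standing in for an
# object stored in the data lake.
with gzip.open("part-000.csv.gz", "wt") as f:
    f.write("order_id,amount\n1,10\n2,25\n")

# Pandas infers gzip compression from the .gz suffix, so the same
# file Presto or Spark would scan is readable here with no database.
df = pd.read_csv("part-000.csv.gz")
print(df["amount"].sum())  # → 35
```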
8. Solution – Software & Operation
● Infrastructure – Kubernetes and S3-compatible object storage
● Data catalog – Apache Hive
● SQL query engine – PrestoSQL
● Continuous delivery with Helm over Travis CI
● UX – JupyterHub and Docker
9. Highlights – User Experience
● Data scientists complete tasks within a web browser (via JupyterLab)
● Virtual data views via a domain-oriented data catalog
● Cross-data-store queries in a single SQL statement (via PrestoSQL)
● Query the data lake from ChartIO (a dashboard service)
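The cross-data-store bullet can be sketched from a notebook with the presto-python-client package; the coordinator host, catalogs, and table names below are hypothetical, and the point is only that one SQL statement spans two stores via Presto's catalog prefixes:

```python
# One SQL statement joining two different data stores through
# Presto catalog prefixes (hypothetical catalogs/schemas/tables).
FEDERATED_QUERY = """
SELECT o.order_id, u.email
FROM hive.sales.orders AS o   -- files in object storage via Hive catalog
JOIN mysql.crm.users   AS u   -- a live MySQL database
  ON o.user_id = u.user_id
WHERE o.dt = DATE '2019-06-01'
"""


def run_query(host, port=8080, user="analyst"):
    # presto-python-client (pip install presto-python-client);
    # imported locally so the sketch loads without the package.
    import prestodb

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user,
        catalog="hive", schema="sales",
    )
    cur = conn.cursor()
    cur.execute(FEDERATED_QUERY)
    return cur.fetchall()
```

From JupyterLab this is just `rows = run_query("presto.internal")` against your own coordinator; ChartIO talks to the same endpoint over Presto's wire protocol.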
10. Highlights – DevOps Friendly
● GitOps-style operation
– Git repo as the single source of truth
– Configuration by Helm charts
– Software distribution via Docker container images
● All k8s services and data stores are accessed within a private network
● Embrace Kubernetes RBAC
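"Embrace Kubernetes RBAC" can look like the following namespaced Role and RoleBinding: a minimal read-only grant for a data-science group. This is a sketch; the namespace, role, and group names are hypothetical, and the real charts would carry their own manifests.

```yaml
# Read-only access to pods and services in the "datalake" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: datalake
  name: datalake-reader
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: datalake
  name: datalake-reader-binding
subjects:
- kind: Group
  name: data-scientists
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: datalake-reader
  apiGroup: rbac.authorization.k8s.io
```

Because the manifest lives in the Git repo and is applied by CI, access changes go through the same GitOps review path as everything else.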
11. Lessons Learned
● Organize source domains by bucket; store data via Hive-style partitions
● No secrets in Jupyter notebooks – inject them via environment variables
● Use data compression (gzip, bzip2) without archive formats (.zip, .tar, .rar, etc.)
● Restore/shut down databases as needed to avoid maintaining uptime
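The first two lessons can be sketched in a few lines of Python; the bucket, table, and environment-variable names are hypothetical:

```python
import os


def part_uri(domain_bucket, table, dt, part):
    # One bucket per source domain; Hive-style key=value path
    # segments (dt=...) let engines like Presto and Spark prune
    # partitions by date without scanning every object.
    return f"s3://{domain_bucket}/{table}/dt={dt}/{part}"


# A gzip-compressed part file (.csv.gz), not an archive (.zip/.tar):
# engines can stream-decompress it directly without unpacking.
uri = part_uri("crm", "orders", "2019-06-01", "part-000.csv.gz")
print(uri)  # → s3://crm/orders/dt=2019-06-01/part-000.csv.gz

# No secrets in notebooks: read credentials injected into the
# container environment (e.g. by the JupyterHub spawner), never
# hard-coded in a cell that gets committed to Git.
db_password = os.environ.get("DB_PASSWORD")
```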
12. Future Work
● k8s operator for PrestoSQL (github.com/prestosql/presto#396)
● Metacat as the data catalog
● MinIO as the object storage software
● Plug in more query engines (e.g., Cloud SQL Query)
13. References
● How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
● How to Layout Big Data in IBM Cloud Object Storage for Spark SQL
● Maximize observability of your DevOps pipeline with GitOps