This is a talk I presented at the 2019 ICSA (International Chinese Statistical Association) Applied Statistics Symposium, in the session "How Data Science Drives Success in Enterprises".
4. Problem – Use Case
● 4+ TB of fast-growing data
● Data comes from 20+ data sources in a variety of formats
● Data scientists consume data; they don't build databases
● Small team (to manage everything)
● Supports different tasks: dashboards, reports, machine learning, indexes, etc.
5. Problem – System Challenges
● Store and query lots of small or big data files
● No control over data representation
● Variety of workloads: reporting, visualization, analytics, ML
● Security/privacy requirements
● The whole data stack is managed by a small team (3–4 people)
● Friendly UX for data scientists (and engineers)
6. Solution – Design
● Uses a data lake to separate the data model, storage engine, and query engine
● Cloud-native (self-service APIs, scale by $) on an open-API infrastructure provider
● Open-source data stack to maximize transparency
● Supports multiple query engines (Pandas, SQL, Apache Spark, and ChartIO)
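Because the storage layer is just files in object storage, the same data a SQL engine queries can also be loaded straight into Pandas. A minimal sketch of that engine-independence, using a local gzip-compressed part file as a stand-in for an object such as `s3://bucket/table/dt=.../part-000.csv.gz` (file and column names are hypothetical; reading directly from S3 would additionally need a filesystem adapter like s3fs):

```python
import gzip

import pandas as pd

# Write a tiny gzip-compressed CSV part file, standing in for an
# object stored in the data lake.
with gzip.open("part-000.csv.gz", "wt") as f:
    f.write("order_id,amount\n1,10\n2,25\n")

# Pandas infers gzip compression from the .gz suffix, so the same
# file Presto or Spark would scan is readable here with no database.
df = pd.read_csv("part-000.csv.gz")
print(df["amount"].sum())  # → 35
```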
8. Solution – Software & Operation
● Infrastructure – Kubernetes and S3-compatible object storage
● Data catalog – Apache Hive
● SQL query engine – PrestoSQL
● Continuous delivery with Helm over Travis CI
● UX – JupyterHub and Docker
9. Highlights – User Experience
● Data scientists complete tasks within a web browser (via JupyterLab)
● Virtual data views via a domain-oriented data catalog
● Cross-data-store queries in a single SQL statement (via PrestoSQL)
● Query the data lake from ChartIO (a dashboard service)
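The cross-data-store bullet can be sketched from a notebook with the presto-python-client package; the coordinator host, catalogs, and table names below are hypothetical, and the point is only that one SQL statement spans two stores via Presto's catalog prefixes:

```python
# One SQL statement joining two different data stores through
# Presto catalog prefixes (hypothetical catalogs/schemas/tables).
FEDERATED_QUERY = """
SELECT o.order_id, u.email
FROM hive.sales.orders AS o   -- files in object storage via Hive catalog
JOIN mysql.crm.users   AS u   -- a live MySQL database
  ON o.user_id = u.user_id
WHERE o.dt = DATE '2019-06-01'
"""


def run_query(host, port=8080, user="analyst"):
    # presto-python-client (pip install presto-python-client);
    # imported locally so the sketch loads without the package.
    import prestodb

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user,
        catalog="hive", schema="sales",
    )
    cur = conn.cursor()
    cur.execute(FEDERATED_QUERY)
    return cur.fetchall()
```

From JupyterLab this is just `rows = run_query("presto.internal")` against your own coordinator; ChartIO talks to the same endpoint over Presto's wire protocol.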
10. Highlights – DevOps Friendly
● GitOps-style operation
– Git repo as the single source of truth
– Configuration by Helm charts
– Software distribution via Docker container images
● All k8s services and data stores are accessed within a private network
● Embrace Kubernetes RBAC
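"Embrace Kubernetes RBAC" can look like the following namespaced Role and RoleBinding: a minimal read-only grant for a data-science group. This is a sketch; the namespace, role, and group names are hypothetical, and the real charts would carry their own manifests.

```yaml
# Read-only access to pods and services in the "datalake" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: datalake
  name: datalake-reader
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: datalake
  name: datalake-reader-binding
subjects:
- kind: Group
  name: data-scientists
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: datalake-reader
  apiGroup: rbac.authorization.k8s.io
```

Because the manifest lives in the Git repo and is applied by CI, access changes go through the same GitOps review path as everything else.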
11. Lessons Learned
● Organize source domains by bucket; store data via Hive-style partitions
● No secrets in Jupyter notebooks – inject them via environment variables
● Use data compression (gzip, bzip2) without archive formats (.zip, .tar, .rar, etc.)
● Restore/shut down databases as needed to avoid maintaining uptime
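The first two lessons can be sketched in a few lines of Python; the bucket, table, and environment-variable names are hypothetical:

```python
import os


def part_uri(domain_bucket, table, dt, part):
    # One bucket per source domain; Hive-style key=value path
    # segments (dt=...) let engines like Presto and Spark prune
    # partitions by date without scanning every object.
    return f"s3://{domain_bucket}/{table}/dt={dt}/{part}"


# A gzip-compressed part file (.csv.gz), not an archive (.zip/.tar):
# engines can stream-decompress it directly without unpacking.
uri = part_uri("crm", "orders", "2019-06-01", "part-000.csv.gz")
print(uri)  # → s3://crm/orders/dt=2019-06-01/part-000.csv.gz

# No secrets in notebooks: read credentials injected into the
# container environment (e.g. by the JupyterHub spawner), never
# hard-coded in a cell that gets committed to Git.
db_password = os.environ.get("DB_PASSWORD")
```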
12. Future Work
● k8s operator for PrestoSQL (github.com/prestosql/presto#396)
● Metacat as the data catalog
● MinIO as the object storage software
● Plug in more query engines (e.g., Cloud SQL Query)
13. References
● How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
● How to Layout Big Data in IBM Cloud Object Storage for Spark SQL
● Maximize observability of your DevOps pipeline with GitOps