Generative AI on Enterprise Cloud with NiFi and Milvus
Data science platform in the modern era of "data swamps"
1. Data Science Platform in the modern era of "data swamps"
A few thoughts and predictions about the evolution of the toolset for data analytics, data science, and other BI-related work.
2. About
• I have worked in the IT industry for more than 12 years and delivered more than 50 projects in different domains.
• Working with "big" data in production since 2013
• Tech agnostic
• Really know why DevOps is not a job title and why unicorpses live
• Analysis: https://medium.com/devoops-and-universe/it-trends-guide-in-2016-lot-of-marketing-mimicking-and-even-more-unicorpses-3b68548c72da
• DevOops World Group https://www.linkedin.com/groups/8911035/
• DevOops World Telegram Channel https://t.me/DevOops
Yaroslav Ravlinko
3. Data Warehouse, "Data lake"*, "Data swamp"
*https://martinfowler.com/bliki/DataLake.html
5. “Data Science” lifecycle
Stages inside the Data Science/ML platform:
• Data Ingestion
• Feature Engineering
• Prototyping and Training
• Model Selection and Validation
• Model Serving
6. Let's fill the gap between Data Processing and Model Serving.
"Works on my machine"
Stages inside the Data Science/ML platform:
• Data Ingestion
• Feature Engineering
• Prototyping and Training
• Model Selection and Validation
• Model Serving
7. Reality of “data science” in production
Between Data Ingestion and Model Serving: something important in the form of a "Model"
Around the Data Science/ML platform:
• Scheduler, Workflow Manager
• Configuration and Deployment System
• Authorization, Authentication and Audit System
• Monitoring
Stages inside the Data Science/ML platform:
• Data Ingestion
• Feature Engineering
• Prototyping and Training
• Model Selection and Validation
• Model Serving
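The point of this slide can be sketched in a few lines of Python: the "model" is one small function, and most of the code around it is platform concerns. This is only an illustration under stated assumptions. `ServingWrapper`, `toy_model`, the user allow-list, and the in-memory audit and latency lists are hypothetical stand-ins for a real authorization, audit, and monitoring system, not part of any tool named in the deck.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ServingWrapper:
    """Wraps a bare model function with platform concerns from the slide."""
    model: Callable[[List[float]], float]
    allowed_users: Set[str]  # hypothetical stand-in for authorization/authentication
    audit_log: List[Dict[str, object]] = field(default_factory=list)  # stand-in for audit
    latencies_ms: List[float] = field(default_factory=list)  # stand-in for monitoring

    def predict(self, user: str, features: List[float]) -> float:
        allowed = user in self.allowed_users
        # Audit every attempt, including rejected ones.
        self.audit_log.append({"user": user, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"user {user!r} may not call the model")
        # Record how long the model call took.
        start = time.perf_counter()
        result = self.model(features)
        self.latencies_ms.append((time.perf_counter() - start) * 1000.0)
        return result


def toy_model(features: List[float]) -> float:
    # Hypothetical "model": the mean of the inputs.
    return sum(features) / len(features)


wrapper = ServingWrapper(model=toy_model, allowed_users={"alice"})
print(wrapper.predict("alice", [1.0, 2.0, 3.0]))
```

Even in this toy version, the model itself is two lines, while access control, audit, and monitoring make up the rest.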
8. New toolset
DS Platform:
• ML Sandbox
• ML Production Platform
• K8s Cluster (TensorFlow, Spark and Docker runtime)
CI/CD artifacts:
• Trained Model
• Model Scripts
• Model Config
• Data Scripts
Data and interfaces:
• CSV/Parquet/XML etc.
• Images, binaries
• API (REST, SQL)
Frameworks and serving:
• TensorFlow/Keras/etc.
• Spark/Flink
• Clipper/TF Serving
• TF Serving/Seldon Core
K8s Workloads:
• K8s Jobs
• Kubeflow
• JupyterHub
• Argo
Providers
Resource Management
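One way to picture what flows through CI/CD on this slide is a versioned bundle of the four artifacts (trained model, model scripts, model config, data scripts) with checksums, so the production platform can confirm it runs exactly what the sandbox produced. The sketch below uses only the Python standard library. `ReleaseBundle`, `build_bundle`, and `verify_bundle` are hypothetical names chosen for illustration; the deck does not prescribe this format.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict

# The four artifact kinds named on the slide.
REQUIRED_ARTIFACTS = ("trained_model", "model_scripts", "model_config", "data_scripts")


def sha256_hex(payload: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's content."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ReleaseBundle:
    version: str
    checksums: Dict[str, str]  # artifact kind -> SHA-256 of its content


def build_bundle(version: str, artifacts: Dict[str, bytes]) -> ReleaseBundle:
    """Refuse to build a bundle unless all four artifact kinds are present."""
    missing = [name for name in REQUIRED_ARTIFACTS if name not in artifacts]
    if missing:
        raise ValueError(f"bundle {version} is missing: {', '.join(missing)}")
    checksums = {name: sha256_hex(artifacts[name]) for name in REQUIRED_ARTIFACTS}
    return ReleaseBundle(version, checksums)


def verify_bundle(bundle: ReleaseBundle, artifacts: Dict[str, bytes]) -> bool:
    """Check that every artifact is present and matches its recorded checksum."""
    return all(
        name in artifacts and sha256_hex(artifacts[name]) == digest
        for name, digest in bundle.checksums.items()
    )
```

A deployment step could call `verify_bundle` before serving, which is one small answer to "works on my machine": the model, its scripts, and its config travel and are checked together.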
9. Let's go beyond Jupyter Notebook
Kubernetes API
Master Components:
• etcd
• apiserver
• Controllers
Support Operators and CR:
• Prometheus
• ArgoCD
• Fluentd
Kubeflow Operators and CR:
• JupyterHub
• Minio
• Katib
• Argo
• Istio
• tf-job
• Dex
• PVC
• UIs: tf-dashboard, Katib UI, Pipeline UI, Jupyter
Serving Space:
• Seldon
• Clipper
• TF Serving
• KFServing
ML Operators and Services:
• Chainer
• MPI
• PyTorch
"Big" data operators:
• MLflow
• Spark
• Flink
• Kafka
• Dask
• Metadata