Databricks
Empowering you to do more with data
Francois Callewaert
Senior Data Scientist
Agenda
▪ Intro
▪ Data Lifecycle
▪ Demo
My career in data tools
Before Now
EDA-ML EDA++, sharing
Build and refresh own datasets
Build and refresh own datasets in Python
Azure SQL DB Azure Data Factory
Azure VM
Windows Scheduler
Model-serving and delivery
Azure ML
Software (app, website, IoT…)
User / signals
Raw logs
Clean logs (de-dup, PII removal…)
Business data (aggregates, joins, metrics, features…)
Analytics
Modelling
Software Engineer
Data Engineer
Data Analyst
Data Scientist
Recommender Systems
Business Decisions
Data lifecycle
Program Manager
ML Engineer
Bronze Silver Gold
DE/DA/DS
Demo
• Dataset = https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store
• User actions: VIEW, CART, PURCHASE
• Data Engineer: Delta, Auto Loader
• Data Analyst: Spark SQL, Notebooks, Jobs, SQL Analytics
• Data Scientist: PySpark, Notebooks, Jobs
• GitHub: https://github.com/databricks/tech-talks/tree/master/samples/2020-09-16%20%7C%20eCommerce%20demo
Data Engineer
Dimension table (id → name)
Cloud bucket (raw logs)
CSV files are appended every few seconds
Software Engineer
Auto Loader + JOIN
eCommerce events (BUY, VIEW, CART)
Clean event logs
Data Engineer
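
The Data Engineer step (de-duplicate the raw events, then join them against the id → name dimension table) can be sketched in plain Python. This only illustrates the logic; it is not the Auto Loader or Delta code from the demo. The column names, the sample rows, and the helper `clean_events` are assumptions made for the example.

```python
import csv
import io

# Hypothetical raw log batch: columns and values are illustrative only.
RAW_CSV = """event_time,event_type,product_id,user_id
2020-09-16 10:00:00,view,1,u1
2020-09-16 10:00:00,view,1,u1
2020-09-16 10:01:00,cart,1,u1
2020-09-16 10:02:00,purchase,2,u2
"""

# Hypothetical dimension table mapping product id -> product name.
PRODUCT_NAMES = {"1": "phone", "2": "laptop"}


def clean_events(raw_csv, dimension):
    """Drop exact duplicate rows and attach the product name (left join)."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        key = tuple(row.values())
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        # Left join: unknown ids get None instead of being dropped.
        row["product_name"] = dimension.get(row["product_id"])
        cleaned.append(row)
    return cleaned


clean = clean_events(RAW_CSV, PRODUCT_NAMES)
```

In the demo this role is played by a streaming read of the cloud bucket followed by a join; the sketch above processes a single in-memory batch to keep the logic visible.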
Data Analyst
Clean event logs
Data Engineer
Data Analyst
Notebook
EDA
Job
Product metrics (Business data)
SQL Analytics
Dashboard
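
As a rough illustration of the "Product metrics" box, the sketch below counts events by type and derives a view-to-purchase rate. It is a plain-Python stand-in, not the Spark SQL used in the demo; the sample events, the metric names, and the helper `funnel_metrics` are assumptions.

```python
from collections import Counter

# Hypothetical clean events as (user_id, event_type) pairs.
EVENTS = [
    ("u1", "view"), ("u1", "cart"), ("u1", "purchase"),
    ("u2", "view"), ("u2", "cart"),
    ("u3", "view"),
    ("u4", "view"),
]


def funnel_metrics(events):
    """Count events per type and compute a simple view-to-purchase rate."""
    counts = Counter(event_type for _, event_type in events)
    views = counts["view"]
    return {
        "views": views,
        "carts": counts["cart"],
        "purchases": counts["purchase"],
        # Guard against division by zero when there are no views.
        "view_to_purchase": counts["purchase"] / views if views else 0.0,
    }


metrics = funnel_metrics(EVENTS)
```

A scheduled job would recompute such aggregates on the clean event logs and write them to the business-data table that feeds the dashboard.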
Data Scientist
Clean event logs
Data Engineer
Data Scientist
Notebook
EDA - Feature engineering
Job
Feature table
MLflow experiment
Prod model
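
One possible shape for the "Feature table" box is a per-user count of each action type. The sketch below is plain Python rather than the PySpark used in the demo; the feature names, the sample events, and the helper `build_feature_table` are assumptions, and the demo's actual features may differ.

```python
from collections import defaultdict

# Hypothetical clean events as (user_id, event_type) pairs.
EVENTS = [
    ("u1", "view"), ("u1", "view"), ("u1", "cart"), ("u1", "purchase"),
    ("u2", "view"), ("u2", "cart"),
    ("u3", "view"),
]


def build_feature_table(events):
    """Aggregate per-user counts of each action into one feature row."""
    table = defaultdict(lambda: {"n_view": 0, "n_cart": 0, "n_purchase": 0})
    for user_id, event_type in events:
        table[user_id][f"n_{event_type}"] += 1
    return dict(table)


features = build_feature_table(EVENTS)
```

Rows of this kind would then be the input to model training, with each training run tracked as an experiment.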
Conclusion
Do more with data.
Contact: francois.callewaert@databricks.com
Databricks

Databricks: A Tool That Empowers You To Do More With Data
