Building Data Science into Organizations: Field Experience
Chris Robison
Joseph Bradley
Data + AI Summit 2021
Our perspectives
Joseph Bradley
● Sr. Solutions Architect
● 2nd ML Engineer at Databricks
● Apache Spark committer and PMC member
Chris Robison
● Sr. Solutions Architect
● Former Director of Data Science and Omni-channel Marketing at Overstock.com
● Career data scientist and avid Apache Spark user
The Data and AI Company
5000+ CUSTOMERS across the globe
Lakehouse
One simple platform to unify all of your data, analytics, and AI workloads
ORIGINAL CREATORS
So you want to do Data Science...
98.8% of Fortune 1,000 companies are investing in strategic Big Data & AI initiatives.
14.4% of Fortune 1,000 companies say they have deployed AI capabilities into widespread production.
Source: NewVantage Partners
Goals of a DS/ML/AI program
Short-term
● Validate that DS is worthwhile
● Get resources:
○ Data
○ Data Scientists
○ Executive sponsorship
● Show vision
Long-term
● Show business impact
● Increase productivity
● Scale DS across the organization
Challenges of a DS/ML/AI program
Technology and platform
● Poor integration between Data Science and other data teams
● Planning for scale and production, under investment constraints
Organization
● Team building: skill sets, hiring, and training
● Team organization: embedded vs. standalone
● Business and executive alignment
● R&D
Set up reliable and efficient production processes.
Scale and automate DS/ML workloads.
Use popular tools.
Emphasize productivity.
Platform
Improve executive visibility and cross-team integration.
Build communication channels.
Think about products.
Get quick wins.
Plan for the future.
Philosophy
Strategy
Organization
Crawl Walk Run
Embed Data Science in the organization’s DNA.
Reproduce end-to-end and across multiple verticals.
Execution
Use agile processes for data science
● Iterate with sprints and standups
● Fail fast in R&D
Transparency is key
● Communicate frequently to your business partners and executives
● Make business partners and consumers an integral part of process
Collaborate with the data and platform teams
● Make your needs known and understood
● Beware shortcuts which build technical debt
Organization building -- “Crawl” stage
ML/AI Success
● Successful MVPs with a few models manually in production
● Starting to build an AI/ML Strategy
● In discovery phase for new projects and low-hanging fruit
Company
● Desire to become data driven
● Smaller in size (startup) or an existing organization with new data initiatives
Team
● 1-2 Data Scientists (likely) reporting to a CTO
● Acting as full stack data scientists
● Typically a math or computer science background
Keep using familiar tools
Common tools          Descriptions
Notebooks and IDEs    Python notebooks, RStudio, local IDEs
Languages             Python, R -- and potentially SQL, Scala, Java, etc.
ML libraries          Standard libraries, plus bring-your-own libraries and versions
Git                   Notebook versioning, and syncing across platforms with Git
Data                  Pandas, Spark, Koalas; any data sources or formats
Visualization         Matplotlib, Plotly, Seaborn, etc.
Integrations          Platforms must integrate with any libraries, systems, or services.
Platforms which are cloud-native and have both UIs and APIs are ideal.
Build around OSS standards for portability
# Downloads / month: 990K, 350K, 1.7M, 516K
Be more productive with self-service analytics
Compute resources
● Start up machines or clusters on demand
● Option 1: Use your own cluster
● Option 2: Share clusters, with separate Python env per user or project.
● Cost controls: Autoscaling, auto-termination, spot instances, cost tracking
● Governance: Cluster policies for enforcement
Libraries and environment
● Plug & play environments with popular ML libraries
● And customization: requirements.txt, conda.yaml
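One way to picture the cost-control and governance bullets on this slide is as a policy that every requested cluster is checked against. The sketch below is a minimal, platform-neutral illustration in plain Python. The field names (`max_workers`, `autotermination_minutes`, `use_spot`) and the default limits are hypothetical and are not taken from any real cluster-policy schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical description of a cluster a data scientist asks for."""
    max_workers: int               # upper bound for autoscaling
    autotermination_minutes: int   # idle minutes before shutdown; 0 means never
    use_spot: bool                 # whether spot instances are used


@dataclass(frozen=True)
class ClusterPolicy:
    """Hypothetical limits an admin enforces for cost control."""
    max_workers_limit: int = 8
    max_idle_minutes: int = 120
    require_spot: bool = False


def policy_violations(req: ClusterRequest, policy: ClusterPolicy) -> List[str]:
    """Return human-readable violations; an empty list means the request passes."""
    problems = []
    if req.max_workers > policy.max_workers_limit:
        problems.append(
            f"max_workers {req.max_workers} exceeds limit {policy.max_workers_limit}")
    if req.autotermination_minutes == 0:
        problems.append("auto-termination must be enabled")
    elif req.autotermination_minutes > policy.max_idle_minutes:
        problems.append(
            f"auto-termination {req.autotermination_minutes} min exceeds "
            f"{policy.max_idle_minutes} min")
    if policy.require_spot and not req.use_spot:
        problems.append("spot instances are required")
    return problems
```

A self-service flow would presumably run a check like this before starting the cluster, so that data scientists stay unblocked while limits are still enforced.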
Running example: ML prioritization of Sales opps
Platform enablement and improvement
Customer history and Sales data access
Long-term platform and data pipeline planning
Develop DL model: Use notebooks + TensorBoard for interactive development.
Analyze results: Review auto-logged MLflow metrics to analyze model performance.
Load data: Efficient data loading from S3, ADLS, etc.
Get an ML workspace: Simple machine or cluster creation. Ready-to-go DS environments.
Share results: Share insights with other stakeholders
Sync code: Import .py or .ipynb notebook, and sync with Git.
Discussion with Sales stakeholders to understand the problem and data, and to set expectations
Explanation of results and future potential to Sales
Build executive alignment and buy-in for long-term initiatives
DS team training and hiring
Organization building -- “Walk” stage
ML/AI Success
● Successful MVPs and production models in multiple business units
● Uniform testing standards are being established
Company
● Data initiatives being discussed at the executive level
● Business units pushing for data projects
● Emerging business champions for AI/ML
Team
● Data Science team(s) supporting multiple business units
● Integrations with software engineering for production
● Diversifying skill-sets for domain expertise
Scaling in a typical machine learning workflow
Stages: Data Preparation, Feature Engineering, Model Training, Model Evaluation, Model Deployment, Model Tuning, Model Consumption
● Koalas
● Spark DataFrames
● Spark UDFs
● Larger instances
● GPUs
● Distributed training (Spark ML, HorovodRunner, etc.)
● Hyperopt
● MLflow
● Spark DataFrames & UDFs
● Jobs & Model Servers
● MLflow
Auto-logging for reproducibility
Reproduce Run feature:
Reproducibility checklist:
✓ Code versioning
✓ Data versioning
✓ Cluster configuration
✓ Environment specification
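The four checklist items can be treated as fields that every run record must carry. The following is a minimal sketch in plain Python, not a tracking-library API: the `RunRecord` class, its field names, and the example values in the comments are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical metadata captured for one training run."""
    code_version: Optional[str] = None      # e.g. a Git commit hash
    data_version: Optional[str] = None      # e.g. a table version or snapshot ID
    cluster_config: Optional[dict] = None   # e.g. node type and worker count
    environment_spec: Optional[str] = None  # e.g. contents of requirements.txt


def missing_items(run: RunRecord) -> List[str]:
    """List the checklist items that were not recorded for this run."""
    return [f.name for f in fields(run) if not getattr(run, f.name)]


def is_reproducible(run: RunRecord) -> bool:
    """A run passes the checklist only when all four items are present."""
    return not missing_items(run)
```

Auto-logging, as named on the slide, would fill such a record without the data scientist having to remember each item. The check here only makes a gap visible.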
Job scheduling in platform
Automation: schedule, alert, retry, API
Automate and reproduce wherever possible
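The "alert, retry" part of job automation can be sketched in a few lines. This is an illustrative stand-in written with the Python standard library only, not a scheduler's real API. The function name `run_with_retries` and its parameters are assumptions, and `alert` is any callable that receives a message (print by default).

```python
import time
from typing import Callable


def run_with_retries(job: Callable[[], object],
                     max_retries: int = 2,
                     delay_seconds: float = 0.0,
                     alert: Callable[[str], None] = print) -> object:
    """Run `job`; retry on any exception, and alert once if all attempts fail."""
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            if attempt == attempts:
                alert(f"job failed after {attempts} attempts: {exc}")
                raise
            time.sleep(delay_seconds)
```

A platform scheduler would typically supply the schedule and the alert channel. The point of the sketch is only that transient failures are retried quietly while persistent ones reach a person.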
Secure: IAM Passthrough | Cluster Policies | Table ACLs
Your Existing Data Lake
Ingestion
Tables
Data Catalog
Feature Store
Azure Data Lake Storage
Amazon S3
Streaming
Batch
3rd Party Data
Marketplace
Files
for Data Science and ML
● Schema enforced high quality data
● Optimized performance
● Full data lineage / governance
● Reproducibility through time travel
ML Runtime
IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs
Infrastructure
Data Engineering Data Science
ML Engineer
Running example: ML-driven products
Scale up or out: Larger machines. Multiple GPUs. Distributed training.
Schedule training and inference jobs: Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks.
Automate for downstream consumption: Integrate with 3rd-party tools and systems to export ML insights to business stakeholders
Integrate with data pipelines: Automate ingestion of new data for ML and output of ML insights for business/product
Improve modeling process: Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging.
Executive <> Data Science team alignment on data-driven initiatives
Knowledge sharing across business units for ML-driven projects
Education for business stakeholders to understand ML models and insights
Platform adoption by multiple business units
Increased governance needs for platform, covering needs of more business units and personas
Platform plays a key role in establishing best practices
Organization building -- “Run” stage
ML/AI Success
● Successful production models in multiple verticals
● Uniform testing standards established
● Program to grow citizen data scientists
Company
● Data initiatives are reported at the board level
● Data driven decision making across an organization
Team
● Multiple Data Science teams across verticals led by an AI executive
● Standard development and deployment processes for models
● COE across verticals
model lifecycle
Tracking: Parameters, Metrics, Artifacts, Metadata, Models
Models: Flavor 1, Flavor 2, Custom Models
Model Registry: Staging, Production, Archived (v1, v2); Data Scientists, Deployment Engineers
Serving / Model Deployment Options: In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions
Example of ML Ops
Training
Model Validation Job
Production Batch Inference Job
Email
Create model version
Webhook for new model versions in staging
Comment with test results + transition request to production
Webhook for new model version in production
ML Ops person receives email that transition request to production was made
Approve new production model
Model Registry
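The flow on this slide (create a version, validate it in staging, request a transition, have a person approve it, then notify production consumers) can be mimicked with a toy in-memory registry. The sketch below is not the MLflow Model Registry API. `TinyRegistry`, its method names, and the event strings are hypothetical, and the `hooks` list merely stands in for webhooks. Archiving the previous production version on approval is an assumption of this sketch, not something the slide states.

```python
from typing import Callable, Dict, List

STAGES = ("None", "Staging", "Production", "Archived")


class TinyRegistry:
    """Toy in-memory model registry; NOT the MLflow API, just the flow."""

    def __init__(self) -> None:
        self.stage: Dict[int, str] = {}     # version -> current stage
        self.pending: Dict[int, str] = {}   # version -> requested stage
        self.hooks: List[Callable[[str, int], None]] = []  # stand-in for webhooks

    def _notify(self, event: str, version: int) -> None:
        for hook in self.hooks:
            hook(event, version)

    def create_version(self) -> int:
        """Register a new version and place it in Staging for validation."""
        version = len(self.stage) + 1
        self.stage[version] = "Staging"
        self._notify("new_version_in_staging", version)
        return version

    def request_transition(self, version: int, to_stage: str) -> None:
        """Record a request (e.g. after tests pass); a person approves it later."""
        if to_stage not in STAGES:
            raise ValueError(f"unknown stage: {to_stage}")
        self.pending[version] = to_stage
        self._notify("transition_requested", version)

    def approve(self, version: int) -> None:
        """Apply a pending request; archive whatever was in Production before."""
        to_stage = self.pending.pop(version)
        if to_stage == "Production":
            for other, stage in self.stage.items():
                if stage == "Production":
                    self.stage[other] = "Archived"  # assumption of this sketch
        self.stage[version] = to_stage
        self._notify("new_version_in_" + to_stage.lower(), version)
```

In this picture, a validation job would subscribe to `new_version_in_staging`, an email notifier to `transition_requested`, and a batch inference job to `new_version_in_production`.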
Modes of deployment
Mode         Latency      Cost
Batch        Minutes      Low
Streaming    Sec - Min    Low - Med
REST API     < 1 Sec      High
Embedded     varies       varies
Diagram labels: Model training, Model Tracking and Registry, Delta Lake / Feature Store, BI tools
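The latency and cost figures for the four deployment modes can be turned into a small lookup that narrows the options for a given latency requirement. This is a rough sketch only. The numeric thresholds in `ASSUMED_MIN_LATENCY_SECONDS` are a hypothetical reading of "Minutes", "Sec - Min", and "< 1 Sec", and `candidate_modes` is an illustrative helper name.

```python
from typing import List

# Rows copied from the slide: (mode, latency, cost).
DEPLOYMENT_MODES = [
    ("Batch", "Minutes", "Low"),
    ("Streaming", "Sec - Min", "Low - Med"),
    ("REST API", "< 1 Sec", "High"),
    ("Embedded", "varies", "varies"),
]

# Hypothetical numeric reading of the latency column, in seconds: the fastest
# response each mode is assumed to deliver. "Embedded" is left out because
# the slide only says "varies".
ASSUMED_MIN_LATENCY_SECONDS = {"Batch": 60.0, "Streaming": 1.0, "REST API": 0.0}


def candidate_modes(required_latency_seconds: float) -> List[str]:
    """Return modes whose assumed latency floor fits within the requirement."""
    return [mode for mode, _, _ in DEPLOYMENT_MODES
            if mode in ASSUMED_MIN_LATENCY_SECONDS
            and ASSUMED_MIN_LATENCY_SECONDS[mode] <= required_latency_seconds]
```

For example, `candidate_modes(0.2)` leaves only the REST API mode, while a requirement of an hour admits all three listed modes, at which point the cost column suggests starting with batch.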
Repeatable Data Science lifecycle
Business understanding
Executive sponsorship
Center of Excellence for DS & ML
End user feedback
Metric discussions and KPIs
Business value realization
Exploratory data analysis
Data ingestion and preparation
Model deployment and automation
ML modeling
Model monitoring and feedback
ML and Data platform and pipeline integration
Simple onboarding process for new teams and use cases
Data and resource sharing and governance
Standard handoff process for production jobs
Sharable documentation and usage education
Resources to learn more
Related talks and blogs
▪ Building Machine Learning Platforms Webinar
▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features
Customer success stories
▪ Comcast, Starbucks, H&M
▪ Searchable customer stories
Databricks
▪ Data science and machine learning product page
▪ Managed MLflow product page
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 

Recently uploaded (20)

一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 

Building Data Science into Organizations: Field Experience

  • 1. Building Data Science into Organizations: Field Experience Chris Robison Joseph Bradley Data + AI Summit 2021
  • 2. Our perspectives. Joseph Bradley ● Sr. Solutions Architect ● 2nd ML Engineer at Databricks ● Apache Spark committer and PMC member. Chris Robison ● Sr. Solutions Architect ● Former Director of Data Science and Omni-channel Marketing at Overstock.com ● Career data scientist and avid Apache Spark user
  • 3. The Data and AI Company ● 5000+ customers across the globe ● Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads ● ORIGINAL CREATORS
  • 4. So you want to do Data Science... 98.8% of Fortune 1,000 companies are investing in strategic Big Data & AI initiatives. 14.4% of Fortune 1,000 companies say they have deployed AI capabilities into widespread production. Source: NewVantage Partners
  • 5. Goals of a DS/ML/AI program. Short-term ● Validate that DS is worthwhile ● Get resources: ○ Data ○ Data Scientists ○ Executive sponsorship ● Show vision. Long-term ● Show business impact ● Increase productivity ● Scale DS across the organization
  • 6. Challenges of a DS/ML/AI program. Organization ● Team building: skill sets, hiring, and training ● Team organization: embedded vs. standalone ● Business and executive alignment ● R&D. Technology and platform ● Poor integration between Data Science and other data teams ● Planning for scale and production, under investment constraints
  • 7. Set up reliable and efficient production processes. Scale and automate DS/ML workloads. Use popular tools. Emphasize productivity. Platform Improve executive visibility and cross-team integration. Build communication channels. Think about products. Get quick wins. Plan for the future. Philosophy Strategy Organization Crawl Walk Run Embed Data Science in the organization’s DNA. Reproduce end-to-end and across multiple verticals.
  • 8. Execution Use agile processes for data science ● Iterate with sprints and standups ● Fail fast in R&D Transparency is key ● Communicate frequently to your business partners and executives ● Make business partners and consumers an integral part of process Collaborate with the data and platform teams ● Make your needs known and understood ● Beware shortcuts which build technical debt
  • 10. Organization building -- “Crawl” stage. ML/AI Success ● Successful MVPs with a few models manually in production ● Starting to build an AI/ML Strategy ● In discovery phase for new projects and low-hanging fruit. Company ● Desire to become data-driven ● Smaller in size (startup) or an existing organization with new data initiatives. Team ● 1-2 Data Scientists (likely) reporting to a CTO ● Acting as full stack data scientists ● Typically a math or computer science background
  • 11. Keep using familiar tools. Common tools and descriptions ● Notebooks and IDEs: Python notebooks, RStudio, local IDEs ● Languages: Python, R -- and potentially SQL, Scala, Java, etc. ● ML libraries: standard libraries, plus bring-your-own libraries and versions ● Git: notebook versioning, and syncing across platforms with Git ● Data: Pandas, Spark, Koalas; any data sources or formats ● Visualization: Matplotlib, Plotly, Seaborn, etc. ● Integrations: platforms must integrate with any libraries, systems, or services. Platforms which are cloud-native and have both UIs and APIs are ideal.
  • 12. Build around OSS standards for portability # Downloads / month 990K 350K 1.7M 516K
  • 13. Be more productive with self-service analytics Compute resources Libraries and environment With popular ML libraries Plug & play environments requirements.txt conda.yaml And customization Start up machines or clusters on demand Cost controls: Autoscaling, auto-termination, spot instances, cost tracking Governance: Cluster policies for enforcement Option 2: Share clusters, with separate Python env per user or project. Option 1: Use your own cluster
  • 14. Running example: ML prioritization of Sales opps Platform enablement and improvement Customer history and Sales data access Long-term platform and data pipeline planning Develop DL model Use notebooks + TensorBoard for interactive development. Analyze results Review auto-logged MLflow metrics to analyze model performance. Load data Efficient data loading from S3, ADLS, etc. Get an ML workspace Simple machine or cluster creation. Ready-to-go DS environments. Share results Share insights with other stakeholders Sync code Import .py or .ipynb notebook, and sync with Git. Discussion with Sales stakeholders to understand the problem and data, and to set expectations Explanation of results and future potential to Sales Build executive alignment and buy-in for long-term initiatives DS team training and hiring
  • 16. Organization building -- “Walk” stage. ML/AI Success ● Successful MVPs and production models in multiple business units ● Uniform testing standards are being established. Company ● Data initiatives being discussed at the executive level ● Business units pushing for data projects ● Emerging business champions for AI/ML. Team ● Data Science team(s) supporting multiple business units ● Integrations with software engineering for production ● Diversifying skill sets for domain expertise
  • 18. Scaling in a typical machine learning workflow. Data Preparation Feature Engineering Model Training Model Evaluation Model Deployment Model Tuning Model Consumption ● Koalas ● Spark DataFrames ● Spark UDFs ● Larger instances ● GPUs ● Distributed training (Spark ML, HorovodRunner, etc.) ● Hyperopt ● MLflow ● Spark DataFrames & UDFs ● Jobs & Model Servers ● MLflow
  • 19. Automate and reproduce wherever possible. Auto-logging for reproducibility, with a Reproduce Run feature. Reproducibility checklist: ✓ Code versioning ✓ Data versioning ✓ Cluster configuration ✓ Environment specification. Job scheduling in platform. Automation: schedule, alert, retry, API. Secure: IAM Passthrough | Cluster Policies | Table ACLs
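The four items on the reproducibility checklist can be pictured as one record stored with every run. The following is a minimal sketch in plain Python, not the platform's auto-logging or its Reproduce Run feature; the `RunRecord` class, its field names, and the helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields
import hashlib
import json


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what a run needs in order to be reproduced."""
    code_version: str     # e.g. a Git commit hash
    data_version: str     # e.g. a table version or snapshot id
    cluster_config: dict  # e.g. node type and worker count
    environment: list     # e.g. pinned library requirements


def missing_items(record):
    """Return the names of checklist items that were left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def fingerprint(record):
    """Return a stable hash so two runs can be compared for equal setup."""
    payload = json.dumps(
        {f.name: getattr(record, f.name) for f in fields(record)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A run whose `missing_items` list is non-empty cannot be reproduced from its record alone, and two runs with the same `fingerprint` were set up identically with respect to the four checklist items.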
  • 20. Your Existing Data Lake Ingestion Tables Data Catalog Feature Store Azure Data Lake Storage Amazon S3 Streaming Batch 3rd Party Data Marketplace Files for Data Science and ML ● Schema enforced high quality data ● Optimized performance ● Full data lineage / governance ● Reproducibility through time travel ML Runtime IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs Infrastructure Data Engineering Data Science ML Engineer
  • 21. Running example: ML-driven products Scale up or out Larger machines. Multiple GPUs. Distributed training. Schedule training and inference jobs Create jobs from notebooks or libraries. Add schedules, retries, and alerts. Model validation checks. Automate for downstream consumption Integrate with 3rd-party tools and systems to export ML insights to business stakeholders Integrate with data pipelines Automate ingestion of new data for ML and output of ML insights for business/product Scale tuning with Hyperopt + SparkTrials. Manage tuning with MLflow autologging. Improve modeling process Executive <> Data Science team alignment on data-driven initiatives Knowledge sharing across business units for ML-driven projects Education for business stakeholders to understand ML models and insights Platform adoption by multiple business units Increased governance needs for platform, covering needs of more business units and personas Platform plays a key role in establishing best practices
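The "schedules, retries, and alerts" plus "model validation checks" on this slide can be sketched as a small wrapper around a training or inference job. This is an illustrative plain-Python sketch, not a platform job API; `run_with_retries`, its parameters, and the default retry count are assumptions made for the example.

```python
def run_with_retries(job, validate, max_retries=2, alert=print):
    """Run a job, retrying on errors and alerting on every failure.

    job:       callable with no arguments that returns a result
    validate:  callable that returns True if the result is acceptable
    alert:     callable that receives a message (hypothetical alert channel)
    """
    last_error = None
    for attempt in range(1, max_retries + 2):
        try:
            result = job()
        except Exception as exc:
            # Transient failures are retried until the attempts run out.
            last_error = exc
            alert(f"attempt {attempt} failed: {exc}")
            continue
        if validate(result):
            return result
        # A result that fails validation is not retried: rerunning the
        # same job would not make a bad model acceptable.
        alert(f"attempt {attempt} produced a result that failed validation")
        raise ValueError("model validation check failed")
    raise RuntimeError(
        f"job failed after {max_retries + 1} attempts"
    ) from last_error
```

The design choice here is to retry only on exceptions and to stop immediately on a failed validation check, so that a poor model never reaches downstream consumers by way of a lucky retry.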
  • 23. Organization building -- “Run” stage. ML/AI Success ● Successful production models in multiple verticals ● Uniform testing standards established ● Program to grow citizen data scientists. Company ● Data initiatives are reported at the board level ● Data-driven decision making across the organization. Team ● Multiple Data Science teams across verticals led by an AI executive ● Standard development and deployment processes for models ● Center of Excellence (COE) across verticals
  • 24. model lifecycle Staging Production Archived Data Scientists Deployment Engineers v1 v2 Models Tracking Flavor 2 Flavor 1 Model Registry Custom Models In-Line Code Containers Batch & Stream Scoring Cloud Inference Services OSS Serving Solutions Serving Parameters Metrics Artifacts Models Metadata Model Deployment Options
  • 25. Example of ML Ops Training Model Validation Job Production Batch Inference Job Email Create model version Webhook for new model versions in staging Comment with test results + transition request to production Webhook for new model version in production ML Ops person receives email that transition request to production was made Approve new production model Model Registry
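The flow on this slide (new version lands in staging, a webhook triggers validation, a transition request is approved, a second webhook fires on production) can be sketched as a small in-memory state machine. This is a toy illustration only and does not use the MLflow Model Registry API; the `ModelRegistry` class and all of its method names are hypothetical.

```python
class ModelRegistry:
    """Toy registry: versions move Staging -> Production -> Archived."""

    def __init__(self):
        self.versions = {}    # version number -> current stage
        self.hooks = []       # callbacks invoked as hook(version, stage)
        self.pending = set()  # versions with an open transition request

    def on_transition(self, hook):
        """Register a webhook-style callback for stage changes."""
        self.hooks.append(hook)

    def _set_stage(self, version, stage):
        self.versions[version] = stage
        for hook in self.hooks:
            hook(version, stage)

    def create_version(self):
        """Register a new model version; it starts in Staging."""
        version = len(self.versions) + 1
        self._set_stage(version, "Staging")
        return version

    def request_production(self, version, test_passed):
        """Record the validation result; open a request only if it passed."""
        if self.versions.get(version) != "Staging":
            raise ValueError("only staged versions can be promoted")
        if test_passed:
            self.pending.add(version)
        return test_passed

    def approve(self, version):
        """Approve a pending request and archive the old production model."""
        if version not in self.pending:
            raise ValueError("no open transition request for this version")
        self.pending.discard(version)
        for other, stage in list(self.versions.items()):
            if stage == "Production":
                self._set_stage(other, "Archived")
        self._set_stage(version, "Production")
```

In this sketch the validation job and the batch inference job would both be callbacks registered with `on_transition`, one reacting to "Staging" and the other to "Production", while `approve` stands in for the human approval step.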
  • 26. Modes of deployment. Model training, Model Tracking and Registry, Delta Lake / Feature Store, BI tools. Latency and cost by mode ● Batch: latency Minutes, cost Low ● Streaming: latency Sec - Min, cost Low - Med ● REST API: latency < 1 Sec, cost High ● Embedded: latency varies, cost varies
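The latency and cost trade-off on this slide can be read as a simple rule: pick the cheapest mode that still fits the latency budget. The sketch below encodes that reading in plain Python. The numeric latency bounds and cost ranks are assumptions made for illustration (the slide only gives "Minutes", "Sec - Min", and "< 1 Sec"), and the Embedded mode is left out because the slide lists both its latency and cost as "varies".

```python
# (mode, assumed worst-case latency in seconds, relative cost rank)
# The numbers are illustrative assumptions, not figures from the talk.
MODES = [
    ("Batch", 3600.0, 1),     # slide: latency "Minutes", cost "Low"
    ("Streaming", 60.0, 2),   # slide: latency "Sec - Min", cost "Low - Med"
    ("REST API", 1.0, 3),     # slide: latency "< 1 Sec", cost "High"
]


def choose_mode(latency_budget_seconds):
    """Return the cheapest mode whose assumed latency fits the budget.

    Returns None when no listed mode is fast enough.
    """
    candidates = [
        (cost, name)
        for name, latency, cost in MODES
        if latency <= latency_budget_seconds
    ]
    if not candidates:
        return None
    return min(candidates)[1]
```

For example, under these assumed bounds a nightly report with a one-day budget maps to batch scoring, while a use case that needs an answer within a second maps to a REST endpoint.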
  • 27. Repeatable Data Science lifecycle Business understanding Executive sponsorship Center of Excellence for DS & ML End user feedback Metric discussions and KPIs Business value realization Exploratory data analysis Data ingestion and preparation Model deployment and automation ML modeling Model monitoring and feedback ML and Data platform and pipeline integration Simple onboarding process for new teams and use cases Data and resource sharing and governance Standard handoff process for production jobs Sharable documentation and usage education
  • 28. Resources to learn more Related talks and blogs ▪ Building Machine Learning Platforms Webinar ▪ MLflow Model Registry on Databricks Simplifies MLOps With CI/CD Features Customer success stories ▪ Comcast, Starbucks, H&M ▪ Searchable customer stories Databricks ▪ Data science and machine learning product page ▪ Managed MLflow product page