How to build consistent, scalable
workspaces for data science teams
Elaine Lee
Data science is hard.
Doing data science is even harder.
Ensuring enough resourcesManaging dependencies
http://www.seriouseats.com/assets_c/2014/06/20140525-294370-best-deep-dish-pizza-art-of-pizza-
primary-thumb-1500xauto-404176.jpghttps://s-media-cache-ak0.pinimg.com/736x/91/6b/f0/916bf0f23660fc7019353800668060af.jpg
Nail it down
Identify system requirements for base Docker image
Stabilize dependencies for data science work environment
Increase test coverage
Get continuous integration (CI) platform on the same page
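One way to approach "stabilize dependencies" is to fail the CI build whenever a dependency is not pinned to an exact version. The sketch below is illustrative only: it assumes a pip-style requirements list, which the slides do not mention (the deck's image credits point to R), and the function name `unpinned_requirements` is hypothetical.

```python
import re

# A line counts as pinned only if it has the form "name==version".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==[^=\s]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version."""
    problems = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        if not PINNED.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    sample = ["numpy==1.9.2", "pandas>=0.16", "scipy", "# tooling", ""]
    print(unpinned_requirements(sample))
```

A CI job could run a check like this before building the base image and stop the build if the returned list is not empty.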
Scale it up
Create a pool of worker machines ready to accept jobs
Set up an asynchronous task queue
Provide a simple command line interface for data scientists
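The worker pool and task queue can be pictured with a small local stand-in. The deck's image credits include Celery and Redis logos, which suggests those tools were involved, but the sketch below uses neither: it is a minimal illustration built on the Python standard library, with threads standing in for worker machines. The name `run_worker_pool` is hypothetical.

```python
import queue
import threading


def run_worker_pool(tasks, num_workers=4):
    """Run (name, func, args) tasks on worker threads; return outcomes by name."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            name, func, args = item
            try:
                outcome = ("ok", func(*args))
            except Exception as exc:  # record the failure, keep the worker alive
                outcome = ("error", repr(exc))
            with lock:
                results[name] = outcome
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:
        jobs.put(task)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_worker_pool([("square", pow, (3, 2)), ("bad", int, ("x",))], 2))
```

The caller enqueues work and collects results without caring which worker ran which task, which is the property that lets the pool grow or shrink.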
Putting it all together
Continuous integration
- Commit pushed to GitHub
- Pull changes
- Start Docker container
- Run test suite
- Report pass/fail
- Export image for commit
Workers
- Request arrives in queue
- Get image for commit
- Start container from image
- Run task
- Report result
Image store: S3, images labeled by commit (123abc…)
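The core idea in this flow is that an image is keyed by commit, so a worker can later recreate the exact environment for that commit. The sketch below simulates that bookkeeping only: a plain dict stands in for S3, strings stand in for Docker images, and no container is started. The names `image_key`, `ci_on_push`, and `handle_request` are hypothetical, and exporting an image only when tests pass is an assumption, not something the slide states.

```python
def image_key(commit_sha):
    """Storage key under which the image for a commit is kept (assumed layout)."""
    return "images/{}.tar".format(commit_sha)


def ci_on_push(commit_sha, run_tests, store):
    """CI side: run the test suite and export an image for the commit on success."""
    passed = bool(run_tests())
    if passed:
        # Placeholder payload; a real system would store an exported image.
        store[image_key(commit_sha)] = "image-for-" + commit_sha
    return passed


def handle_request(commit_sha, task, store):
    """Worker side: fetch the image for the commit, then run the task."""
    key = image_key(commit_sha)
    if key not in store:
        raise LookupError("no exported image for commit " + commit_sha)
    image = store[key]
    return {"commit": commit_sha, "image": image, "result": task()}


if __name__ == "__main__":
    store = {}
    ci_on_push("123abc", lambda: True, store)
    print(handle_request("123abc", lambda: 6 * 7, store))
```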
Benefits
Flexible to any composition of EC2 instances
- Extensible to EMR
Task environment guaranteed
- Isolated from other tasks
- Identical to conditions at time of development
One-time configuration
- EC2 AMI
Extensible command line interface
- R interface
- Cluster management
- Job monitoring
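An extensible command line interface is commonly built from subcommands, so that job submission, cluster management, and job monitoring can be added independently. The sketch below uses Python's `argparse` to show the shape. The program name `dsjobs`, the subcommands, and every option are hypothetical, since the slides do not show the actual interface.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per area of the interface."""
    parser = argparse.ArgumentParser(prog="dsjobs")  # hypothetical tool name
    sub = parser.add_subparsers(dest="command")
    sub.required = True

    # Submit a script to run against the environment of a given commit.
    submit = sub.add_parser("submit", help="queue a script for a commit")
    submit.add_argument("script")
    submit.add_argument("--commit", required=True)

    # Job monitoring.
    status = sub.add_parser("status", help="show the state of a job")
    status.add_argument("job_id")

    # Cluster management.
    cluster = sub.add_parser("cluster", help="manage the worker pool")
    cluster.add_argument("action", choices=["start", "stop", "resize"])
    cluster.add_argument("--workers", type=int, default=1)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

A new capability then becomes one more `add_parser` call rather than a change to existing commands.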
Use case: Quality assurance
CI testing
Other tests
- Data validation
- Model consistency
http://img.pandawhale.com/post-52368-thanks-obama-making-sandwich-m-whnc.jpeg
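The two "other tests" named here can be sketched as small checks that run alongside the CI suite. This is an illustration under assumptions: rows are plain dicts, "model consistency" is read as "predictions match a stored baseline within a tolerance", and the function names are hypothetical.

```python
import math


def validate_rows(rows, required, ranges):
    """Return (row_index, column, problem) tuples for invalid data."""
    errors = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                errors.append((i, col, "missing"))
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                errors.append((i, col, "out of range"))
    return errors


def predictions_consistent(new, baseline, tol=1e-6):
    """True if both prediction lists have equal length and agree within tol."""
    return len(new) == len(baseline) and all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(new, baseline)
    )


if __name__ == "__main__":
    rows = [{"age": 30, "income": 50000}, {"age": -1, "income": None}]
    print(validate_rows(rows, ["age", "income"], {"age": (0, 120)}))
    print(predictions_consistent([0.1, 0.2], [0.1, 0.2]))
```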
Use case: Parallelizable tasks
Data manipulation
- Feature engineering
Model builds
- Advanced machine learning algorithms
- Hyperparameter search
https://pbs.twimg.com/media/Buw8Bz6IIAAxgxg.png
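Hyperparameter search parallelizes well because each candidate can be scored independently. The sketch below expands a grid and scores candidates concurrently with a local thread pool, which here is only a stand-in for the pool of worker machines described earlier. The names `expand_grid` and `parallel_search` and the toy scoring function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Turn {"name": [values]} into a list of parameter dicts."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def parallel_search(score, grid, max_workers=4):
    """Score every candidate concurrently; return (best_params, best_score)."""
    candidates = expand_grid(grid)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score, candidates))  # order matches candidates
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    grid = {"depth": [1, 2, 3], "rate": [0.1, 0.5]}

    def toy_score(params):
        # Toy objective that peaks at depth=2, rate=0.5.
        return -((params["depth"] - 2) ** 2) - (params["rate"] - 0.5) ** 2

    print(parallel_search(toy_score, grid))
```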
Elaine Lee
Data Engineer
elaine@elaineklee.com
@elaineklee
avant.com



Editor's Notes

  1. http://static.techspot.com/articles-info/788/images/world-wide-web-25-years-super.jpg http://www.seriouseats.com/assets_c/2014/06/20140525-294370-best-deep-dish-pizza-art-of-pizza-primary-thumb-1500xauto-404176.jpg https://s-media-cache-ak0.pinimg.com/736x/91/6b/f0/916bf0f23660fc7019353800668060af.jpg
  2. https://utbrudd.bouvet.no/wp-content/uploads/2015/02/jenkins-docker.png https://www.r-project.org/Rlogo.png
  3. https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/AmazonWebservices_Logo.svg/2000px-AmazonWebservices_Logo.svg.png http://www.netuitive.com/wp-content/uploads/integration_logo_celery_new.png https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Python_logo_and_wordmark.svg/260px-Python_logo_and_wordmark.svg.png http://download.redis.io/logocontest/82.png http://icons.iconarchive.com/icons/fasticon/servers/128/server-icon.png
  4. dockeRization image https://www.iconfinder.com/icons/298878/terminal_icon#size=128
  5. http://img.pandawhale.com/post-52368-thanks-obama-making-sandwich-m-whnc.jpeg
  6. https://pbs.twimg.com/media/Buw8Bz6IIAAxgxg.png