How to Build Consistent and Scalable Workspaces for Data Science Teams

This document discusses how to build consistent and scalable workspaces for data science teams. To make the environment consistent, it recommends identifying system requirements for a base Docker image, stabilizing dependencies, increasing test coverage, and bringing the continuous integration (CI) platform in line with that environment. To scale workloads, it suggests creating a pool of worker machines, setting up an asynchronous task queue, and giving data scientists a simple command line interface. Tasks then run in isolated, identical environments and can make flexible use of cloud computing resources. Benefits include guaranteed task environments, one-time configuration, and an extensible command line interface. The use cases presented are quality assurance testing and parallelizable data and model tasks.

How to build consistent, scalable
workspaces for data science teams
Elaine Lee

Data science is hard.
Doing data science is even harder.
Ensuring enough resourcesManaging dependencies
http://www.seriouseats.com/assets_c/2014/06/20140525-294370-best-deep-dish-pizza-art-of-pizza-
primary-thumb-1500xauto-404176.jpghttps://s-media-cache-ak0.pinimg.com/736x/91/6b/f0/916bf0f23660fc7019353800668060af.jpg

Nail it down
Identify system requirements for base Docker image
Stabilize dependencies for data science work environment
Increase test coverage
Get continuous integration (CI) platform on the same page
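
One way to check the "stabilize dependencies" step is to flag any requirement that is not pinned to an exact version. The sketch below is only an illustration: it assumes a pip-style requirements file, the function name unpinned_requirements is hypothetical, and the package versions in the sample are made up.

```python
import re

# Matches a requirement pinned to one exact version, e.g. "pandas==0.18.1".
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[A-Za-z0-9_.!+\-]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    # Illustrative content only; these are not the talk's actual dependencies.
    sample = """
    numpy==1.11.0
    pandas>=0.18   # floating: may differ between machines
    scikit-learn
    """
    print(unpinned_requirements(sample))
```

A check like this could run as part of the test suite so that a floating version fails the build before it reaches the base image.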

Scale it up
Create a pool of worker machines ready to accept jobs
Set up an asynchronous task queue
Provide a simple command line interface for data scientists
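
The worker pool and asynchronous queue can be sketched on a single machine. The slide text does not name the queue implementation, so this example uses Python's standard queue and threading modules as a stand-in; threads play the role of worker machines, and run_pool is a hypothetical name.

```python
import queue
import threading


def run_pool(tasks, num_workers=3):
    """Run (func, args) pairs from a shared queue on a fixed pool of workers."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:          # sentinel: no more work for this worker
                jobs.task_done()
                return
            job_id, func, args = item
            try:
                outcome = ("ok", func(*args))
            except Exception as exc:  # record the failure, keep the worker alive
                outcome = ("error", repr(exc))
            with lock:
                results[job_id] = outcome
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job_id, (func, args) in enumerate(tasks):
        jobs.put((job_id, func, args))  # returns at once; a worker picks it up later
    for _ in threads:
        jobs.put(None)                  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pool([(pow, (2, 10)), (len, ("abc",)), (int, ("oops",))]))
```

The design point carried over from the slide is that submitting a job does not block: the caller only enqueues a request, and whichever worker is free takes it.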

Putting it all together
Commit pushed to Github -> Pull changes -> Start Docker container -> Run test suite -> Report Pass/Fail -> Export image for commit
s3: 123abc…
Request arrives in queue -> workers: Get image for commit -> Start container from image -> Run task -> Report result
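
The two flows in this diagram can be written out as the shell commands each side might issue. This is a sketch under stated assumptions, not the talk's implementation: it assumes images are tagged with the commit hash, exported with docker save, and copied to S3 with the AWS CLI. The repository name, bucket name, commit value, and function names are all hypothetical. The code only builds the command lists and does not run them.

```python
def image_tag(repo, commit):
    """Tag an image with the commit hash so each commit maps to one image."""
    return f"{repo}:{commit}"


def ci_commands(repo, commit, bucket, test_cmd):
    """Commands a CI job might run after a push: build, test, export, upload."""
    tag = image_tag(repo, commit)
    archive = f"{commit}.tar"
    return [
        ["docker", "build", "-t", tag, "."],
        ["docker", "run", "--rm", tag] + list(test_cmd),
        ["docker", "save", "-o", archive, tag],
        ["aws", "s3", "cp", archive, f"s3://{bucket}/{archive}"],
    ]


def worker_commands(repo, commit, bucket, task_cmd):
    """Commands a worker might run for one queued request: fetch, load, run."""
    tag = image_tag(repo, commit)
    archive = f"{commit}.tar"
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/{archive}", archive],
        ["docker", "load", "-i", archive],
        ["docker", "run", "--rm", tag] + list(task_cmd),
    ]


if __name__ == "__main__":
    # "deadbeef" is a placeholder commit hash, not a value from the slides.
    for cmd in ci_commands("ds-env", "deadbeef", "example-bucket", ["pytest"]):
        print(" ".join(cmd))
    for cmd in worker_commands("ds-env", "deadbeef", "example-bucket", ["Rscript", "model.R"]):
        print(" ".join(cmd))
```

Because the worker loads the image built for a specific commit, the task runs in the same environment that passed the test suite for that commit.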

Benefits
Flexible to any composition of EC2 instances
- Extensible to EMR
Task environment guaranteed
- Isolated from other tasks
- Identical to conditions at time of development
One-time configuration
- EC2 AMI
Extensible command line interface
- R interface
- Cluster management
- Job monitoring
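
An extensible command line interface can be sketched with subcommands, so that features such as job monitoring are added as new subcommands without changing existing ones. The program name dsjob, the subcommands, and the flags below are hypothetical; the slides do not show the actual interface.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per capability."""
    parser = argparse.ArgumentParser(
        prog="dsjob", description="Submit and monitor containerized tasks."
    )
    sub = parser.add_subparsers(dest="command", required=True)

    submit = sub.add_parser("submit", help="queue a task for a given commit")
    submit.add_argument("--commit", required=True,
                        help="commit hash whose image should run the task")
    submit.add_argument("task", nargs="+",
                        help="command to run inside the container")

    status = sub.add_parser("status", help="check on a queued job")
    status.add_argument("job_id", help="identifier returned by submit")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["submit", "--commit", "deadbeef", "Rscript", "model.R"]
    )
    print(args.command, args.commit, args.task)
```

Adding cluster management would mean registering another subparser in build_parser, which leaves the submit and status commands untouched.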

Use case: Quality assurance
CI testing
Other tests
- Data validation
- Model consistency
http://img.pandawhale.com/post-52368-thanks-obama-making-sandwich-m-whnc.jpeg
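
Data validation and model consistency checks can be expressed as small functions that a queued task runs and reports on. The sketch below is an illustration only: the column names, the allowed range, and the tolerance are assumptions, and both function names are hypothetical.

```python
def validate_rows(rows, required, ranges):
    """Return a list of readable problems found in a list of dict rows."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    return problems


def is_consistent(new_metric, baseline_metric, tolerance):
    """True when a rebuilt model's metric stays within tolerance of the baseline."""
    return abs(new_metric - baseline_metric) <= tolerance


if __name__ == "__main__":
    rows = [{"age": 30, "income": 50000}, {"age": -1, "income": None}]
    print(validate_rows(rows, ["age", "income"], {"age": (0, 120)}))
    print(is_consistent(0.81, 0.80, tolerance=0.02))
```

Running checks like these in the same image as the CI tests keeps the validation environment identical to the one used during development.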

Use case: Parallelizable tasks
Data manipulation
- Feature engineering
Model builds
- Advanced machine learning algorithms
- Hyperparameter search
https://pbs.twimg.com/media/Buw8Bz6IIAAxgxg.png
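
A hyperparameter search parallelizes well because each combination of settings is an independent task. The sketch below expands a grid into one parameter set per task; the parameter names are made up, and expand_grid is a hypothetical helper rather than part of the talk.

```python
from itertools import product


def expand_grid(grid):
    """Turn {'name': [values...]} into one parameter dict per combination."""
    names = sorted(grid)  # fixed order so the task list is reproducible
    return [
        dict(zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


if __name__ == "__main__":
    for params in expand_grid({"depth": [2, 4], "rate": [0.1, 0.01]}):
        print(params)  # each dict could be submitted as one queued task
```

Each resulting dict can be sent to the queue as its own request, so the number of combinations evaluated at once is limited only by the size of the worker pool.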

Elaine Lee
Data Engineer
elaine@elaineklee.com
@elaineklee
avant.com

Editor's Notes

http://static.techspot.com/articles-info/788/images/world-wide-web-25-years-super.jpg
http://www.seriouseats.com/assets_c/2014/06/20140525-294370-best-deep-dish-pizza-art-of-pizza-primary-thumb-1500xauto-404176.jpg
https://s-media-cache-ak0.pinimg.com/736x/91/6b/f0/916bf0f23660fc7019353800668060af.jpg
https://utbrudd.bouvet.no/wp-content/uploads/2015/02/jenkins-docker.png
https://www.r-project.org/Rlogo.png
https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/AmazonWebservices_Logo.svg/2000px-AmazonWebservices_Logo.svg.png
http://www.netuitive.com/wp-content/uploads/integration_logo_celery_new.png
https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Python_logo_and_wordmark.svg/260px-Python_logo_and_wordmark.svg.png
http://download.redis.io/logocontest/82.png
http://icons.iconarchive.com/icons/fasticon/servers/128/server-icon.png
dockeRization image
https://www.iconfinder.com/icons/298878/terminal_icon#size=128
http://img.pandawhale.com/post-52368-thanks-obama-making-sandwich-m-whnc.jpeg
https://pbs.twimg.com/media/Buw8Bz6IIAAxgxg.png
