Scaling Data Science
At Stitch Fix
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
January 2017
How many
Data Scientists do you have?
At Stitch Fix we have ~80
Two Data Scientist facts:
1. Ability to spin up their own
resources*.
2. End to end,
they’re responsible.
But what do they do?
What is Stitch Fix?
~4500 Job Definitions
Lots of Compute &
Data Movement!
So how did we get to our scale?
Reducing Contention

Contention is correlated with unhappy Data Scientists & burning infrastructure
Contention on:
● Access to Data
● Access to Compute Resources
○ Ad-hoc
○ Production
Focus of this talk: access to data, and ad-hoc access to compute.
Fellow Collaborators
Horizontal team focused on Data Scientist Enablement:
jeff, akshay, jacob, tarek, kurt, derek, patrick, thomas, steven, liz, alex
Data Access:
Unhappy DS &
Burning Infrastructure
Data Access: ☹ DS & Infrastructure
● Can’t write fast enough
● Can’t read fast enough
● These two interact
● Not enough space
● Limited by tools
So how does Stitch Fix
mitigate these problems?
Data Access:
S3 & Hive Metastore
What is S3?
● Amazon’s Simple Storage Service.
● Infinite* storage.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Can read, write, delete, BUT NOT append (or overwrite).
● Lots of companies rely on it -- famously Dropbox.
* For all intents and purposes
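To make the “read, write, delete, but no append” point concrete, here is a minimal sketch using boto3; the bucket and key names are just the illustrative ones from the bullet above.

Python:
import boto3

s3 = boto3.client("s3")

# Write: each put_object uploads a complete object; there is no append.
s3.put_object(Bucket="my.bucket", Key="path/to/files/file.txt", Body=b"hello\n")

# Read: fetch the whole object back.
body = s3.get_object(Bucket="my.bucket", Key="path/to/files/file.txt")["Body"].read()

# "Overwriting" is really writing a brand-new object under the same key,
# which is where the eventual-consistency caveat (covered later) comes in.
s3.put_object(Bucket="my.bucket", Key="path/to/files/file.txt", Body=b"hello again\n")

# Delete.
s3.delete_object(Bucket="my.bucket", Key="path/to/files/file.txt")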
S3 @ Stitch Fix

Writing Data:                   Hard to saturate
Reading Data:                   Hard to saturate
Writing & Reading Interference: Haven’t experienced
Space:                          “Infinite”
Tooling:                        Lots of options

● Data Scientists’ main datastore since very early on.
● S3 essentially removes any real worries with respect to data contention!
S3 is not a complete solution!
What is the Hive Metastore?
● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
Hive Metastore: sold_items

Partition   Location
20161001    s3://bucket/sold_items/20161001
...         ...
20161031    s3://bucket/sold_items/20161031
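Conceptually, the metastore is just this partition-to-location mapping. The toy sketch below is not the real metastore API; it only illustrates why repointing a partition is a metadata-only operation (no data moves on S3), which the Batch ID pattern later relies on.

Python:
# Toy model of what the metastore tracks for one table (illustrative only).
sold_items_partitions = {
    "20161001": "s3://bucket/sold_items/20161001",
    "20161031": "s3://bucket/sold_items/20161031",
}

def location_for(partition: str) -> str:
    # Readers ask the metastore where a partition lives, then read those files from S3.
    return sold_items_partitions[partition]

# "Replacing" a partition's data is a pure metadata update; nothing is copied or deleted on S3.
sold_items_partitions["20161031"] = "s3://bucket/sold_items/20161031/some-new-location"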
Hive Metastore @ Stitch Fix
Brought in to:
● Bring centralized order to data being stored on S3
● Provide metadata to build more tooling on top of
● Enable use of existing open source solutions
● Our central source of truth!
● Never have to worry about space.
● Trade away some immediate speed; in return, read & write performance is consistent.
○ “Contention Free”
● Decoupled data storage layer from data manipulation.
○ Very amenable to supporting a lot of different data sets and tools.
S3 + Hive Metastore
Our Current Picture
Caveat: Eventual Consistency
● Replacing data in a partition

Replacing a file on S3 (diagram: a file’s old and new versions, A and B)
● S3 is eventually consistent*
● These bugs are hard to track down
● Need everyone to be able to trust the data.
* for existing files
● Use Hive Metastore to easily control partition source of truth
● Principles:
○ Never delete
○ Always write to a new place each time a partition changes
● What do we mean by “new place”?
○ Use an inner directory → called Batch ID
Avoiding Eventual Consistency
Batch ID Pattern

sold_items
Date       Location
20161001   s3://bucket/sold_items/20161001/20161002002334/
...        ...
20161031   s3://bucket/sold_items/20161031/20161101002256/
           → s3://bucket/sold_items/20161031/20161102234252/

● Overwriting a partition is just a matter of updating the location
● To the user this is a hidden inner directory
● Avoids eventual consistency issue
● Jobs finish on the data they started on
● Full partition history:
○ Can rollback
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily
■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
Batch ID Pattern Benefits
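A minimal sketch of what a batch-ID write could look like; write_files and update_partition_location are hypothetical helpers standing in for the internal APIs, not Stitch Fix’s actual code.

Python:
from datetime import datetime, timezone

def generate_batch_id() -> str:
    # e.g. "20161102234252" -- every write gets a fresh inner directory.
    return datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

def write_files(location, rows):
    # Hypothetical: upload the data files under `location` on S3.
    ...

def update_partition_location(table, date, location):
    # Hypothetical: point the Hive Metastore's partition entry at `location`.
    ...

def write_partition(table: str, table_root: str, date: str, rows) -> str:
    # Never delete, never overwrite: write to a new place, then flip the metadata.
    location = f"{table_root}/{date}/{generate_batch_id()}/"
    write_files(location, rows)
    update_partition_location(table, date, location)
    return location

# Old batch directories stay around, so in-flight jobs keep a stable view of the data
# and any partition can be rolled back by repointing its location to an earlier batch ID.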
Data Access:
Tooling Integration
Recall
Data Access:
Tooling Integration
1. Enforcing Batch IDs
2. File Formats
3. Schemas for all Tools
4. Schema Evolution
5. Redshift
6. Spark
● Problem:
○ How do you enforce remembering to add a Batch ID into your S3 path?
● Solution:
○ By building APIs
■ For all tooling!
1. Enforcing Batch IDs
1. Enforcing Batch IDs via an API
Python:
store_dataframe(df, dest_db, dest_table, partitions=['2016'])
df = load_dataframe(src_db, src_table, partitions=['2016'])

R:
sf_writer(data = result,
          namespace = dest_db,
          resource = dest_table,
          partitions = c(as.integer(opt$ETL_DATE)))

df <- sf_reader(namespace = src_db,
                resource = src_table,
                partitions = c(as.integer(opt$ETL_DATE)))
1. Enforcing Batch IDs: APIs for DS
Tool     | Reading from S3+HM    | Writing to S3+HM
Python   | Internal API          | Internal API
R        | Internal API          | Internal API
Spark    | Standard API          | Internal API
PySpark  | Standard API          | Internal API
Presto   | Standard API          | N/A
Redshift | Load via Internal API | N/A
● Problem:
○ What format do you use to work with all the tools?
● Possible solutions:
○ Parquet
○ Some simple format {JSON, Delimited File} + gzip
○ Avro, Thrift, Protobuffers
● Philosophy: minimize operational burden:
○ Choose `\0` (null) delimited, gzipped files
■ Easy to write an API for this, for all tools.
2. File Format
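A sketch of what reading and writing that format can look like in Python; the column layout and file name are made up for illustration.

Python:
import gzip

rows = [("20161031", "12345", "19.99"), ("20161031", "67890", "49.50")]

# Write: columns joined by the null character, one record per line, gzipped.
with gzip.open("part-00000.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write("\x00".join(row) + "\n")

# Read: split each line back apart on the null delimiter.
with gzip.open("part-00000.gz", "rt", encoding="utf-8") as f:
    parsed = [line.rstrip("\n").split("\x00") for line in f]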
● Problem:
○ Can’t necessarily have a single schema for all tools
■ E.g.
● Different type definitions.
● Solution:
○ Define parallel schemas that have specific types redefined in the Hive Metastore
■ E.g.
● Can redefine decimal type to be double for Presto*.
● This parallel schema would be named prod_presto.
○ Still points to same underlying data.
3. Schemas for all Tools
* Presto didn’t have functioning decimal support at the time
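A toy illustration of the parallel-schema idea (not real metastore DDL): two table definitions with different column types pointing at the same underlying S3 location; the column names are made up.

Python:
# Illustrative only: the same S3 location registered under two schemas,
# with the decimal column relaxed to double for Presto.
prod_sold_items = {
    "location": "s3://bucket/sold_items/20161031/20161101002256/",
    "columns": {"item_id": "bigint", "price": "decimal(10,2)"},
}
prod_presto_sold_items = {
    "location": prod_sold_items["location"],              # same files on S3
    "columns": {"item_id": "bigint", "price": "double"},  # type redefined for Presto
}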
● Problem:
○ How do you handle schema evolution with 80+ Data Scientists?
■ E.g.
● Add a new column
● Delete an old column
● Solution:
○ Append columns to end of schemas.
○ Rename columns as deprecated -- breaks code, but not data.
4. Schema Evolution
● Wait, what? Redshift?
○ Predates use of Spark & Presto
○ Redshift was brought in to help joining data
■ Previously DS had to load data & perform joins in R/Python
○ Data Scientists loved Redshift too much:
■ It became a huge source of contention
■ Have been migrating “production” off of it
5. Redshift
● Need:
○ Still want to use Redshift for ad-hoc analysis
● Problem:
○ How do we keep data on S3 in sync with Redshift?
● Solution:
○ API that abstracts syncing data with Redshift
■ Keeps schemas in sync
■ Uses standard data warehouse staged table insertion pattern
5. Redshift
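A sketch of the staged-table insertion pattern referenced above, shown with psycopg2 against Redshift; the table names, partition column, S3 prefix, and IAM role are placeholders, and the real internal API also keeps schemas in sync.

Python:
import psycopg2

# Placeholder connection string.
conn = psycopg2.connect("host=<redshift-endpoint> dbname=analytics user=etl password=<...>")

with conn, conn.cursor() as cur:
    # 1. Load the new batch into a throwaway staging table.
    cur.execute("CREATE TEMP TABLE sold_items_stage (LIKE prod.sold_items);")
    cur.execute("""
        COPY sold_items_stage
        FROM 's3://bucket/sold_items/20161031/20161102234252/'
        IAM_ROLE '<redshift-copy-role-arn>';
    """)  # format options (delimiter, compression) omitted here
    # 2. Swap the partition's rows atomically within one transaction.
    cur.execute("DELETE FROM prod.sold_items WHERE date_key = '20161031';")
    cur.execute("INSERT INTO prod.sold_items SELECT * FROM sold_items_stage;")
# Leaving the `with conn:` block commits the transaction.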
● What does our integration with Spark look like?
○ Running on Amazon EMR using Netflix's Genie
■ Prod & Dev clusters
○ S3 still source of truth
■ Have custom write API:
● Enforces Batch IDs
● Scala based library making use of EMRFS
● Also exposed in Python for PySpark use
○ Heavy users of Spark SQL
○ It’s the main production workhorse
6. Spark
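What that split looks like from PySpark, as a sketch: reads use standard Spark SQL against the shared metastore, writes go through the internal API (store_dataframe, shown earlier, which is internal to Stitch Fix); the module, table, and column names here are illustrative.

Python:
from pyspark.sql import SparkSession
from sf_data import store_dataframe  # hypothetical module name for the internal write API

spark = (SparkSession.builder
         .appName("sold-items-rollup")
         .enableHiveSupport()   # tables resolve through the shared Hive Metastore
         .getOrCreate())

# Read via the standard API: plain Spark SQL.
daily = spark.sql("""
    SELECT item_id, SUM(price) AS revenue
    FROM prod.sold_items
    WHERE date_key = '20161031'
    GROUP BY item_id
""")

# Write via the internal API, which adds the batch ID and updates the metastore.
store_dataframe(daily, "prod", "daily_revenue", partitions=["20161031"])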
Ad-hoc
Compute Access:
Using Docker
Data Scientist’s Ad-hoc workflow
The faster this iteration cycle, the faster Data Scientists can work.
Scaling this part of the workflow is the focus.
Ad hoc Infra: Options

Workstation            Env. Mgmt.   Contention Points
Laptop                 Low          Memory & CPU
Shared Instances       Medium       Isolation
Individual Instances   High → Low   Time & Money
● Control of environment
○ Data Scientists don’t need to worry about env.
● Isolation
○ Can host many Docker containers on a single machine.
● Better host management
○ Allows central control of machine types.
Why Docker?
● Has:
○ Our internal API libraries
○ Jupyter Hub Notebooks:
■ PySpark, IPython, R, JavaScript, Toree
○ Python libs:
■ scikit, numpy, scipy, pandas, etc.
○ RStudio
○ R libs:
■ dplyr, magrittr, ggplot2, lme4, boot, etc.
● Mounts User NFS
● User has terminal access to file system via Jupyter for git, pip, etc.
Ad-Hoc Docker Image
Self Service Ad-hoc Infra: Flotilla
Jupyter Hub on Flotilla
RStudio on Flotilla
Browser Based Terminal on Flotilla
Flotilla Deployment
● Amazon ECS for cluster management.
● EC2 Instances:
○ Custom AMI based on the ECS-optimized Docker image.
● Runs in a single Auto Scale Group.
● S3-backed, self-hosted Artifactory as the Docker repository.
● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS!
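Not Flotilla’s actual code, but a sketch of what launching a self-service ad-hoc container on ECS can look like with boto3; the cluster name, task definition, and startedBy value are placeholders.

Python:
import boto3

ecs = boto3.client("ecs")

# Launch one ad-hoc notebook container for a Data Scientist on the shared cluster.
response = ecs.run_task(
    cluster="flotilla-adhoc",            # placeholder cluster name
    taskDefinition="adhoc-notebook:42",  # placeholder task definition (the ad-hoc Docker image)
    count=1,
    startedBy="data-scientist-jane",     # who requested it, for bookkeeping
)
print("Started:", response["tasks"][0]["taskArn"])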
● Docker tightly integrates with the Linux Kernel.
○ Hypothesis:
■ Anything that makes uninterruptible calls to the kernel can:
● Break the ECS agent because the container doesn’t respond.
● Break isolation between containers.
■ E.g. Mounting NFS
● Docker Hub:
○ Weren’t happy with performance
○ Switched to Artifactory
Docker Problems So Far
In Summary
● S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse.
● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
● Docker is used to provide a consistent environment for Data Scientists to use.
● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
In Summary - Reducing Contention
Fin; Thanks! Questions?
@stefkrawczyk
Try out Stitch Fix → stitchfix.com/referral/8406746
