Data Day Texas 2017: Scaling Data Science at Stitch Fix
At Stitch Fix we have a lot of Data Scientists. Around eighty at last count. One reason why I think we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they're end-to-end responsible for their work; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor, and debug everything and anything required to get the desired output. They're full data-stack data scientists!

The teams in the organization work on a variety of tasks:
- Clothing recommendations for clients.
- Clothes reordering recommendations.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP.
… and more!

They're also quite prolific at what they do: we are approaching 4500 job definitions at last count. So one might wonder, how have we enabled them to get their jobs done without getting in each other's way?

This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team tries to provide abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, where the Data Scientist is free to make their own decisions and thus choose the way they do their work, and the onus then falls on the Data Platform team to convince Data Scientists to use their tools; the easiest way to do that is by designing the tools well.

In regard to scaling Data Science, the Data Platform team has helped establish some patterns and infrastructure that help alleviate contention. Contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute (think prototype, iterate, workspace)
  - Production compute (think where things are executed once they're needed regularly)

For the talk (and this post) I focused only on how we reduced contention on Access to Data and Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.

  1. Scaling Data Science At Stitch Fix. Stefan Krawczyk, @stefkrawczyk, linkedin.com/in/skrawczyk. January 2017
  2. How many Data Scientists do you have?
  3. At Stitch Fix we have ~80
  4. Two Data Scientist facts: 1. Ability to spin up their own resources*. 2. End to end, they're responsible.
  5. But what do they do?
  6. What is Stitch Fix?
  7. ~4500 Job Definitions
  8. Lots of Compute & Data Movement!
  9. So how did we get to our scale?
  10. Reducing Contention
  11. Contention is Correlated with Unhappy Data Scientists & Burning Infrastructure
  12. Contention on: ● Access to Data ● Access to Compute Res.
  13. Contention on: ● Access to Data ● Access to Compute Res. ○ Ad-hoc ○ Production
  14. Focus of this talk: Contention on: ● Access to Data ● Access to Compute Res. ○ Ad-hoc ○ Production
  15. Fellow Collaborators: jeff, akshay, jacob, tarek, kurt, derek, patrick, thomas, steven, liz, alex. Horizontal team focused on Data Scientist Enablement.
  16. Data Access: Unhappy DS & Burning Infrastructure
  17. Data Access: ☹ DS & Infrastructure
  18. Data Access: ☹ DS & Infrastructure
  19. Data Access: ☹ DS & Infrastructure
      ● Can't write fast enough
  20. Data Access: ☹ DS & Infrastructure
      ● Can't write fast enough
      ● Can't read fast enough
  21. Data Access: ☹ DS & Infrastructure
      ● Can't write fast enough
      ● Can't read fast enough
      ● These two interact
  22. Data Access: ☹ DS & Infrastructure
      ● Can't write fast enough
      ● Can't read fast enough
      ● These two interact
      ● Not enough space
  23. Data Access: ☹ DS & Infrastructure
      ● Can't write fast enough
      ● Can't read fast enough
      ● These two interact
      ● Not enough space
      ● Limited by tools
  24. So how does Stitch Fix mitigate these problems?
  25. Data Access: S3 & Hive Metastore
  26. What is S3?
  27. What is S3?
      ● Amazon's Simple Storage Service.
      ● Infinite* storage.
      ● Looks like a file system*:
        ○ URIs: my.bucket/path/to/files/file.txt
      ● Can read, write, delete, BUT NOT append (or overwrite).
      ● Lots of companies rely on it -- famously Dropbox.
      (* For all intents and purposes)
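      For illustration, a minimal boto3 sketch of those S3 semantics (whole-object writes, reads, and deletes, with no append); the bucket and key names here are made up:

        import boto3

        s3 = boto3.client("s3")  # credentials come from the environment

        # Write: uploads are whole objects; "overwriting" just puts a new object at the key.
        s3.put_object(Bucket="my.bucket", Key="path/to/files/file.txt", Body=b"hello\n")

        # Read: downloads are whole objects too; there is no way to append in place.
        body = s3.get_object(Bucket="my.bucket", Key="path/to/files/file.txt")["Body"].read()

        # Delete: removes the object entirely.
        s3.delete_object(Bucket="my.bucket", Key="path/to/files/file.txt")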
  28. S3 @ Stitch Fix
      S3:
        Writing Data                   | Hard to Saturate
        Reading Data                   | Hard to Saturate
        Writing & Reading Interference | Haven't Experienced
        Space                          | "Infinite"
        Tooling                        | Lots of Options
      ● Data Scientists' main datastore since very early on.
      ● S3 essentially removes any real worries with respect to data contention!
  29. S3 is not a complete solution!
  30. What is the Hive Metastore?
  31. What is the Hive Metastore?
      ● Hadoop service that stores:
        ○ Schema
        ○ Partition information, e.g. date
        ○ Data location for a partition
  32. What is the Hive Metastore?
      ● Hadoop service that stores:
        ○ Schema
        ○ Partition information, e.g. date
        ○ Data location for a partition
      Hive Metastore: sold_items
        Partition | Location
        20161001  | s3://bucket/sold_items/20161001
        ...       | ...
        20161031  | s3://bucket/sold_items/20161031
  33. Hive Metastore @ Stitch Fix
      Brought in to:
      ● Bring centralized order to data being stored on S3
      ● Provide metadata to build more tooling on top of
      ● Enable use of existing open source solutions
  34. S3 + Hive Metastore
      ● Our central source of truth!
      ● Never have to worry about space.
      ● In exchange for immediate speed, you get consistent read & write performance.
        ○ "Contention Free"
      ● Decoupled data storage layer from data manipulation.
        ○ Very amenable to supporting a lot of different data sets and tools.
  35. Our Current Picture
  36. Caveat: Eventual Consistency
  37. Caveat: Eventual Consistency
      ● Replacing data in a partition
  38. Caveat: Eventual Consistency
      ● Replacing data in a partition
  39. Replacing a file on S3 (diagram: file A being replaced by file B)
  40. Replacing a file on S3
      ● S3 is eventually consistent*
      ● These bugs are hard to track down
      ● Need everyone to be able to trust the data.
      (* for existing files)
  41. Avoiding Eventual Consistency
      ● Use Hive Metastore to easily control partition source of truth
      ● Principles:
        ○ Never delete
        ○ Always write to a new place each time a partition changes
      ● What do we mean by "new place"?
        ○ Use an inner directory → called Batch ID
  42. Batch ID Pattern
  43. Batch ID Pattern
      sold_items
        Date     | Location
        20161001 | s3://bucket/sold_items/20161001/20161002002334/
        ...      | ...
        20161031 | s3://bucket/sold_items/20161031/20161101002256/
  44. Batch ID Pattern
      ● Overwriting a partition is just a matter of updating the location
      sold_items
        Date     | Location
        20161001 | s3://bucket/sold_items/20161001/20161002002334/
        ...      | ...
        20161031 | s3://bucket/sold_items/20161031/20161101002256/ → s3://bucket/sold_items/20161031/20161102234252/
  45. Batch ID Pattern
      ● Overwriting a partition is just a matter of updating the location
      ● To the user this is a hidden inner directory
      sold_items
        Date     | Location
        20161001 | s3://bucket/sold_items/20161001/20161002002334/
        ...      | ...
        20161031 | s3://bucket/sold_items/20161031/20161101002256/ → s3://bucket/sold_items/20161031/20161102234252/
  46. Batch ID Pattern Benefits
      ● Avoids eventual consistency issue
      ● Jobs finish on the data they started on
      ● Full partition history:
        ○ Can rollback
          ■ Data Scientists are less afraid of mistakes
        ○ Can create audit trails more easily
          ■ What data changed and when
        ○ Can anchor downstream consumers to a particular batch ID
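      To make the pattern concrete, here is a minimal sketch of how a Batch-ID-style location could be generated; the function and bucket names are illustrative, not Stitch Fix's internal API:

        from datetime import datetime, timezone

        def new_batch_location(bucket, table, date_partition):
            """Build a fresh, never-reused S3 prefix for one partition write."""
            batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
            return f"s3://{bucket}/{table}/{date_partition}/{batch_id}/"

        # Writers always write into a brand-new inner directory ...
        location = new_batch_location("bucket", "sold_items", "20161031")
        # ... then repoint the Hive Metastore partition (date=20161031) at `location`.
        # Readers only ever see whatever location the metastore currently holds,
        # so they never observe a half-replaced partition.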
  47. Data Access: Tooling Integration
  48. Recall
  49. Recall (diagram of the tools around S3 + Hive Metastore, annotated with question marks)
  50. Data Access: Tooling Integration
      1. Enforcing Batch IDs
      2. File Formats
      3. Schemas for all Tools
      4. Schema Evolution
      5. Redshift
      6. Spark
  51. 1. Enforcing Batch IDs
      ● Problem:
        ○ How do you enforce remembering to add a Batch ID into your S3 path?
  52. 1. Enforcing Batch IDs
      ● Problem:
        ○ How do you enforce remembering to add a Batch ID into your S3 path?
      ● Solution:
        ○ By building APIs
          ■ For all tooling!
  53. 1. Enforcing Batch IDs via an API
  54. 1. Enforcing Batch IDs via an API
  55. 1. Enforcing Batch IDs: APIs for DS
      Python:
        store_dataframe(df, dest_db, dest_table, partitions=['2016'])
        df = load_dataframe(src_db, src_table, partitions=['2016'])
      R:
        sf_writer(data = result, namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE)))
        df <- sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE)))
  56. 1. Enforcing Batch IDs: APIs for DS
      Tool     | Reading from S3+HM    | Writing to S3+HM
      Python   | Internal API          | Internal API
      R        | Internal API          | Internal API
      Spark    | Standard API          | Internal API
      PySpark  | Standard API          | Internal API
      Presto   | Standard API          | N/A
      Redshift | Load via Internal API | N/A
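      A rough sketch of what a store_dataframe-style wrapper could do to enforce the convention for a pandas DataFrame; this is an assumption about the shape of such an API, not Stitch Fix's actual implementation:

        import gzip
        from datetime import datetime, timezone

        import boto3

        def store_dataframe(df, dest_db, dest_table, partitions, bucket="warehouse-bucket"):
            """Hypothetical write API: every write lands under a fresh Batch ID,
            so callers cannot forget to add one."""
            s3 = boto3.client("s3")
            batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
            for partition in partitions:
                prefix = f"{dest_db}/{dest_table}/{partition}/{batch_id}/"
                payload = gzip.compress(df.to_csv(index=False, header=False).encode("utf-8"))
                s3.put_object(Bucket=bucket, Key=prefix + "part-00000.gz", Body=payload)
                # Final step (not shown): repoint the Hive Metastore partition at
                # s3://{bucket}/{prefix} so readers pick up the new Batch ID.
            return batch_id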
  57. 2. File Format
      ● Problem:
        ○ What format do you use to work with all the tools?
  58. 2. File Format
      ● Problem:
        ○ What format do you use to work with all the tools?
      ● Possible solutions:
        ○ Parquet
        ○ Some simple format {JSON, Delimited File} + gzip
        ○ Avro, Thrift, Protobuffers
  59. 2. File Format
      ● Problem:
        ○ What format do you use to work with all the tools?
      ● Possible solutions:
        ○ Parquet
        ○ Some simple format {JSON, Delimited File} + gzip
        ○ Avro, Thrift, Protobuffers
      ● Philosophy: minimize operational burden:
        ○ Choose `\0`, i.e. null-delimited, gzipped files
          ■ Easy to write an API for this, for all tools.
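      To show why this format is easy to support from any tool, a small plain-Python sketch of writing and reading null-delimited, gzipped rows (file name and columns are made up):

        import gzip

        rows = [("2016-10-31", "12345", "19.99"), ("2016-10-31", "67890", "42.00")]

        # Write: one record per line, fields separated by the NUL byte, gzip-compressed.
        with gzip.open("sold_items.gz", "wt", encoding="utf-8", newline="\n") as f:
            for row in rows:
                f.write("\x00".join(row) + "\n")

        # Read: any language with gzip support and string splitting can parse this back.
        with gzip.open("sold_items.gz", "rt", encoding="utf-8") as f:
            parsed = [line.rstrip("\n").split("\x00") for line in f]

        assert parsed == [list(r) for r in rows]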
  60. 3. Schemas for all Tools
      ● Problem:
        ○ Can't necessarily have a single schema for all tools
          ■ E.g. different type definitions.
  61. 3. Schemas for all Tools
      ● Problem:
        ○ Can't necessarily have a single schema for all tools
          ■ E.g. different type definitions.
      ● Solution:
        ○ Define parallel schemas that have specific types redefined in the Hive Metastore
          ■ E.g. can redefine the decimal type to be double for Presto*.
          ■ This parallel schema would be named prod_presto.
        ○ Still points to the same underlying data.
      (* It didn't use to have functioning decimal support)
  62. 4. Schema Evolution
      ● Problem:
        ○ How do you handle schema evolution with 80+ Data Scientists?
          ■ E.g. add a new column, delete an old column
  63. 4. Schema Evolution
      ● Problem:
        ○ How do you handle schema evolution with 80+ Data Scientists?
          ■ E.g. add a new column, delete an old column
      ● Solution:
        ○ Append columns to the end of schemas.
        ○ Rename columns as deprecated -- breaks code, but not data.
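      A small pandas illustration (with made-up column names) of why appending columns at the end keeps older files readable: rows written before the evolution simply come back with missing values.

        import pandas as pd

        # Schema as currently registered in the metastore; the new column was appended last.
        schema = ["item_id", "price", "discount_pct"]  # discount_pct added later

        # A file written before the evolution only has the first two fields per record.
        old_records = [["12345", "19.99"], ["67890", "42.00"]]

        df = pd.DataFrame(old_records)
        df = df.reindex(columns=range(len(schema)))  # pad the missing trailing column with NaN
        df.columns = schema

        print(df)  # existing columns keep their positions; discount_pct is NaN for old rows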
  64. 5. Redshift
      ● Wait, what? Redshift?
  65. 5. Redshift
      ● Wait, what? Redshift?
        ○ Predates use of Spark & Presto
        ○ Redshift was brought in to help with joining data
          ■ Previously DS had to load data & perform joins in R/Python
        ○ Data Scientists loved Redshift too much:
          ■ It became a huge source of contention
          ■ Have been migrating "production" off of it
  66. 5. Redshift
      ● Need:
        ○ Still want to use Redshift for ad-hoc analysis
      ● Problem:
        ○ How do we keep data on S3 in sync with Redshift?
  67. 5. Redshift
      ● Need:
        ○ Still want to use Redshift for ad-hoc analysis
      ● Problem:
        ○ How do we keep data on S3 in sync with Redshift?
      ● Solution:
        ○ API that abstracts syncing data with Redshift
          ■ Keeps schemas in sync
          ■ Uses the standard data warehouse staged-table insertion pattern
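      A rough sketch of the staged-table insertion pattern such a sync API might follow; the table name, partition column (batch_date), and IAM role handling are placeholders:

        import psycopg2  # any Redshift-compatible Postgres driver works the same way

        def sync_partition_to_redshift(conn, table, s3_location, iam_role):
            """Load one partition from S3 into Redshift via a staging table, then swap it in."""
            with conn.cursor() as cur:
                # 1. Stage: load the new data into an empty copy of the target table.
                cur.execute(f"CREATE TEMP TABLE {table}_staging (LIKE {table});")
                cur.execute(f"COPY {table}_staging FROM %s IAM_ROLE %s GZIP;",
                            (s3_location, iam_role))
                # 2. Swap within one transaction: replace the partition's old rows with
                #    the staged ones, so readers never see a half-loaded partition.
                cur.execute(f"DELETE FROM {table} USING {table}_staging "
                            f"WHERE {table}.batch_date = {table}_staging.batch_date;")
                cur.execute(f"INSERT INTO {table} SELECT * FROM {table}_staging;")
            conn.commit()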
  68. 6. Spark
      ● What does our integration with Spark look like?
  69. 6. Spark
      ● What does our integration with Spark look like?
        ○ Running on Amazon EMR using Netflix's Genie
          ■ Prod & Dev clusters
        ○ S3 still source of truth
          ■ Have a custom write API:
            ● Enforces Batch IDs
            ● Scala-based library making use of EMRFS
            ● Also exposed in Python for PySpark use
        ○ Heavy users of Spark SQL
        ○ It's the main production workhorse
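      In PySpark terms, a write through such an API could look roughly like this; the database, table, bucket, and format choices below are illustrative, not the internal Scala library:

        from datetime import datetime, timezone

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()

        df = spark.table("staging.sold_items_today")  # placeholder source table

        # Write the partition's data under a brand-new Batch ID prefix on S3.
        batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        location = f"s3://bucket/sold_items/20161031/{batch_id}/"
        df.write.option("compression", "gzip").csv(location)

        # Repoint the Hive Metastore partition at the new location (assuming the
        # partition already exists; otherwise ADD PARTITION first).
        spark.sql(f"ALTER TABLE prod.sold_items PARTITION (date='20161031') "
                  f"SET LOCATION '{location}'")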
  70. Ad-hoc Compute Access: Using Docker
  71. Data Scientist's Ad-hoc workflow
  72. Data Scientist's Ad-hoc workflow
      ● The faster this iteration cycle, the faster Data Scientists can work
  73. Data Scientist's Ad-hoc workflow
      ● The faster this iteration cycle, the faster Data Scientists can work
      ● Scaling this part
  74. Ad hoc Infra: Options
      Workstation | Env. Mgmt. | Contention Points
      Laptop      | Low        | Memory & CPU
      ?           | Medium     | Medium
      ?           | High       | High
  75. Ad hoc Infra: Options
      Workstation      | Env. Mgmt. | Contention Points
      Laptop           | Low        | Memory & CPU
      Shared Instances | Medium     | Isolation
      ?                | High       | High
  76. Ad hoc Infra: Options
      Workstation          | Env. Mgmt. | Contention Points
      Laptop               | Low        | Memory & CPU
      Shared Instances     | Medium     | Isolation
      Individual Instances | High       | Time & Money
  77. Ad hoc Infra: Options
      Workstation          | Env. Mgmt. | Contention Points
      Laptop               | Low        | Memory & CPU
      Shared Instances     | Medium     | Isolation
      Individual Instances | Low        | Time & Money
  78. Why Docker?
      ● Control of environment
        ○ Data Scientists don't need to worry about env.
      ● Isolation
        ○ Can host many docker containers on a single machine.
      ● Better host management
        ○ Allowing central control of machine types.
  79. Ad-Hoc Docker Image
      ● Has:
        ○ Our internal API libraries
        ○ Jupyter Hub Notebooks:
          ■ PySpark, IPython, R, JavaScript, Toree
        ○ Python libs:
          ■ scikit, numpy, scipy, pandas, etc.
        ○ RStudio
        ○ R libs:
          ■ dplyr, magrittr, ggplot2, lme4, boot, etc.
      ● Mounts User NFS
      ● User has terminal access to the file system via Jupyter for git, pip, etc.
  80. Self Service Ad-hoc Infra: Flotilla
  81. Jupyter Hub on Flotilla
  82. RStudio on Flotilla
  83. Browser Based Terminal on Flotilla
  84. Flotilla Deployment
      ● Amazon ECS for cluster management.
      ● EC2 Instances:
        ○ Custom AMI based on the ECS-optimized docker image.
      ● Runs in a single Auto Scaling Group.
      ● S3-backed self-hosted Artifactory as the docker repository.
      ● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS!
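      For a flavor of what a self-service launch on ECS involves, a minimal boto3 sketch; the cluster and task definition names are hypothetical, not Flotilla's actual configuration:

        import boto3

        ecs = boto3.client("ecs")

        # Launch one ad-hoc container (e.g. a Jupyter Hub image) for a Data Scientist.
        response = ecs.run_task(
            cluster="flotilla-adhoc",         # hypothetical ECS cluster name
            taskDefinition="ds-notebook:42",  # hypothetical task definition + revision
            count=1,
            startedBy="data-scientist-jane",  # useful for auditing who launched what
        )

        task_arn = response["tasks"][0]["taskArn"]
        print("Launched ad-hoc environment:", task_arn)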
  85. Flotilla Deployment
  86. Flotilla Deployment
  87. Flotilla Deployment
  88. Docker Problems So Far
      ● Docker tightly integrates with the Linux Kernel.
        ○ Hypothesis:
          ■ Anything that makes uninterruptible calls to the kernel can:
            ● Break the ECS agent because the container doesn't respond.
            ● Break isolation between containers.
          ■ E.g. mounting NFS
      ● Docker Hub:
        ○ Weren't happy with performance
        ○ Switched to Artifactory
  89. In Summary
  90. In Summary - Reducing Contention
      ● S3 + Hive Metastore is Stitch Fix's very scalable data warehouse.
      ● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
      ● Docker is used to provide a consistent environment for Data Scientists to use.
      ● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
  91. Fin; Thanks! Questions? @stefkrawczyk. Try out Stitch Fix → stitchfix.com/referral/8406746
