SlideShare a Scribd company logo
1 of 98
Download to read offline
Scaling Data Science
At Stitch Fix
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
January 2017
How many
Data Scientists do you have?
At Stitch Fix we have ~80
Two Data Scientist facts:
1. Ability to spin up their own
resources*.
2. End to end,
they’re responsible.
But what do they do?
What is Stitch Fix?
~4500 Job Definitions
Lots of Compute &
Data Movement!
So how did we get to our scale?
Reducing Contention
&
Unhappy Data Scientists Burning Infrastructure
Contention is Correlated with
Contention on:
● Access to Data
● Access to Compute Res.
Contention on:
● Access to Data
● Access to Compute Res.
○ Ad-hoc
○ Production
Contention on:
● Access to Data
● Access to Compute Res.
○ Ad-hoc
○ Production
Focus of this talk:
Fellow Collaborators
jeff akshay jacob
tarek
kurt derek
patrick
thomas
Horizontal team focused on Data Scientist Enablement
steven liz alex
Data Access:
Unhappy DS &
Burning Infrastructure
Data Access: ☹ DS & Infrastructure
Data Access: ☹ DS & Infrastructure
Data Access: ☹ DS & Infrastructure
Can’t write fast enough
Data Access: ☹ DS & Infrastructure
Can’t write fast enough
Can’t read fast enough
Data Access: ☹ DS & Infrastructure
Can’t write fast enough
Can’t read fast enough
These two interact
Data Access: ☹ DS & Infrastructure
Can’t write fast enough
Can’t read fast enough
These two interact
Not enough space
Data Access: ☹ DS & Infrastructure
Can’t write fast enough
Can’t read fast enough
These two interact
Not enough space
Limited by
tools
So how does Stitch Fix
mitigate these problems?
Data Access:
S3 & Hive Metastore
What is S3?
● Amazon’s Simple Storage Service.
● Infinite* storage.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Can read, write, delete, BUT NOT append (or overwrite).
● Lots of companies rely on it -- famously Dropbox.
What is S3?
* For all intents and purposes
S3 @ Stitch Fix
S3
Writing Data Hard to Saturate
Reading Data Hard to Saturate
Writing & Reading
Interference
Haven’t Experienced
Space “Infinite”
Tooling Lots of Options
● Data Scientists’ main datastore since very early on.
● S3 essentially removes any real worries with respect to data contention!
S3 is not a complete solution!
What is the Hive Metastore?
● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
What is the Hive Metastore?
● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
Hive Metastore:
What is the Hive Metastore?
Partition Location
20161001 s3://bucket/sold_items/20161001
...
20161031 s3://bucket/sold_items/20161031
sold_items
Hive Metastore @ Stitch Fix
Brought into:
● Bring centralized order to data being stored on S3
● Provide metadata to build more tooling on top of
● Enable use of existing open source solutions
● Our central source of truth!
● Never have to worry about space.
● Trading for immediate speed, you have consistent read & write performance.
○ “Contention Free”
● Decoupled data storage layer from data manipulation.
○ Very amenable to supporting a lot of different data sets and tools.
S3 + Hive Metastore
Our Current Picture
Caveat: Eventual Consistency
● Replacing data in a partition
Caveat: Eventual Consistency
● Replacing data in a partition
Caveat: Eventual Consistency
Replacing a file on S3
B
A
Replacing a file on S3
● S3 is eventually
consistent*
● These bugs are hard
to track down
● Need everyone to be
able to trust the data.
A
B
* for existing files
● Use Hive Metastore to easily control partition source of truth
● Principles:
○ Never delete
○ Always write to a new place each time a partition changes
● What do we mean by “new place”?
○ Use an inner directory → called Batch ID
Avoiding Eventual Consistency
Batch ID Pattern
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/
20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/
20161031/20161101002256/
sold_items
● Overwriting a partition is just a matter of updating the location
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/
20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/
20161031/20161101002256/
s3://bucket/sold_items/
20161031/20161102234252/
sold_items
● Overwriting a partition is just a matter of updating the location
● To the user this is a hidden inner directory
Batch ID Pattern
Date Location
20161001 s3://bucket/sold_items/
20161001/20161002002334/
... ...
20161031 s3://bucket/sold_items/
20161031/20161101002256/
s3://bucket/sold_items/
20161031/20161102234252/
sold_items
● Avoids eventual consistency issue
● Jobs finish on the data they started on
● Full partition history:
○ Can rollback
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily
■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
Batch ID Pattern Benefits
Data Access:
Tooling Integration
Recall
Recall
?
?
?
?
?
?
Data Access:
Tooling Integration
1. Enforcing Batch IDs
2. File Formats
3. Schemas for all Tools
4. Schema Evolution
5. Redshift
6. Spark
● Problem:
○ How do you enforce remembering to add a Batch ID into your S3 path?
1. Enforcing Batch IDs
● Problem:
○ How do you enforce remembering to add a Batch ID into your S3 path?
● Solution:
○ By building APIs
■ For all tooling!
1. Enforcing Batch IDs
1. Enforcing Batch IDs via an API
1. Enforcing Batch IDs via an API
Python:
store_dataframe(df, dest_db, dest_table, partitions=[‘2016’])
df = load_dataframe(src_db, src_table, partitions=[‘2016’])
R:
sf_writer(data = result,
namespace = dest_db,
resource = dest_table,
partitions = c(as.integer(opt$ETL_DATE)))
df <- sf_reader(namespace = src_db,
resource = src_table,
partitions = c(as.integer(opt$ETL_DATE)))
1. Enforcing Batch IDs: APIs for DS
1. Enforcing Batch IDs: APIs for DS
Tool Reading From S3+HM Writing to S3+HM
Python Internal API Internal API
R Internal API Internal API
Spark Standard API Internal API
PySpark Standard API Internal API
Presto Standard API N/A
Redshift Load via Internal API N/A
● Problem:
○ What format do you use to work with all the tools?
2. File Format
● Problem:
○ What format do you use to work with all the tools?
● Possible solutions:
○ Parquet
○ Some simple format {JSON, Delimited File} + gzip
○ Avro, Thrift, Protobuffers
2. File Format
● Problem:
○ What format do you use to work with all the tools?
● Possible solutions:
○ Parquet
○ Some simple format {JSON, Delimited File} + gzip
○ Avro, Thrift, Protobuffers
● Philosophy: minimize for operational burden:
○ Choose `0`, i.e. null delimited, gzipped files
■ Easy to write an API for this, for all tools.
2. File Format
● Problem:
○ Can’t necessarily have a single schema for all tools
■ E.g.
● Different type definitions.
3. Schemas for all Tools
● Problem:
○ Can’t necessarily have a single schema for all tools
■ E.g.
● Different type definitions.
● Solution:
○ Define parallel schemas, that have specific types redefined in Hive
Metastore
■ E.g.
● Can redefine decimal type to be double for Presto*.
● This parallel schema would be named prod_presto.
○ Still points to same underlying data.
3. Schemas for all Tools
* It didn’t use to have functioning decimal support
● Problem:
○ How do you handle schema evolution with 80+ Data Scientists?
■ E.g.
● Add a new column
● Delete an old column
4. Schema Evolution
● Problem:
○ How do you handle schema evolution with 80+ Data Scientists?
■ E.g.
● Add a new column
● Delete an old column
● Solution:
○ Append columns to end of schemas.
○ Rename columns as deprecated -- breaks code, but not data.
4. Schema Evolution
● Wait, what? Redshift?
5. Redshift
● Wait, what? Redshift?
○ Predates use of Spark & Presto
○ Redshift was brought in to help joining data
■ Previously DS had to load data & perform joins in R/Python
○ Data Scientists loved Redshift too much:
■ It became a huge source of contention
■ Have been migrating “production” off of it
5. Redshift
● Need:
○ Still want to use Redshift for ad-hoc analysis
● Problem:
○ How do we keep data on S3 in sync with Redshift?
5. Redshift
● Need:
○ Still want to use Redshift for ad-hoc analysis
● Problem:
○ How do we keep data on S3 in sync with Redshift?
● Solution:
○ API that abstracts syncing data with Redshift
■ Keeps schemas in sync
■ Uses standard data warehouse staged table insertion pattern
5. Redshift
● What does our integration with Spark look like?
6. Spark
● What does our integration with Spark look like?
○ Running on Amazon EMR using Netflix's Genie
■ Prod & Dev clusters
○ S3 still source of truth
■ Have custom write API:
● Enforces Batch IDs
● Scala based library making use of EMRFS
● Also exposed in Python for PySpark use
○ Heavy users of Spark SQL
○ It’s the main production workhorse
6. Spark
Ad-hoc
Compute Access:
Using Docker
Data Scientist’s Ad-hoc workflow
Data Scientist’s Ad-hoc workflow
The faster this iteration cycle, the faster Data Scientists can work
Data Scientist’s Ad-hoc workflow
Scaling this part
The faster this iteration cycle, the faster Data Scientists can work
Workstation Env. Mgmt. Contention Points
Low Memory & CPU
Medium Medium
High High
Ad hoc Infra: Options
Laptop
Workstation Env. Mgmt. Contention Points
Low Memory & CPU
Medium Isolation
High High
Ad hoc Infra: Options
Laptop
Shared
Instances
Workstation Env. Mgmt. Contention Points
Low Memory & CPU
Medium Isolation
High Time & Money
Ad hoc Infra: Options
Laptop
Shared
Instances
Individual
Instances
Workstation Env. Mgmt. Contention Points
Low Memory & CPU
Medium Isolation
Low Time & Money
Ad hoc Infra: Options
Laptop
Shared
Instances
Individual
Instances
● Control of environment
○ Data Scientists don’t need to worry about env.
● Isolation
○ can host many docker containers on a single machine.
● Better host management
○ allowing central control of machine types.
Why Docker?
● Has:
○ Our internal API libraries
○ Jupyter Hub Notebooks:
■ Pyspark, IPython, R, Javscript, Toree
○ Python libs:
■ scikit, numpy, scipy, pandas, etc.
○ RStudio
○ R libs:
■ Dplyr, magrittr, ggplot2, lme4, BOOT, etc.
● Mounts User NFS
● User has terminal access to file system via Jupyter for git, pip, etc.
Ad-Hoc Docker Image
Self Service Ad-hoc Infra: Flotilla
Jupyter Hub on Flotilla
RStudio on Flotilla
Browser Based Terminal on Flotilla
Flotilla Deployment
● Amazon ECS for cluster management.
● EC2 Instances:
○ Custom AMI based on ECS optimized docker image.
● Runs in a single Auto Scale Group.
● S3 backed self-hosted Artifactory as docker repository.
● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS!
Flotilla Deployment
Flotilla Deployment
Flotilla Deployment
● Docker tightly integrates with the Linux Kernel.
○ Hypothesis:
■ Anything that makes uninterruptable calls to the kernel can:
● Break the ECS agent because the container doesn’t respond.
● Break isolation between containers.
■ E.g. Mounting NFS
● Docker Hub:
○ Weren’t happy with performance
○ Switched to artifactory
Docker Problems So Far
In Summary
● S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse.
● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
● Docker is used to provide a consistent environment for Data Scientists to use.
● Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
In Summary - Reducing Contention
Fin; Thanks! Questions?
@stefkrawczyk
Try out Stitch Fix → stitchfix.com/referral/8406746

More Related Content

What's hot

Building Your Data Streams for all the IoT
Building Your Data Streams for all the IoTBuilding Your Data Streams for all the IoT
Building Your Data Streams for all the IoTDevOps.com
 
An Ambitious Wikidata Tutorial
An Ambitious Wikidata TutorialAn Ambitious Wikidata Tutorial
An Ambitious Wikidata Tutorial_Emw
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopDataWorks Summit
 
Open World Robotics
Open World RoboticsOpen World Robotics
Open World RoboticsVaticle
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
From SKOS over SKOS-XL to Custom Ontologies
From SKOS over SKOS-XL to Custom OntologiesFrom SKOS over SKOS-XL to Custom Ontologies
From SKOS over SKOS-XL to Custom OntologiesSemantic Web Company
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaChris Bailey
 
On the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + RESTOn the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + RESTTim Donohue
 
Or2019 DSpace 7 Enhanced submission &amp; workflow
Or2019 DSpace 7 Enhanced submission &amp; workflowOr2019 DSpace 7 Enhanced submission &amp; workflow
Or2019 DSpace 7 Enhanced submission &amp; workflow4Science
 
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from RealityBuilding an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from RealityJoshua Shinavier
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchNeo4j
 
IndexedDB - An Efficient Way to Manage Data
IndexedDB - An Efficient Way to Manage DataIndexedDB - An Efficient Way to Manage Data
IndexedDB - An Efficient Way to Manage Datasara stanford
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...Heiko Paulheim
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetesconfluent
 
FIWARE Training: IoT and Legacy
FIWARE Training: IoT and LegacyFIWARE Training: IoT and Legacy
FIWARE Training: IoT and LegacyFIWARE
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata StorageDataWorks Summit/Hadoop Summit
 

What's hot (20)

Empirical Semantics
Empirical SemanticsEmpirical Semantics
Empirical Semantics
 
Building Your Data Streams for all the IoT
Building Your Data Streams for all the IoTBuilding Your Data Streams for all the IoT
Building Your Data Streams for all the IoT
 
An Ambitious Wikidata Tutorial
An Ambitious Wikidata TutorialAn Ambitious Wikidata Tutorial
An Ambitious Wikidata Tutorial
 
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with HadoopBig Data Business Wins: Real-time Inventory Tracking with Hadoop
Big Data Business Wins: Real-time Inventory Tracking with Hadoop
 
Open World Robotics
Open World RoboticsOpen World Robotics
Open World Robotics
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
From SKOS over SKOS-XL to Custom Ontologies
From SKOS over SKOS-XL to Custom OntologiesFrom SKOS over SKOS-XL to Custom Ontologies
From SKOS over SKOS-XL to Custom Ontologies
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
 
On the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + RESTOn the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + REST
 
Or2019 DSpace 7 Enhanced submission &amp; workflow
Or2019 DSpace 7 Enhanced submission &amp; workflowOr2019 DSpace 7 Enhanced submission &amp; workflow
Or2019 DSpace 7 Enhanced submission &amp; workflow
 
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from RealityBuilding an Enterprise Knowledge Graph @Uber: Lessons from Reality
Building an Enterprise Knowledge Graph @Uber: Lessons from Reality
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
 
IndexedDB - An Efficient Way to Manage Data
IndexedDB - An Efficient Way to Manage DataIndexedDB - An Efficient Way to Manage Data
IndexedDB - An Efficient Way to Manage Data
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
Deploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and KubernetesDeploying Kafka Streams Applications with Docker and Kubernetes
Deploying Kafka Streams Applications with Docker and Kubernetes
 
FIWARE Training: IoT and Legacy
FIWARE Training: IoT and LegacyFIWARE Training: IoT and Legacy
FIWARE Training: IoT and Legacy
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
 

Similar to Data Day Texas 2017: Scaling Data Science at Stitch Fix

Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixStefan Krawczyk
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixStitch Fix Algorithms
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
How to build data accessibility for everyone
How to build data accessibility for everyoneHow to build data accessibility for everyone
How to build data accessibility for everyoneKaren Hsieh
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Databricks
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + CassandraApache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + CassandraAnant Corporation
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data PlatformDani Solà Lagares
 
Choosing data warehouse considerations
Choosing data warehouse considerationsChoosing data warehouse considerations
Choosing data warehouse considerationsAseem Bansal
 
Serverless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportServerless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportMetosin Oy
 

Similar to Data Day Texas 2017: Scaling Data Science at Stitch Fix (20)

Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
How to build data accessibility for everyone
How to build data accessibility for everyoneHow to build data accessibility for everyone
How to build data accessibility for everyone
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + CassandraApache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
 
Choosing data warehouse considerations
Choosing data warehouse considerationsChoosing data warehouse considerations
Choosing data warehouse considerations
 
Serverless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience reportServerless Clojure and ML prototyping: an experience report
Serverless Clojure and ML prototyping: an experience report
 

Recently uploaded

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Data Day Texas 2017: Scaling Data Science at Stitch Fix

  • 1. Scaling Data Science At Stitch Fix Stefan Krawczyk @stefkrawczyk linkedin.com/in/skrawczyk January 2017
  • 3. At Stitch Fix we have ~80
  • 4. Two Data Scientist facts: 1. Ability to spin up their own resources*. 2. End to end, they’re responsible.
  • 5. But what do they do?
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 15. Lots of Compute & Data Movement!
  • 16. So how did we get to our scale?
  • 18. & Unhappy Data Scientists Burning Infrastructure Contention is Correlated with
  • 19. Contention on: ● Access to Data ● Access to Compute Res.
  • 20. Contention on: ● Access to Data ● Access to Compute Res. ○ Ad-hoc ○ Production
  • 21. Contention on: ● Access to Data ● Access to Compute Res. ○ Ad-hoc ○ Production Focus of this talk:
  • 22. Fellow Collaborators jeff akshay jacob tarek kurt derek patrick thomas Horizontal team focused on Data Scientist Enablement steven liz alex
  • 23. Data Access: Unhappy DS & Burning Infrastructure
  • 24. Data Access: ☹ DS & Infrastructure
  • 25. Data Access: ☹ DS & Infrastructure
  • 26. Data Access: ☹ DS & Infrastructure Can’t write fast enough
  • 27. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough
  • 28. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough These two interact
  • 29. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough These two interact Not enough space
  • 30. Data Access: ☹ DS & Infrastructure Can’t write fast enough Can’t read fast enough These two interact Not enough space Limited by tools
  • 31. So how does Stitch Fix mitigate these problems?
  • 32. Data Access: S3 & Hive Metastore
  • 34. ● Amazon’s Simple Storage Service. ● Infinite* storage. ● Looks like a file system*: ○ URIs: my.bucket/path/to/files/file.txt ● Can read, write, delete, BUT NOT append (or overwrite). ● Lots of companies rely on it -- famously Dropbox. What is S3? * For all intents and purposes
  • 35. S3 @ Stitch Fix S3 Writing Data Hard to Saturate Reading Data Hard to Saturate Writing & Reading Interference Haven’t Experienced Space “Infinite” Tooling Lots of Options ● Data Scientists’ main datastore since very early on. ● S3 essentially removes any real worries with respect to data contention!
  • 36. S3 is not a complete solution!
  • 37. What is the Hive Metastore?
  • 38. ● Hadoop service, that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition What is the Hive Metastore?
  • 39. ● Hadoop service, that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition Hive Metastore: What is the Hive Metastore? Partition Location 20161001 s3://bucket/sold_items/20161001 ... 20161031 s3://bucket/sold_items/20161031 sold_items
  • 40. Hive Metastore @ Stitch Fix Brought into: ● Bring centralized order to data being stored on S3 ● Provide metadata to build more tooling on top of ● Enable use of existing open source solutions
  • 41. ● Our central source of truth! ● Never have to worry about space. ● Trading for immediate speed, you have consistent read & write performance. ○ “Contention Free” ● Decoupled data storage layer from data manipulation. ○ Very amenable to supporting a lot of different data sets and tools. S3 + Hive Metastore
  • 44. ● Replacing data in a partition Caveat: Eventual Consistency
  • 45. ● Replacing data in a partition Caveat: Eventual Consistency
  • 46. Replacing a file on S3 B A
  • 47. Replacing a file on S3 ● S3 is eventually consistent* ● These bugs are hard to track down ● Need everyone to be able to trust the data. A B * for existing files
  • 48. ● Use Hive Metastore to easily control partition source of truth ● Principles: ○ Never delete ○ Always write to a new place each time a partition changes ● What do we mean by “new place”? ○ Use an inner directory → called Batch ID Avoiding Eventual Consistency
  • 50. Batch ID Pattern Date Location 20161001 s3://bucket/sold_items/ 20161001/20161002002334/ ... ... 20161031 s3://bucket/sold_items/ 20161031/20161101002256/ sold_items
  • 51. ● Overwriting a partition is just a matter of updating the location Batch ID Pattern Date Location 20161001 s3://bucket/sold_items/ 20161001/20161002002334/ ... ... 20161031 s3://bucket/sold_items/ 20161031/20161101002256/ s3://bucket/sold_items/ 20161031/20161102234252/ sold_items
  • 52. ● Overwriting a partition is just a matter of updating the location ● To the user this is a hidden inner directory Batch ID Pattern Date Location 20161001 s3://bucket/sold_items/ 20161001/20161002002334/ ... ... 20161031 s3://bucket/sold_items/ 20161031/20161101002256/ s3://bucket/sold_items/ 20161031/20161102234252/ sold_items
  • 53. ● Avoids eventual consistency issue ● Jobs finish on the data they started on ● Full partition history: ○ Can rollback ■ Data Scientists are less afraid of mistakes ○ Can create audit trails more easily ■ What data changed and when ○ Can anchor downstream consumers to a particular batch ID Batch ID Pattern Benefits
  • 57. Data Access: Tooling Integration 1. Enforcing Batch IDs 2. File Formats 3. Schemas for all Tools 4. Schema Evolution 5. Redshift 6. Spark
  • 58. ● Problem: ○ How do you enforce remembering to add a Batch ID into your S3 path? 1. Enforcing Batch IDs
  • 59. ● Problem: ○ How do you enforce remembering to add a Batch ID into your S3 path? ● Solution: ○ By building APIs ■ For all tooling! 1. Enforcing Batch IDs
  • 60. 1. Enforcing Batch IDs via an API
  • 61. 1. Enforcing Batch IDs via an API
  • 62. Python: store_dataframe(df, dest_db, dest_table, partitions=[‘2016’]) df = load_dataframe(src_db, src_table, partitions=[‘2016’]) R: sf_writer(data = result, namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE))) df <- sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE))) 1. Enforcing Batch IDs: APIs for DS
  • 63. 1. Enforcing Batch IDs: APIs for DS Tool Reading From S3+HM Writing to S3+HM Python Internal API Internal API R Internal API Internal API Spark Standard API Internal API PySpark Standard API Internal API Presto Standard API N/A Redshift Load via Internal API N/A
  • 64. ● Problem: ○ What format do you use to work with all the tools? 2. File Format
  • 65. ● Problem: ○ What format do you use to work with all the tools? ● Possible solutions: ○ Parquet ○ Some simple format {JSON, Delimited File} + gzip ○ Avro, Thrift, Protobuffers 2. File Format
  • 66. ● Problem: ○ What format do you use to work with all the tools? ● Possible solutions: ○ Parquet ○ Some simple format {JSON, Delimited File} + gzip ○ Avro, Thrift, Protobuffers ● Philosophy: minimize for operational burden: ○ Choose `0`, i.e. null delimited, gzipped files ■ Easy to write an API for this, for all tools. 2. File Format
  • 67. ● Problem: ○ Can’t necessarily have a single schema for all tools ■ E.g. ● Different type definitions. 3. Schemas for all Tools
  • 68. ● Problem: ○ Can’t necessarily have a single schema for all tools ■ E.g. ● Different type definitions. ● Solution: ○ Define parallel schemas, that have specific types redefined in Hive Metastore ■ E.g. ● Can redefine decimal type to be double for Presto*. ● This parallel schema would be named prod_presto. ○ Still points to same underlying data. 3. Schemas for all Tools * It didn’t use to have functioning decimal support
  • 69. ● Problem: ○ How do you handle schema evolution with 80+ Data Scientists? ■ E.g. ● Add a new column ● Delete an old column 4. Schema Evolution
  • 70. ● Problem: ○ How do you handle schema evolution with 80+ Data Scientists? ■ E.g. ● Add a new column ● Delete an old column ● Solution: ○ Append columns to end of schemas. ○ Rename columns as deprecated -- breaks code, but not data. 4. Schema Evolution
  • 71. ● Wait, what? Redshift? 5. Redshift
  • 72. ● Wait, what? Redshift? ○ Predates use of Spark & Presto ○ Redshift was brought in to help joining data ■ Previously DS had to load data & perform joins in R/Python ○ Data Scientists loved Redshift too much: ■ It became a huge source of contention ■ Have been migrating “production” off of it 5. Redshift
  • 73. ● Need: ○ Still want to use Redshift for ad-hoc analysis ● Problem: ○ How do we keep data on S3 in sync with Redshift? 5. Redshift
  • 74. ● Need: ○ Still want to use Redshift for ad-hoc analysis ● Problem: ○ How do we keep data on S3 in sync with Redshift? ● Solution: ○ API that abstracts syncing data with Redshift ■ Keeps schemas in sync ■ Uses standard data warehouse staged table insertion pattern 5. Redshift
  • 75. ● What does our integration with Spark look like? 6. Spark
  • 76. ● What does our integration with Spark look like? ○ Running on Amazon EMR using Netflix's Genie ■ Prod & Dev clusters ○ S3 still source of truth ■ Have custom write API: ● Enforces Batch IDs ● Scala based library making use of EMRFS ● Also exposed in Python for PySpark use ○ Heavy users of Spark SQL ○ It’s the main production workhorse 6. Spark
  • 79. Data Scientist’s Ad-hoc workflow The faster this iteration cycle, the faster Data Scientists can work
  • 80. Data Scientist’s Ad-hoc workflow Scaling this part The faster this iteration cycle, the faster Data Scientists can work
  • 81. Workstation Env. Mgmt. Contention Points Low Memory & CPU Medium Medium High High Ad hoc Infra: Options Laptop
  • 82. Workstation Env. Mgmt. Contention Points Low Memory & CPU Medium Isolation High High Ad hoc Infra: Options Laptop Shared Instances
  • 83. Workstation Env. Mgmt. Contention Points Low Memory & CPU Medium Isolation High Time & Money Ad hoc Infra: Options Laptop Shared Instances Individual Instances
  • 84. Workstation Env. Mgmt. Contention Points Low Memory & CPU Medium Isolation Low Time & Money Ad hoc Infra: Options Laptop Shared Instances Individual Instances
  • 85. ● Control of environment ○ Data Scientists don’t need to worry about env. ● Isolation ○ can host many docker containers on a single machine. ● Better host management ○ allowing central control of machine types. Why Docker?
  • 86. ● Has: ○ Our internal API libraries ○ Jupyter Hub Notebooks: ■ Pyspark, IPython, R, Javscript, Toree ○ Python libs: ■ scikit, numpy, scipy, pandas, etc. ○ RStudio ○ R libs: ■ Dplyr, magrittr, ggplot2, lme4, BOOT, etc. ● Mounts User NFS ● User has terminal access to file system via Jupyter for git, pip, etc. Ad-Hoc Docker Image
  • 87. Self Service Ad-hoc Infra: Flotilla
  • 88. Jupyter Hub on Flotilla
  • 90. Browser Based Terminal on Flotilla
  • 91. Flotilla Deployment ● Amazon ECS for cluster management. ● EC2 Instances: ○ Custom AMI based on ECS optimized docker image. ● Runs in a single Auto Scale Group. ● S3 backed self-hosted Artifactory as docker repository. ● Docker + Amazon ECS unlocks access to lots of CPU & Memory for DS!
  • 95. ● Docker tightly integrates with the Linux Kernel. ○ Hypothesis: ■ Anything that makes uninterruptable calls to the kernel can: ● Break the ECS agent because the container doesn’t respond. ● Break isolation between containers. ■ E.g. Mounting NFS ● Docker Hub: ○ Weren’t happy with performance ○ Switched to artifactory Docker Problems So Far
  • 97. ● S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse. ● Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists. ● Docker is used to provide a consistent environment for Data Scientists to use. ● Docker + ECS enables a self-service ad-hoc platform for Data Scientists. In Summary - Reducing Contention
  • 98. Fin; Thanks! Questions? @stefkrawczyk Try out Stitch Fix → stitchfix.com/referral/8406746