A compute infrastructure for
Data Scientists
Neelesh Salian
Stitch Fix
About Me
This talk..
● Data @ Stitch Fix
● Data Warehouse
● Compute Infrastructure
● Questions
Data @ Stitch Fix
Stitch Fix
● Personal styling service. We serve men & women, plus sizes.
● Founded in 2011, Led by CEO & Founder, Katrina Lake
● Employ more than 5,800 nationwide (USA)
● Algorithms + Humans
● Data Science is our culture
Algorithms Team
1. Data Science
○ Vertical teams based on functionality
○ Enablers of Data for the business
○ Cross-functional collaboration
2. Data Platform
○ Building a Platform for Data Science
○ Responsible for Access, Ingestion and Tooling
Read More
https://multithreaded.stitchfix.com/
http://algorithms-tour.stitchfix.com/
Challenges - Usability is Key
● Bringing SQL into Spark
● Training Data Scientists’ with Spark basics
● Assistance in Tuning recommendations to optimize
● Maintain latest releases for features and fixes
Data Warehouse
Data Warehouse Details
● S3 is the source of truth
● HiveMetastore used to store tabular information - schema, metadata
● Python and R based Readers and Writers
● Presto is used to read i.e. ad-hoc queries
● Spark is used for testing/ production complex applications
Compute Infrastructure
Spark
Genie
Genie
● Redirection Layer
● From Netflix
● Open Source - 2.x
● Used for Spark jobs - Load balance, Maintain Versions, Maintaining
Multi-Tenant environment
Genie Data Model - Applications
● Custom Builds of Spark
● Spark 2.x may refer 2.1.0 or 2.2.0
Applications
Genie Data Model - Commands
● Equivalent of spark-submit for each version
● Tags (aliases)
● Test commands with different configurations
Commands
Genie Data Model - Clusters
● Red/ Black Cluster deployment
● No disruption to Data Scientists’ workflow
Commands
Genie Data Model
Applications
Commands Clusters
Jobs
Babylon
Sheriff of Babylon
● Job Submissions
● Genie client library
● Overrides to Spark default values
● Run Commands
1. run_spark
2. run_sql
3. run_with_json
Additional Tools
● Custom SparkContext
● Metastore Library
● Table Statistics
● Python Query API
● Diagnostic Tool
Spaceman - Centralized Spark History
Server
Spaceman
● Archiving Log Data from all clusters
● Singular entity to access Spark logs
Spaceman - Features
● Runs for each cluster - easy to add new clusters
● Checks for completed applications at intervals
● Stores History of Apps from all clusters for a month
Readers and Writers
Readers and Writers
● Python and R based clients
● Uses pyarrow while using pandas dataframe for reading and writing tabular
data
● Backed by Livy - reduces bootstrapping with Genie
● Interfaces with the HiveMetastore
We are Hiring
https://multithreaded.stitchfix.com/careers/
Thank you

A compute infrastructure for data scientists

  • 1.
    A compute infrastructurefor Data Scientists Neelesh Salian Stitch Fix
  • 2.
  • 3.
    This talk.. ● Data@ Stitch Fix ● Data Warehouse ● Compute Infrastructure ● Questions
  • 4.
  • 5.
    Stitch Fix ● Personalstyling service. We serve men & women, plus sizes. ● Founded in 2011, Led by CEO & Founder, Katrina Lake ● Employ more than 5,800 nationwide (USA) ● Algorithms + Humans ● Data Science is our culture
  • 7.
    Algorithms Team 1. DataScience ○ Vertical teams based on functionality ○ Enablers of Data for the business ○ Cross-functional collaboration 2. Data Platform ○ Building a Platform for Data Science ○ Responsible for Access, Ingestion and Tooling Read More https://multithreaded.stitchfix.com/ http://algorithms-tour.stitchfix.com/
  • 8.
    Challenges - Usabilityis Key ● Bringing SQL into Spark ● Training Data Scientists’ with Spark basics ● Assistance in Tuning recommendations to optimize ● Maintain latest releases for features and fixes
  • 9.
  • 11.
    Data Warehouse Details ●S3 is the source of truth ● HiveMetastore used to store tabular information - schema, metadata ● Python and R based Readers and Writers ● Presto is used to read i.e. ad-hoc queries ● Spark is used for testing/ production complex applications
  • 12.
  • 13.
  • 15.
  • 16.
    Genie ● Redirection Layer ●From Netflix ● Open Source - 2.x ● Used for Spark jobs - Load balance, Maintain Versions, Maintaining Multi-Tenant environment
  • 17.
    Genie Data Model- Applications ● Custom Builds of Spark ● Spark 2.x may refer 2.1.0 or 2.2.0 Applications
  • 18.
    Genie Data Model- Commands ● Equivalent of spark-submit for each version ● Tags (aliases) ● Test commands with different configurations Commands
  • 19.
    Genie Data Model- Clusters ● Red/ Black Cluster deployment ● No disruption to Data Scientists’ workflow Commands
  • 20.
  • 21.
  • 22.
    Sheriff of Babylon ●Job Submissions ● Genie client library ● Overrides to Spark default values ● Run Commands 1. run_spark 2. run_sql 3. run_with_json
  • 23.
    Additional Tools ● CustomSparkContext ● Metastore Library ● Table Statistics ● Python Query API ● Diagnostic Tool
  • 24.
    Spaceman - CentralizedSpark History Server
  • 25.
    Spaceman ● Archiving LogData from all clusters ● Singular entity to access Spark logs
  • 26.
    Spaceman - Features ●Runs for each cluster - easy to add new clusters ● Checks for completed applications at intervals ● Stores History of Apps from all clusters for a month
  • 27.
  • 28.
    Readers and Writers ●Python and R based clients ● Uses pyarrow while using pandas dataframe for reading and writing tabular data ● Backed by Livy - reduces bootstrapping with Genie ● Interfaces with the HiveMetastore
  • 30.
  • 31.