Data Science
in the Cloud
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
November 2016
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/stitchfix-cloud
Presented at QCon San Francisco
www.qconsf.com
Who are Data Scientists?
Means: skills vary wildly
But they’re in demand and expensive
“The Sexiest Job of the 21st Century” - HBR
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
How many Data Scientists do you have?
At Stitch Fix we have ~80
~85% have not done formal CS
But what do they do?
What is Stitch Fix?
Two Data Scientist facts:
1. Has AWS console access*.
2. End to end, they’re responsible.
How do we enable this without
?
Make doing the right thing the easy thing.
Fellow Collaborators
Horizontal team focused on Data Scientist Enablement
1. Eng. Skills
2. Important
3. What they work on
Let’s Start
Will Only Cover
1. Source of truth: S3 & Hive Metastore
2. Docker Enabled DS @ Stitch Fix
3. Scaling DS doing ML in the Cloud
Source of truth:
S3 & Hive Metastore
Want Everyone to Have Same View
[Diagram: A and B]
This is Usually Nothing to Worry About
● OS handles correct access
● DB has ACID properties
● But it’s easy to outgrow these options with a big data/team.
S3
● Amazon’s Simple Storage Service
● Infinite* storage
● Can write, read, delete, BUT NOT append.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Scales well
* For all intents and purposes
Hive Metastore
● Hadoop service, that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
Hive Metastore: sold_items
Partition    Location
20161001     s3://bucket/sold_items/20161001
...
20161031     s3://bucket/sold_items/20161031
But if we’re not careful
● Replacing data in a partition
[Diagram: A and B]
● S3 is eventually consistent
● These bugs are hard to track down
Hive Metastore to the Rescue
● Use Hive Metastore to control partition source of truth
● Principles:
○ Never delete
○ Always write to a new place each time a partition changes
● Stitch Fix solution:
○ Use an inner directory → called Batch ID
Batch ID Pattern
sold_items
Date        Location
20161001    s3://bucket/sold_items/20161001/20161002002334/
...         ...
20161031    s3://bucket/sold_items/20161031/20161101002256/
            → s3://bucket/sold_items/20161031/20161102234252
● Overwriting a partition is just a matter of updating the location
● To the user this is a hidden inner directory
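A minimal sketch of the Batch ID pattern above, using in-memory dictionaries as stand-ins for S3 and the Hive Metastore. The names `BatchIdTable`, `write_partition` and `read_partition` are hypothetical and not Stitch Fix's API. The sketch only shows that each write lands in a fresh inner directory and that readers follow the metastore pointer.

```python
from datetime import datetime, timezone


class BatchIdTable:
    """Toy model of one table: an object store plus a partition -> location map."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self.object_store = {}  # stand-in for S3: location -> data
        self.metastore = {}     # stand-in for Hive Metastore: partition -> location

    def write_partition(self, partition, data, now=None):
        # The batch ID is a timestamp, so every write gets a new inner directory.
        moment = now or datetime.now(timezone.utc)
        batch_id = moment.strftime("%Y%m%d%H%M%S")
        location = f"{self.base_uri}/{partition}/{batch_id}/"
        if location in self.object_store:
            raise ValueError(f"refusing to overwrite {location}")
        self.object_store[location] = data    # 1. write to a new place
        self.metastore[partition] = location  # 2. then repoint the partition
        return location

    def read_partition(self, partition):
        # Readers only follow the pointer; they never list directories.
        return self.object_store[self.metastore[partition]]


table = BatchIdTable("s3://bucket/sold_items")
first = table.write_partition(
    "20161031", ["row-a"], now=datetime(2016, 11, 1, 0, 22, 56))
second = table.write_partition(
    "20161031", ["row-a", "row-b"], now=datetime(2016, 11, 2, 23, 42, 52))
```

After the second write the old batch is still in the store, untouched, while `read_partition` returns the new data.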
Enforce via API
API for Data Scientists
Python:
store_dataframe(df, dest_db, dest_table, partitions=['2016'])
df = load_dataframe(src_db, src_table, partitions=['2016'])
R:
sf_writer(data = result,
          namespace = dest_db,
          resource = dest_table,
          partitions = c(as.integer(opt$ETL_DATE)))
sf_reader(namespace = src_db,
          resource = src_table,
          partitions = c(as.integer(opt$ETL_DATE)))
Batch ID Pattern Benefits
● Full partition history
○ Can roll back
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily
■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
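A rough sketch of how keeping every batch could give rollback, an audit trail and pinning. The class `PartitionHistory` and its methods are hypothetical illustrations, not the talk's implementation.

```python
class PartitionHistory:
    """Keeps every location a partition has pointed to, newest last."""

    def __init__(self):
        self._history = {}  # partition -> list of (batch_id, location)

    def record(self, partition, batch_id, location):
        self._history.setdefault(partition, []).append((batch_id, location))

    def current(self, partition):
        return self._history[partition][-1][1]

    def rollback(self, partition):
        # Repoint to the previous batch by appending it again, so the
        # audit trail shows the rollback and the bad batch is not lost.
        entries = self._history[partition]
        if len(entries) < 2:
            raise ValueError("no earlier batch to roll back to")
        entries.append(entries[-2])
        return entries[-1][1]

    def at_batch(self, partition, batch_id):
        # Lets a downstream consumer pin itself to one specific batch.
        for recorded_id, location in self._history[partition]:
            if recorded_id == batch_id:
                return location
        raise KeyError(batch_id)

    def audit_trail(self, partition):
        return [batch_id for batch_id, _ in self._history[partition]]


history = PartitionHistory()
history.record("20161031", "20161101002256",
               "s3://bucket/sold_items/20161031/20161101002256/")
history.record("20161031", "20161102234252",
               "s3://bucket/sold_items/20161031/20161102234252/")
restored = history.rollback("20161031")
```

Because nothing is deleted, the rollback is only a pointer change, and the trail records all three steps.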
Docker Enabled
DS @ Stitch Fix
Ad hoc Infra: In the Beginning...
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
High High
Ad hoc Infra: Evolution I
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
High High
Ad hoc Infra: Evolution II
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
High High
Ad hoc Infra: Evolution III
Workstation Env. Mgmt. Scalability
Low Low
Medium Medium
Low High
Why Does Docker Lower Overhead?
● Control of environment
○ Data Scientists don’t need to worry about env.
● Isolation
○ Can host many Docker containers on a single machine.
● Better host management
○ Allows central control of machine types.
Flotilla UI
Our Docker Image
● Has:
○ Our internal API libraries
○ Jupyter Notebook:
■ PySpark
■ IPython
○ Python libs:
■ scikit, numpy, scipy, pandas, etc.
○ RStudio
○ R libs:
■ dplyr, magrittr, ggplot2, lme4, boot, etc.
● Mounts User NFS
● User has terminal access to file system via Jupyter for git, pip, etc.
Docker Deployment
Our Docker Problems So Far
● Docker tightly integrates with the Linux Kernel.
○ Hypothesis:
■ Anything that makes uninterruptible calls to the kernel can:
● Break the ECS agent because the container doesn’t respond.
● Break isolation between containers.
■ E.g. Mounting NFS
● Docker Hub:
○ Switched to Artifactory
Scaling
DS doing ML
in the Cloud
1. Data Latency
2. To Batch or Not To Batch
3. What’s in a Model?
Data Latency
How much time do you spend waiting for data?
*This could be a laptop, a shared system, a batch process, etc.
Use Compression
Use Compression - The Components
[ 1.3234543 0.23443434 … ]
[ 1 0 0 1 0 0 … 0 1 0 0
0 1 0 1 ...
… 1 0 1 1 ]
[ 1 0 0 1 0 0 … 0 1 0 0 ]
[ 1.3234543 0.23443434 … ]
{ 100: 0.56, … ,110: 0.65,
… , … , 999: 0.43 }
Use Compression - Python Comparison
Pickle    Zlib+Pickle    JSON     Zlib+JSON
60MB      129KB          15MB     55KB
3.1KB     921B           2.8KB    681B
2.6MB     600KB          769KB    139KB
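A small sketch of how a comparison like the one above could be measured. The test vector is a made-up sparse binary list, so the sizes it prints will not match the slide's numbers. The helper name `encoded_sizes` is an assumption for illustration.

```python
import json
import pickle
import random
import zlib


def encoded_sizes(value):
    """Return the byte size of `value` under four encodings."""
    pickled = pickle.dumps(value)
    as_json = json.dumps(value).encode("utf-8")
    return {
        "pickle": len(pickled),
        "zlib+pickle": len(zlib.compress(pickled)),
        "json": len(as_json),
        "zlib+json": len(zlib.compress(as_json)),
    }


# Hypothetical sparse binary vector: mostly zeros, a few ones.
rng = random.Random(0)
binary_vector = [1 if rng.random() < 0.05 else 0 for _ in range(10_000)]

sizes = encoded_sizes(binary_vector)
for name, size in sizes.items():
    print(f"{name:12s} {size} bytes")
```

Running it on your own payloads is the useful part: the ratio depends heavily on how repetitive the data is.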
Observations
● Naïve scheme of JSON + Zlib works well:
import json
import zlib
...
# compress (zlib works on bytes, so encode the JSON text first)
compressed = zlib.compress(json.dumps(value).encode("utf-8"))
# decompress
original = json.loads(zlib.decompress(compressed).decode("utf-8"))
● Double vs Float: do you really need to store that much precision?
● For more inspiration look to columnar DBs and how they compress columns
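To make the "Double vs Float" question concrete, here is a sketch using only the standard library. The values are illustrative, and the choice of three decimal places is an assumption, not a recommendation from the talk.

```python
from array import array
import json

values = [1.3234543, 0.23443434, 0.56, 0.65, 0.43] * 1000

as_double = array("d", values).tobytes()  # 8 bytes per value
as_float = array("f", values).tobytes()   # 4 bytes per value

# Single precision keeps roughly 7 significant decimal digits.
float_error = abs(array("f", [1.3234543])[0] - 1.3234543)

# Rounding before JSON encoding also shortens the text zlib has to compress.
full_json = json.dumps(values).encode("utf-8")
rounded_json = json.dumps([round(v, 3) for v in values]).encode("utf-8")

print(len(as_double), len(as_float), len(full_json), len(rounded_json))
```

Whether the lost digits matter depends on the model consuming the values, so this is worth checking against downstream accuracy before adopting.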
To Batch or Not To Batch:
When is batch inefficient?
Online & Streamed Computation
● Online:
○ Computation occurs synchronously when needed.
● Streamed:
○ Computation is triggered by one or more events.
Very likely you start with a batch system
● Do you need to recompute:
○ features for all users?
○ predicted results for all users?
● Are you heavily dependent on your ETL running every night?
● Online vs Streamed depends on in-house factors:
○ Number of models
○ How often they change
○ Cadence of output required
○ In-house eng. expertise
○ etc.
We use an online system for recommendations
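The three modes can be contrasted in a few lines. This is a toy sketch: `score` is a hypothetical stand-in for a model prediction, and the function and class names are invented for illustration.

```python
from collections import defaultdict


def score(purchase_count):
    # Hypothetical stand-in for a model prediction.
    return min(1.0, purchase_count / 10)


def batch_scores(purchases_by_user):
    # Batch: recompute the prediction for every user on a schedule.
    return {user: score(len(items)) for user, items in purchases_by_user.items()}


def online_score(purchases_by_user, user):
    # Online: compute synchronously, only for the user who is asking.
    return score(len(purchases_by_user.get(user, [])))


class StreamedScores:
    # Streamed: each incoming event updates only the affected user's result.
    def __init__(self):
        self.counts = defaultdict(int)
        self.scores = {}

    def on_purchase(self, user):
        self.counts[user] += 1
        self.scores[user] = score(self.counts[user])


purchases = {"ann": ["shirt", "hat"], "bob": ["socks"]}
batch = batch_scores(purchases)
ann_online = online_score(purchases, "ann")

stream = StreamedScores()
for user, items in purchases.items():
    for _ in items:
        stream.on_purchase(user)
```

All three arrive at the same numbers here; what differs is when the work happens and how much of it is redone.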
Streamed Example
Online/Streaming Thoughts
● Dedicated infrastructure → More room on batch infrastructure
○ Hopefully $$$ savings
○ Hopefully less stressed Data Scientists
● Requires better software engineering practices
○ Code portability/reuse
○ Designing APIs/Tools Data Scientists will use
● Prototyping on AWS Lambda & Kinesis was surprisingly quick
○ Need to compile C libs on an Amazon Linux instance
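For a sense of what a Lambda and Kinesis prototype involves, here is a sketch of a handler. Kinesis delivers each record's payload base64-encoded under `record["kinesis"]["data"]`. The JSON payload with `user_id` and `type` fields and the `process` function are assumptions made for illustration.

```python
import base64
import json


def process(message):
    # Hypothetical per-event computation; a real one might update a feature.
    return {"user_id": message["user_id"], "event_seen": message["type"]}


def handler(event, context):
    """Entry point in the shape AWS Lambda uses for Python functions."""
    results = []
    for record in event["Records"]:
        # Decode the base64 payload, then parse it as JSON (assumed format).
        payload = base64.b64decode(record["kinesis"]["data"])
        results.append(process(json.loads(payload)))
    return {"processed": len(results), "results": results}


# Local check with a hand-built event, no AWS account needed.
encoded = base64.b64encode(
    json.dumps({"user_id": 42, "type": "purchase"}).encode("utf-8")
).decode("ascii")
fake_event = {"Records": [{"kinesis": {"data": encoded}}]}
response = handler(fake_event, None)
```

Keeping `process` separate from the handler is one way to get the code portability mentioned above: the same function can be called from a batch job.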
What’s in a Model?
Scaling model knowledge
Ever:
● Had someone leave and then nobody understood how they trained their models?
○ Or you didn’t remember yourself?
● Had a performance dip in models and had trouble figuring out why?
○ Or not known what’s changed between model deployments?
● Wanted to compare model performance over time?
● Wanted to train a model in R/Python/Spark and then deploy it to a webserver?
Produce Model Artifacts
● Isn’t that just saving the coefficients/model values?
○ NO!
● Why?
How do you deal with organizational drift?
Makes it easy to keep an archive and track changes over time
Helps a lot with model debugging & diagnosis!
Can more easily use in downstream processes
● Analogous to software libraries
● Packaging:
○ Zip/Jar file
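A sketch of what a zip-packaged model artifact might contain beyond the coefficients. The file names, the metadata fields and the helpers `package_model` and `load_model` are all hypothetical; the talk does not specify the artifact layout.

```python
import io
import json
import zipfile


def package_model(coefficients, metadata):
    """Bundle model values and the context needed to understand them."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("model.json", json.dumps(coefficients, sort_keys=True))
        archive.writestr("metadata.json", json.dumps(metadata, sort_keys=True))
    return buffer.getvalue()


def load_model(artifact_bytes):
    with zipfile.ZipFile(io.BytesIO(artifact_bytes)) as archive:
        coefficients = json.loads(archive.read("model.json"))
        metadata = json.loads(archive.read("metadata.json"))
    return coefficients, metadata


coefficients = {"intercept": 0.1, "weights": {"100": 0.56, "110": 0.65}}
metadata = {
    # Illustrative fields: which data and code produced this model.
    "training_table": "sold_items",
    "training_batch_id": "20161102234252",
    "code_version": "<git sha goes here>",
    "features": ["100", "110"],
}
artifact = package_model(coefficients, metadata)
loaded_coefficients, loaded_metadata = load_model(artifact)
```

Recording the training batch ID ties the artifact back to the exact data it was trained on, which is what makes later comparison and debugging possible.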
But all the above seems complex?
We’re building APIs.
Fin; Questions?
@stefkrawczyk
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
stitchfix-cloud

More Related Content

What's hot

SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...
Codemotion
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
Jonathan Holloway
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
Kristian Alexander
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson
 
Cassandra advanced data modeling
Cassandra advanced data modelingCassandra advanced data modeling
Cassandra advanced data modeling
Romain Hardouin
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
EDB
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
Karissa Rae McKelvey
 
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Big Data Spain
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
maikroeder
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
Databricks
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
Dan Lynn
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Spark Summit
 
Flink Forward Europe 2019 - Berlin
Flink Forward Europe 2019 - BerlinFlink Forward Europe 2019 - Berlin
Flink Forward Europe 2019 - Berlin
David Morin
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 
A quick review of Python and Graph Databases
A quick review of Python and Graph DatabasesA quick review of Python and Graph Databases
A quick review of Python and Graph Databases
Nicholas Crouch
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
Nathan Bijnens
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Databricks
 

What's hot (20)

SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan...
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Cassandra advanced data modeling
Cassandra advanced data modelingCassandra advanced data modeling
Cassandra advanced data modeling
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
 
Intro to Python Data Analysis in Wakari
Intro to Python Data Analysis in WakariIntro to Python Data Analysis in Wakari
Intro to Python Data Analysis in Wakari
 
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
Flink Forward Europe 2019 - Berlin
Flink Forward Europe 2019 - BerlinFlink Forward Europe 2019 - Berlin
Flink Forward Europe 2019 - Berlin
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
A quick review of Python and Graph Databases
A quick review of Python and Graph DatabasesA quick review of Python and Graph Databases
A quick review of Python and Graph Databases
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 

Similar to Data Science in the Cloud @StitchFix

Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
Alluxio, Inc.
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
NETWAYS
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Stefan Krawczyk
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
Monal Daxini
 
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Demi Ben-Ari
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
Claudiu Coman
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
aspyker
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
Drew Hansen
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Demi Ben-Ari
 
Virtual training intro to InfluxDB - June 2021
Virtual training  intro to InfluxDB  - June 2021Virtual training  intro to InfluxDB  - June 2021
Virtual training intro to InfluxDB - June 2021
InfluxData
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
confluent
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Demi Ben-Ari
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Codemotion
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Demi Ben-Ari
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Ahmed Ossama
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
NETWAYS
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 

Similar to Data Science in the Cloud @StitchFix (20)

Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-AriThinking DevOps in the Era of the Cloud - Demi Ben-Ari
Thinking DevOps in the Era of the Cloud - Demi Ben-Ari
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Virtual training intro to InfluxDB - June 2021
Virtual training  intro to InfluxDB  - June 2021Virtual training  intro to InfluxDB  - June 2021
Virtual training intro to InfluxDB - June 2021
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion...
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 

Data Science in the Cloud @StitchFix

  • 1. Data Science in the Cloud Stefan Krawczyk @stefkrawczyk linkedin.com/in/skrawczyk November 2016
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/stitchfix-cloud
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. Who are Data Scientists?
  • 9. But they’re in demand and expensive
  • 10. “The Sexiest Job of the 21st Century” - HBR https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  • 11. How many Data Scientists do you have?
  • 12. At Stitch Fix we have ~80
  • 13. ~85% have not done formal CS
  • 14. But what do they do?
  • 23. Two Data Scientist facts: 1. Has AWS console access*. 2. End to end, they’re responsible.
  • 24. How do we enable this without ?
  • 25. Make doing the right thing the easy thing.
  • 26. Fellow Collaborators Horizontal team focused on Data Scientist Enablement
  • 27. 1. Eng. Skills 2. Important 3. What they work on
  • 29. Will Only Cover 1. Source of truth: S3 & Hive Metastore 2. Docker Enabled DS @ Stitch Fix 3. Scaling DS doing ML in the Cloud
  • 30. Source of truth: S3 & Hive Metastore
  • 31. Want Everyone to Have Same View A B
  • 33. This is Usually Nothing to Worry About ● OS handles correct access ● DB has ACID properties ● But it’s easy to outgrow these options with a big data/team. A B
  • 34. S3 ● Amazon’s Simple Storage Service ● Infinite* storage ● Can write, read, delete, BUT NOT append. ● Looks like a file system*: ○ URIs: my.bucket/path/to/files/file.txt ● Scales well (* For all intents and purposes)
  • 36. Hive Metastore ● Hadoop service that stores: ○ Schema ○ Partition information, e.g. date ○ Data location for a partition
      Table: sold_items
      Partition | Location
      20161001  | s3://bucket/sold_items/20161001
      ...       | ...
      20161031  | s3://bucket/sold_items/20161031
  • 38. But if we’re not careful ● Replacing data in a partition
  • 41. But if we’re not careful ● S3 is eventually consistent ● These bugs are hard to track down A B
  • 42. Hive Metastore to the Rescue ● Use Hive Metastore to control partition source of truth ● Principles: ○ Never delete ○ Always write to a new place each time a partition changes ● Stitch Fix solution: ○ Use an inner directory → called Batch ID
  • 46. Batch ID Pattern ● Overwriting a partition is just a matter of updating the location ● To the user this is a hidden inner directory
      Table: sold_items
      Date     | Location
      20161001 | s3://bucket/sold_items/20161001/20161002002334/
      ...      | ...
      20161031 | s3://bucket/sold_items/20161031/20161101002256/ → s3://bucket/sold_items/20161031/20161102234252
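The Batch ID pattern can be sketched in a few lines of Python. This is a minimal illustration, not Stitch Fix's implementation: the in-memory `metastore` dict stands in for the Hive Metastore, and `new_batch_location` and `overwrite_partition` are hypothetical names. Each write targets a fresh timestamped inner directory, and the partition is then repointed at it. Nothing is deleted.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the Hive Metastore: table -> partition -> location.
metastore = {"sold_items": {}}


def new_batch_location(base, partition, now=None):
    """Build a fresh location whose inner directory is a timestamp batch ID."""
    now = now or datetime.now(timezone.utc)
    batch_id = now.strftime("%Y%m%d%H%M%S")
    return f"{base}/{partition}/{batch_id}/"


def overwrite_partition(table, partition, base, now=None):
    """Point the partition at a new location; old data is never deleted."""
    location = new_batch_location(base, partition, now)
    # Writing the actual data files to `location` would happen here.
    metastore[table][partition] = location
    return location


# Timestamps chosen to reproduce the batch IDs shown on the slide.
first = overwrite_partition("sold_items", "20161031", "s3://bucket/sold_items",
                            now=datetime(2016, 11, 1, 0, 22, 56))
second = overwrite_partition("sold_items", "20161031", "s3://bucket/sold_items",
                             now=datetime(2016, 11, 2, 23, 42, 52))
print(metastore["sold_items"]["20161031"])
```

Readers that resolve the location through the metastore only ever see a complete batch, because the pointer changes after the new files are in place.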
  • 49. API for Data Scientists
      Python:
        store_dataframe(df, dest_db, dest_table, partitions=['2016'])
        df = load_dataframe(src_db, src_table, partitions=['2016'])
      R:
        sf_writer(data = result, namespace = dest_db, resource = dest_table, partitions = c(as.integer(opt$ETL_DATE)))
        sf_reader(namespace = src_db, resource = src_table, partitions = c(as.integer(opt$ETL_DATE)))
  • 50. Batch ID Pattern Benefits ● Full partition history ○ Can roll back ■ Data Scientists are less afraid of mistakes ○ Can create audit trails more easily ■ What data changed and when ○ Can anchor downstream consumers to a particular batch ID
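Because no batch is ever deleted, rollback and anchoring reduce to bookkeeping over the list of locations a partition has pointed at. The sketch below assumes a hypothetical in-memory `history` structure and hypothetical helper names (`publish`, `current`, `rollback`); a real system would keep this record in the metastore or an audit table.

```python
from collections import defaultdict

# Hypothetical history: (table, partition) -> list of locations, oldest first.
history = defaultdict(list)


def publish(table, partition, location):
    """Record a new batch location; the latest entry is the current one."""
    history[(table, partition)].append(location)


def current(table, partition):
    """Return the location the partition currently points at."""
    return history[(table, partition)][-1]


def rollback(table, partition):
    """Repoint to the previous batch by re-publishing it, keeping the audit trail."""
    locations = history[(table, partition)]
    if len(locations) < 2:
        raise ValueError("no earlier batch to roll back to")
    publish(table, partition, locations[-2])
    return current(table, partition)


publish("sold_items", "20161031", "s3://bucket/sold_items/20161031/20161101002256/")
publish("sold_items", "20161031", "s3://bucket/sold_items/20161031/20161102234252/")

# A downstream consumer can pin the exact batch it read.
anchored = current("sold_items", "20161031")

restored = rollback("sold_items", "20161031")
print(restored, len(history[("sold_items", "20161031")]))
```

Re-publishing the earlier location, rather than popping the latest one, keeps every change visible for auditing.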
  • 51. Docker Enabled DS @ Stitch Fix
  • 52. Ad hoc Infra: In the Beginning... [table: Workstation | Env. Mgmt. | Scalability; ratings Low/Low, Medium/Medium, High/High]
  • 53. Ad hoc Infra: Evolution I [table: Workstation | Env. Mgmt. | Scalability; ratings Low/Low, Medium/Medium, High/High]
  • 54. Ad hoc Infra: Evolution II [table: Workstation | Env. Mgmt. | Scalability; ratings Low/Low, Medium/Medium, High/High]
  • 55. Ad hoc Infra: Evolution III [table: Workstation | Env. Mgmt. | Scalability; ratings Low/Low, Medium/Medium, Low/High]
  • 56. Why Does Docker Lower Overhead? ● Control of environment ○ Data Scientists don’t need to worry about env. ● Isolation ○ Can host many Docker containers on a single machine. ● Better host management ○ Allows central control of machine types.
  • 58. Our Docker Image ● Has: ○ Our internal API libraries ○ Jupyter Notebook: ■ PySpark ■ IPython ○ Python libs: ■ scikit, numpy, scipy, pandas, etc. ○ RStudio ○ R libs: ■ dplyr, magrittr, ggplot2, lme4, BOOT, etc. ● Mounts User NFS ● User has terminal access to file system via Jupyter for git, pip, etc.
  • 62. Our Docker Problems So Far ● Docker tightly integrates with the Linux Kernel. ○ Hypothesis: ■ Anything that makes uninterruptible calls to the kernel can: ● Break the ECS agent because the container doesn’t respond. ● Break isolation between containers. ■ E.g. Mounting NFS ● Docker Hub: ○ Switched to Artifactory
  • 64. 1. Data Latency 2. To Batch or Not To Batch 3. What’s in a Model?
  • 65. Data Latency How much time do you spend waiting for data?
  • 66. *This could be a laptop, a shared system, a batch process, etc.
  • 67. Use Compression *This could be a laptop, a shared system, a batch process, etc.
  • 69. Use Compression - Python Comparison
      Sample components: [ 1.3234543 0.23443434 … ], [ 1 0 0 1 0 0 … 0 1 0 0 0 1 0 1 ... … 1 0 1 1 ], [ 1 0 0 1 0 0 … 0 1 0 0 ], [ 1.3234543 0.23443434 … ], { 100: 0.56, … ,110: 0.65, … , … , 999: 0.43 }
      Serialized sizes, one column per measured group:
      Pickle      | 60MB  | 3.1KB | 2.6MB
      Zlib+Pickle | 129KB | 921B  | 600KB
      JSON        | 15MB  | 2.8KB | 769KB
      Zlib+JSON   | 55KB  | 681B  | 139KB
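A comparison like the one on this slide is easy to repeat on your own data. The sketch below is an assumption-laden stand-in: the sparse 0/1 vector is a made-up sample, not the data measured in the talk, so the sizes it prints will differ from the slide's figures.

```python
import json
import pickle
import random
import zlib

random.seed(0)
# Hypothetical sample: a sparse 0/1 vector, similar in spirit to the slide's examples.
value = [1 if random.random() < 0.05 else 0 for _ in range(100_000)]

pickled = pickle.dumps(value)
as_json = json.dumps(value).encode("utf-8")

# Serialized size in bytes for each scheme, with and without zlib.
sizes = {
    "Pickle": len(pickled),
    "Zlib+Pickle": len(zlib.compress(pickled)),
    "JSON": len(as_json),
    "Zlib+JSON": len(zlib.compress(as_json)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Measuring on a representative sample before choosing a format is cheap, and the ranking can change with the shape of the data.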
  • 72. Observations
      ● Naïve scheme of JSON + Zlib works well:
          import json
          import zlib
          ...
          # compress (zlib works on bytes, so encode the JSON text first)
          compressed = zlib.compress(json.dumps(value).encode("utf-8"))
          # decompress
          original = json.loads(zlib.decompress(compressed).decode("utf-8"))
      ● Double vs Float: do you really need to store that much precision?
      ● For more inspiration look to columnar DBs and how they compress columns
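The double-versus-float question can be checked directly. This sketch uses only the standard library's `array` module on made-up random values (an assumption, not data from the talk) to show the storage halving and the size of the rounding error that comes with it.

```python
from array import array
import random

random.seed(0)
# Hypothetical model values in [0, 1).
values = [random.random() for _ in range(1000)]

as_double = array("d", values)  # 8 bytes per value
as_float = array("f", values)   # 4 bytes per value

# Largest absolute difference introduced by storing single precision.
max_error = max(abs(a - b) for a, b in zip(as_double, as_float))
print(len(as_double.tobytes()), len(as_float.tobytes()), max_error)
```

Whether that error is acceptable depends on the model; for many scores and weights it is well below the noise in the data.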
  • 73. To Batch or Not To Batch: When is batch inefficient?
  • 74. Online & Streamed Computation ● Online: ○ Computation occurs synchronously when needed. ● Streamed: ○ Computation is triggered by one or more events.
  • 79. Online & Streamed Computation: Very likely you start with a batch system ● Do you need to recompute: ○ features for all users? ○ predicted results for all users? ● Are you heavily dependent on your ETL running every night? ● Online vs Streamed depends on in-house factors: ○ Number of models ○ How often they change ○ Cadence of output required ○ In-house eng. expertise ○ etc. ● We use an online system for recommendations
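The three computation styles differ mainly in what triggers the work. The toy sketch below makes that concrete; the feature store, the `predict` placeholder, and all function names are hypothetical and only illustrate the distinction, not how Stitch Fix's systems are built.

```python
# Hypothetical feature store and prediction cache, for illustration only.
features = {"user_1": 2.0, "user_2": 5.0}
predictions = {}


def predict(feature):
    """Placeholder model."""
    return 0.5 * feature


def batch_recompute():
    """Batch: recompute predictions for every user, e.g. nightly."""
    for user, feature in features.items():
        predictions[user] = predict(feature)


def online_predict(user):
    """Online: compute synchronously, only when a request needs it."""
    return predict(features[user])


def on_event(event):
    """Streamed: an event updates one user's feature and prediction."""
    user = event["user"]
    features[user] = event["feature"]
    predictions[user] = predict(features[user])


batch_recompute()
on_event({"user": "user_1", "feature": 4.0})
print(predictions, online_predict("user_2"))
```

Only the batch path touches every user; the online and streamed paths do work proportional to the requests or events that actually arrive.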
  • 86. Online/Streaming Thoughts ● Dedicated infrastructure → More room on batch infrastructure ○ Hopefully $$$ savings ○ Hopefully less stressed Data Scientists ● Requires better software engineering practices ○ Code portability/reuse ○ Designing APIs/Tools Data Scientists will use ● Prototyping on AWS Lambda & Kinesis was surprisingly quick ○ Need to compile C libs on an Amazon Linux instance
  • 87. What’s in a Model? Scaling model knowledge
  • 93. Ever: ● Had someone leave and then nobody understands how they trained their models? ○ Or you didn’t remember yourself? ● Had a performance dip in models and had trouble figuring out why? ○ Or not known what’s changed between model deployments? ● Wanted to compare model performance over time? ● Wanted to train a model in R/Python/Spark and then deploy it to a web server?
  • 102. Produce Model Artifacts ● Isn’t that just saving the coefficients/model values? ○ NO! ● Why? ○ How do you deal with organizational drift? ○ Makes it easy to keep an archive and track changes over time ○ Helps a lot with model debugging & diagnosis! ○ Can more easily use in downstream processes
  • 103. Produce Model Artifacts ● Analogous to software libraries ● Packaging: ○ Zip/Jar file
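One way to read "model artifact" is a single archive that carries the model values together with enough metadata to explain them later. The sketch below is an assumption about what such a package might hold; the file names and metadata fields (`trained_on`, `code_version`, `metric_auc`) are hypothetical and not taken from the talk.

```python
import io
import json
import zipfile


def build_model_artifact(coefficients, metadata):
    """Package model values plus descriptive metadata into one zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("model.json", json.dumps(coefficients))
        archive.writestr("metadata.json", json.dumps(metadata))
    return buffer.getvalue()


def load_model_artifact(blob):
    """Read the model values and metadata back out of the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        model = json.loads(archive.read("model.json").decode("utf-8"))
        meta = json.loads(archive.read("metadata.json").decode("utf-8"))
    return model, meta


# Hypothetical model values and provenance fields.
artifact = build_model_artifact(
    {"intercept": 0.1, "weights": [0.56, 0.65, 0.43]},
    {"trained_on": "sold_items/20161031", "code_version": "abc123", "metric_auc": 0.8},
)
coefficients, metadata = load_model_artifact(artifact)
print(coefficients, metadata)
```

Keeping provenance next to the coefficients is what lets an archived artifact answer "what changed between deployments" without the original author.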
  • 104. But all the above seems complex?