[Open Source] Hamilton, a micro
framework for creating dataframes,
and its application at Stitch Fix
Stefan Krawczyk, Mgr. Model Lifecycle Platform, Stitch Fix
What to keep in mind for the next ~30 minutes?
1. Hamilton is a new paradigm to create dataframes*.
2. Using Hamilton is a productivity boost for teams.
3. It’s open source - join us on:
Github: https://github.com/stitchfix/hamilton
Discord: https://discord.gg/wCqxqBqn73
#sfhamilton #MLOps #machinelearning 2
* in fact, any python object really.
Talk Outline:
> Backstory: who, what, & why
Hamilton
Hamilton @ Stitch Fix
Pro tips
Extensions
Backstory: who
Forecasting, Estimation, & Demand (FED) Team
● Data Scientists that are responsible for forecasts that help the
business make operational decisions.
○ e.g. staffing levels
● One of the oldest teams at Stitch Fix.
Backstory: what
Forecasting, Estimation, & Demand (FED) Team
FED workflow:
Backstory: what
Creating a dataframe for time-series modeling.
Biggest problems here
This is what Hamilton helped solve!
Backstory: why
What is this dataframe & why is it causing 🔥?
(not big data)
Columns are functions of other columns:
g(f(A,B), …)
h(g(f(A,B), …), …)
etc. 🔥
Backstory: why
Featurization: some example code
df = load_dates() # load date ranges
df = load_actuals(df) # load actuals, e.g. spend, signups
df['holidays'] = is_holiday(df['year'], df['week']) # holidays
df['avg_3wk_spend'] = df['spend'].rolling(3).mean() # moving average of spend
df['spend_per_signup'] = df['spend'] / df['signups'] # spend per person signed up
df['spend_shift_3weeks'] = df['spend'].shift(3) # shift spend because ...
df['spend_shift_3weeks_per_signup'] = df['spend_shift_3weeks'] / df['signups']
def my_special_feature(df: pd.DataFrame) -> pd.Series:
return (df['A'] - df['B'] + df['C']) * weights
df['special_feature'] = my_special_feature(df)
# ...
Now scale this code to 1000+ columns & a growing team 😬
Backstory: why
○ Testing / Unit testing 👎
○ Documentation 👎
○ Code Reviews 👎
○ Onboarding 📈 👎
○ Debugging 📈 👎
Scaling this type of code results in the following:
- lots of heterogeneity in function definitions & behaviors
- inline dataframe manipulations
- code ordering is super important
Backstory - Summary
Code for featurization == 🤯.
Talk Outline:
Backstory: who, what, & why
> Hamilton
Hamilton @ Stitch Fix
Pro tips
Extensions
Hamilton: Code → Directed Acyclic Graph → DF
Code:
import pandas as pd

def holidays(year: pd.Series, week: pd.Series) -> pd.Series:
"""Some docs"""
return some_library(year, week)
def avg_3wk_spend(spend: pd.Series) -> pd.Series:
"""Some docs"""
return spend.rolling(3).mean()
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
"""Some docs"""
return spend / signups
def spend_shift_3weeks(spend: pd.Series) -> pd.Series:
"""Some docs"""
return spend.shift(3)
def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series:
"""Some docs"""
return spend_shift_3weeks / signups
Hamilton: a new paradigm
1. Write functions!
2. Function name == output column name.
3. Function input parameters == input columns.
Hamilton: a new paradigm
4. Use type hints for type checking.
5. Documentation is easy and natural.
Hamilton: code to directed acyclic graph - how?
1. Inspect the module to extract function names & parameters.
2. Nodes & edges + graph theory 101.
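The two steps above can be sketched in a few lines of plain Python. This is a toy illustration, not Hamilton's actual implementation, and the stand-in functions (`spend_per_signup`, `avg_3wk_spend`) are simplified to plain floats:

```python
import inspect

# Toy stand-ins for functions you'd write in a Hamilton module.
def spend_per_signup(spend: float, signups: float) -> float:
    return spend / signups

def avg_3wk_spend(spend: float) -> float:
    return spend  # placeholder body

def build_dependencies(*funcs):
    """Map each node (function name) to the nodes it depends on (its parameter names)."""
    return {f.__name__: list(inspect.signature(f).parameters) for f in funcs}

deps = build_dependencies(spend_per_signup, avg_3wk_spend)
# deps == {'spend_per_signup': ['spend', 'signups'], 'avg_3wk_spend': ['spend']}
```

From this mapping, the nodes and edges of the DAG fall out directly.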
Hamilton: directed acyclic graph to DF - how?
1. Specify outputs & provide inputs.
2. Determine the execution path.
3. Execute each function once.
4. Combine results at the end.
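A minimal sketch of this execution model, again as a toy (not Hamilton's real code): resolve each requested output by recursively computing its inputs, caching results so every function runs at most once.

```python
import inspect

# Toy "module" of Hamilton-style functions.
def avg_spend(spend: float, weeks: float) -> float:
    return spend / weeks

def double_avg_spend(avg_spend: float) -> float:
    return avg_spend * 2

FUNCS = {f.__name__: f for f in (avg_spend, double_avg_spend)}

def execute(outputs, inputs):
    cache = dict(inputs)  # seed with provided inputs

    def resolve(name):
        if name not in cache:
            fn = FUNCS[name]
            # Recursively resolve the parameters this node needs.
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**kwargs)  # each node computed at most once
        return cache[name]

    return {name: resolve(name) for name in outputs}

result = execute(['double_avg_spend'], {'spend': 10.0, 'weeks': 2.0})
# result == {'double_avg_spend': 10.0}
```

Only the DAG path needed for the requested outputs is walked — nothing else runs.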
Hamilton: Key Point to remember (1)
Hamilton requires function names & function parameter names to match, in order to stitch together a directed acyclic graph.
Hamilton: Key Point to remember (2)
Hamilton users do not have to maintain how computation is wired together to produce the outputs they request.
Hamilton: in one sentence
A user-friendly dataflow paradigm.
Hamilton: why is it called Hamilton?
Naming things is hard...
1. Associations with “FED”:
a. Alexander Hamilton is the father of the Fed.
b. FED models business mechanics.
2. We’re doing some basic graph theory.
apropos Hamilton
Example Hamilton Code
So you can get a feel for this paradigm...
Basic code - defining “Hamilton” functions
def holidays(year: pd.Series, week: pd.Series) -> pd.Series:
"""Some docs"""
return some_library(year, week)
def avg_3wk_spend(spend: pd.Series) -> pd.Series:
"""Some docs"""
return spend.rolling(3).mean()
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
"""Some docs"""
return spend / signups
def spend_shift_3weeks(spend: pd.Series) -> pd.Series:
"""Some docs"""
return spend.shift(3)
def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series:
"""Some docs"""
return spend_shift_3weeks / signups
my_functions.py
Driver code - how do you get the DataFrame?
import importlib

from hamilton import driver
config_and_initial_data = { # pass in config, initial data (or load data via funcs)
'C': 3, # a config variable
'signups': ..., # can pass in initial data – or pass in at execute time.
...
'year': ...
}
module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions`
module = importlib.import_module(module_name) # The python file to crawl
dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
df = dr.execute(output_columns) # only walk DAG for what is needed
Can visualize what we’re executing too! Just add one more line to the driver script:

dr.execute_visualization(output_columns, './dag.dot', {...render config…})
Open Source: try it for yourself!
> pip install sf-hamilton
Get started in <15 minutes!
Documentation - https://hamilton-docs.gitbook.io/
Example: https://github.com/stitchfix/hamilton/tree/main/examples/hello_world
Hamilton: Summary
1. A user-friendly dataflow paradigm.
2. Users write functions that define a DAG through function & parameter names.
3. Hamilton handles execution of the DAG.
Talk Outline:
Backstory: who, what, & why
Hamilton
> Hamilton @ Stitch Fix
Pro tips
Extensions
Hamilton @ SF - after 2+ years in production
Stitch Fix FED + Hamilton:
Original project goals:
● Improve ability to test ✅
● Improve documentation ✅
● Improve development workflow ✅
Why was it a home run?
Testing & Documentation
Output “column” → One function:
1. Single place to find logic.
2. Single function that needs to be tested.
3. Function signature makes providing inputs very easy!
a. Function names & input parameters mean something!
4. Functions naturally come with a place for documentation!
⇒ Everything is naturally unit testable!
⇒ Everything is naturally documentation friendly!
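To see why this is naturally unit testable, here is what a test might look like for one of the deck’s functions — the function is copied from my_functions.py, the test itself is a hypothetical sketch:

```python
import pandas as pd

# From my_functions.py earlier in the deck.
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Spend per person signed up."""
    return spend / signups

def test_spend_per_signup():
    # The signature tells us exactly what inputs to construct.
    spend = pd.Series([10.0, 20.0])
    signups = pd.Series([2.0, 4.0])
    assert spend_per_signup(spend, signups).tolist() == [5.0, 5.0]

test_spend_per_signup()
```

No dataframe setup, no mocking — just the named inputs the signature asks for.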
Workflow improvements
What Hamilton also easily enabled:
● Ability to visualize computation
● Faster debug cycles
● Better Onboarding / Collaboration
○ Bonus:
■ Central Feature Definition Store
Visualization
What if you have 4000+ columns to compute?
Hamilton makes this easy to visualize!
(zoomed out here to obscure names)
Can also create `DOT` files for export to other visualization packages.
Debugging these functions is easy!
Can also import functions into other contexts to help debug.
e.g. in your REPL:
from my_functions import spend_shift_3weeks
output = spend_shift_3weeks(...)
Collaborating on these functions is easy!
Easy to assess impact & changes when:
● names mean something
● adding a new input
● changing the name of a function
● adding a brand new function
● deleting a function
⇒ Code reviews are much faster!
⇒ Easy to pick up where others left off!
Stitch Fix FED’s Central Feature Definition Store
A nice byproduct of using Hamilton!
How they use it:
1. Function names follow team convention.
a. e.g. a D_ prefix indicates a date feature.
2. It’s organized into thematic modules, e.g. date_features.py.
a. Allows for easily working on different parts of the DAG.
3. It’s in a central repository & versioned by git:
a. Can easily find/use/reuse features!
b. Can easily recreate features from different points in time.
FED Testimonials
Just in case you don’t believe me
Testimonial (1)
Danielle Q.
“the encapsulation of the logic in a single named function makes
adding nodes/edges simple to understand, communicate, and transfer
knowledge”
E.g.:
● Pull Requests are easy to review.
● Onboarding is easy.
Testimonial (2)
Shelly J.
“I like how easy-breezy it is to add new nodes/edges to the DAG to
support evolving business needs.”
E.g.
● new marketing push & we need to add a new feature:
○ this takes minutes, not hours!
Hamilton @ Stitch Fix
FED Impact Summary
FED Impact Summary
With Hamilton, the FED Team gained:
● Naturally testable code. Always. ✅
● Naturally documentable code. ✅
● Dataflow visualization for free. ✅
● Faster debug cycles. ✅
● A better onboarding & collaboration experience ✅
○ Central Feature Definition Store as a by-product! ✅
-----------------------------------------------------------------------
Total ⚾ Home Run!
Claim: by using Hamilton, the FED team can continue to scale their code base without impacting team productivity.
Question: is that true of your feature code base?
Talk Outline:
Backstory: who, what, & why
Hamilton
Hamilton @ Stitch Fix
> Pro Tips
Extensions
Pro Tips - Five things to help you use Hamilton
1. Using it within your own ETL system
2. Migrating to Hamilton
3. Three key things to grok
4. Code organization & python modules
5. Function modifiers
1. Using Hamilton within your ETL system
ETL Framework compatibility:
● any ETL system that runs Python 3.6+.
E.g. Airflow ✅
Metaflow ✅
Dagster ✅
Prefect ✅
Kubeflow ✅
etc. ✅
1. Using Hamilton within your ETL system
ETL Recipe:
1. Write Hamilton functions & “driver” code.
2. Publish your Hamilton functions in a package,
or import them via other means (e.g. check out a repository).
3. Include sf-hamilton as a Python dependency.
4. Have your ETL system execute your “driver” code.
5. Profit.
2. Migrating to Hamilton: (1) CI for comparisons
Create a way to easily & frequently compare results.
1. Integrate with continuous integration (CI) system if you can.
2. 🔎🐛 Helps diagnose bugs in your old & new code early & often.
2. Migrating to Hamilton: (2) Custom Wrapper
Wrap Hamilton in a custom class to match your existing API.
1. When migrating, avoid making too many changes.
2. Allows you to easily insert Hamilton into your context.
3. Key Concepts to Grok: (1) Common Index
If creating a DataFrame as output:
● Hamilton relies on the series index to join columns properly.
Best practice:
1. Load data.
2. Transform/ensure indexes match.
3. Continue with transformations.
At Stitch Fix – this meant a common DateTime index.
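The best practice above can be sketched like this (toy data on assumed date ranges — the exact alignment strategy is yours to choose):

```python
import pandas as pd

# Two inputs loaded over different date ranges (assumed toy data).
spend = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.date_range("2022-01-03", periods=3, freq="W"),
)
signups = pd.Series(
    [1.0, 2.0, 4.0, 8.0],
    index=pd.date_range("2022-01-03", periods=4, freq="W"),
)

# Step 2: make the indexes match before any further transformations.
common = spend.index.intersection(signups.index)
spend, signups = spend.reindex(common), signups.reindex(common)

assert spend.index.equals(signups.index)
spend_per_signup = spend / signups  # downstream joins now line up cleanly
```

Once every series shares the index, Hamilton’s final join produces no surprise NaNs.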
3. Key Concepts to Grok (2): Naming
Function Naming:
1. Creates your DAG.
2. Drives collaboration & code reuse.
3. Serves as documentation itself.
Key thought:
● You don’t need to get naming right the first time.
○ You can easily search & replace code as your thinking evolves.
● But it is something to converge on as a team!
3. Key Concepts to Grok: (3) Output immutability
Functions are only called once:
1. To preserve “immutability” of outputs, don’t mutate passed-in data structures.
e.g. if you are passed a pandas Series, don’t mutate it.
2. Otherwise YMMV with debugging.
Best practice:
1. Test for this in your unit tests!
2. If you do mutate inputs, clearly document it.
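A hypothetical unit test for the immutability best practice — snapshot the input, call the function, and assert the input is unchanged:

```python
import pandas as pd

# From my_functions.py earlier in the deck.
def spend_shift_3weeks(spend: pd.Series) -> pd.Series:
    """Returns a new series; pandas .shift() does not mutate its input."""
    return spend.shift(3)

def test_input_not_mutated():
    spend = pd.Series([1.0, 2.0, 3.0, 4.0])
    before = spend.copy()
    spend_shift_3weeks(spend)
    # The passed-in series must be untouched.
    pd.testing.assert_series_equal(spend, before)

test_input_not_mutated()
```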
4. Code Organization & Python Modules
1. Functions are grouped into modules.
2. Modules are used as input to create a DAG.
Use this to your development advantage!
1. Use modules to model team thinking, e.g. date_features.py.
2. Helps isolate what you’re working on.
3. Enables you to replace parts of your DAG easily for different contexts:
dr = driver.Driver(config_and_initial_data, dates, marketing, spend)
5. Function Modifiers; a.k.a. decorators
The @(...) above a function:
● Hamilton has a bunch to modify function behavior [docs]
from hamilton.function_modifiers import extract_columns
@extract_columns(*my_list_of_column_names)
def load_spend_data(location: str) -> pd.DataFrame:
"""Some docs"""
return pd.read_csv(location, ...)
Learn to use them:
● Functionality:
○ e.g. splitting a dataframe into columns
● Keeping code DRY
● FED favorite: @config.when
Talk Outline:
Backstory: who, what, & why
Hamilton
Hamilton @ Stitch Fix
Pro Tips
> Extensions
Extensions - Why?
Initial Hamilton shortcomings:
1. Single threaded.
2. Could not scale to “big data”.
3. Could only produce Pandas DataFrames.
4. Did not leverage all the richness of metadata in the graph.
Extensions
1. Recent work
○ Scaling Computation
○ Removing the need for pandas
○ “Row based” execution
2. Planned extensions
Extensions: Recent Work
Covering a few things we recently released
Extensions - Scaling Computation
Hamilton grew up with a single-core, in-memory limitation
● Blocker to adoption for some.
Goal: to not modify Hamilton code to scale.
E.g. for creating Pandas DFs “it should just work” (on Spark, Dask, Ray, etc.)
Goal: take the my_functions.py code shown earlier – and scale it without changing it.
Extensions - Scaling Computation
⭐ Lucky for us:
● Hamilton functions are generally very amenable for
distributed computation.
● Pandas has a lot of support for scaling.
Extensions - Scaling Computation
What’s in the 1.3.0 Release:
● Experimental versions of Hamilton on:
○ Dask (cores + data)
○ Koalas [Pandas API] on Spark 3.2+ (cores + data)
○ Ray (cores + data*)
TL;DR:
● Can scale Pandas** out of the box!
* Cluster dependent
** Pandas use & Dask/Koalas dependent
Just how easy it is:
Example using Dask – only modify the “driver” script
import importlib

from dask.distributed import Client
from hamilton import driver
from hamilton.experimental import h_dask
dag_config = {...}
bl_module = importlib.import_module('my_functions') # business logic functions
loader_module = importlib.import_module('data_loader') # functions to load data
client = Client(...)
adapter = h_dask.DaskGraphAdapter(client)
dr = driver.Driver(dag_config, bl_module, loader_module, adapter=adapter)
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
df = dr.execute(output_columns) # only walk DAG for what is needed
Extensions - Custom Return Objects
What if I don’t want a Pandas dataframe returned?
What’s in the 1.3.0 Release:
● Control over what final object is returned!
E.g.
● Dictionary
● Numpy matrix
● Your custom object!
Just how easy it is:
Example Custom Object– only modify “driver” script
import importlib

from hamilton import driver
from hamilton import base
dag_config = {...}
bl_module = importlib.import_module('my_functions') # business logic functions
loader_module = importlib.import_module('data_loader') # functions to load data
adapter = base.SimplePythonGraphAdapter(base.DictResult())  # or your custom class
dr = driver.Driver(dag_config, bl_module, loader_module, adapter=adapter)
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
# creates a dict of {col -> function result}
result_dict = dr.execute(output_columns)
Extensions - “Row Based” Execution
What if:
● I can’t fit everything into memory?
● Want to reuse my graph and call execute within a for loop with
differing input?
What’s in the 1.3.0 Release:
1. Enables you to configure the DAG once,
2. Then call `.execute()` with different inputs.
Enables data chunking & use cases like image processing or NLP.
Just how easy it is:
Example Row Execution– only modify “driver” script
import importlib

from hamilton import driver
config_and_initial_data = {...}
module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions`
module = importlib.import_module(module_name) # The python file to crawl
dr = driver.Driver(config_and_initial_data, module) # instantiate driver once.
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
dataset = load_dataset()
for data_chunk in dataset:
df = dr.execute(output_columns, inputs=data_chunk) # rerun execute on data chunks
print(df)
Extensions - Recent Work Summary
Available as of 1.3.0 release:
● Distributed execution
○ Experimental versions of: Dask, Koalas on Spark, Ray
● General purpose framework with custom return objects:
○ Can return [numpy, pytorch, dicts, etc]
● Row based execution!
○ Chunk over large data sets
○ Process things one at a time, e.g. images, text.
Extensions: Planned Work
What we’re thinking about next
Extensions - Planned Work
In no particular order:
● Numba integration (github issue)
● Data quality ala pandera (github issue)
● Lineage surfacing tools (github issue)
Extensions - Planned Work
Numba:
- Numba makes your code run much faster.
Task: wrap Hamilton functions with numba.jit and compile the
graph for speedy execution!
E.g. Scale your numpy & simple python code to:
● GPUs
● C/Fortran like speeds!
Extensions - Planned Work
Data Quality:
- Runtime inspection of data is a possibility.
Task: incorporate expectations, ala Pandera, on functions.
e.g.
@check_output({'type': float, 'range': (0.0, 10000.0)})
def SOME_IMPORTANT_OUTPUT(input1: pd.Series, input2: pd.Series) -> pd.Series:
"""Does some complex logic"""
Extensions - Planned Work
Lineage surfacing tools:
- Want to ask questions of the metadata we have
Task: provide classes/functions to expose this information.
E.g.
GDPR/PII questions:
● Where is this PII used and how?
Development questions:
● What happens if I change X, what impacts could it have?, etc.
Extensions - Planned Work
Please vote (❤, 👍, etc) for what extensions we should
prioritize!
https://github.com/stitchfix/hamilton/issues
To Conclude
Some TL;DRs
To Conclude
1. Hamilton is a new paradigm to describe data flows.
2. It grew out of a need to tame a feature code base.
3. The Hamilton paradigm can provide teams with multiple
productivity improvements & scales with code bases.
4. With the 1.3.0 release it’s now a scalable general purpose
framework.
Thanks for listening – would love your feedback!
> pip install sf-hamilton
⭐ on github
☑ create & vote on issues on github
📣 join us on discord
(https://discord.gg/wCqxqBqn73)
Thank you! Questions?
Try out Stitch Fix → goo.gl/Q3tCQ3
@stefkrawczyk
linkedin.com/in/skrawczyk

[open source] hamilton, a micro framework for creating dataframes, and its application at stitch fix

  • 1. [Open Source] Hamilton, a micro framework for creating dataframes, and its application at Stitch Fix Stefan Krawczyk, Mgr. Model Lifecycle Platform, Stitch Fix
  • 2. What to keep in mind for the next ~30 minutes? 1. Hamilton is a new paradigm to create dataframes*. 2. Using Hamilton is a productivity boost for teams. 3. It’s open source - join us on: Github: https://github.com/stitchfix/hamilton Discord: https://discord.gg/wCqxqBqn73 #sfhamilton #MLOps #machinelearning 2 * in fact, any python object really.
  • 3. Talk Outline: > Backstory: who, what, & why Hamilton Hamilton @ Stitch Fix Pro tips Extensions
  • 4. Backstory: who #sfhamilton #MLOps #machinelearning 4 Forecasting, Estimation, & Demand (FED)Team ● Data Scientists that are responsible for forecasts that help the business make operational decisions. ○ e.g. staffing levels ● One of the oldest teams at Stitch Fix.
  • 5. Backstory: what #sfhamilton #MLOps #machinelearning 5 Forecasting, Estimation, & Demand (FED)Team FED workflow:
  • 6. FED workflow: + == Backstory: what #sfhamilton #MLOps #machinelearning 6 Forecasting, Estimation, & Demand Team
  • 7. Backstory: what #sfhamilton #MLOps #machinelearning Creating a dataframe for time-series modeling. 7 Biggest problems here
  • 8. Backstory: what #sfhamilton #MLOps #machinelearning 8 Creating a dataframe for time-series modeling. What Hamilton helped solve! Biggest problems here
  • 9. Backstory: why #sfhamilton #MLOps #machinelearning 9 What is this dataframe & why is it causing 🔥 ? (not big data)
  • 10. Backstory: why #sfhamilton #MLOps #machinelearning 10 What is this dataframe & why is it causing 🔥 ? Columns are functions of other columns
  • 11. Backstory: why #sfhamilton #MLOps #machinelearning 11 What is this dataframe & why is it causing 🔥 ? g(f(A,B), …) h(g(f(A,B), …), …) etc🔥 Columns are functions of other columns:
  • 12. Backstory: why #sfhamilton #MLOps #machinelearning 12 Featurization: some example code df = load_dates() # load date ranges df = load_actuals(df) # load actuals, e.g. spend, signups df['holidays'] = is_holiday(df['year'], df['week']) # holidays df['avg_3wk_spend'] = df['spend'].rolling(3).mean() # moving average of spend df['spend_per_signup'] = df['spend'] / df['signups'] # spend per person signed up df['spend_shift_3weeks'] = df.spend['spend'].shift(3) # shift spend because ... df['spend_shift_3weeks_per_signup'] = df['spend_shift_3weeks'] / df['signups'] def my_special_feature(df: pd.DataFrame) -> pd.Series: return (df['A'] - df['B'] + df['C']) * weights df['special_feature'] = my_special_feature(df) # ...
  • 13. Backstory: why #sfhamilton #MLOps #machinelearning 13 Featurization: some example code df = load_dates() # load date ranges df = load_actuals(df) # load actuals, e.g. spend, signups df['holidays'] = is_holiday(df['year'], df['week']) # holidays df['avg_3wk_spend'] = df['spend'].rolling(3).mean() # moving average of spend df['spend_per_signup'] = df['spend'] / df['signups'] # spend per person signed up df['spend_shift_3weeks'] = df.spend['spend'].shift(3) # shift spend because ... df['spend_shift_3weeks_per_signup'] = df['spend_shift_3weeks'] / df['signups'] def my_special_feature(df: pd.DataFrame) -> pd.Series: return (df['A'] - df['B'] + df['C']) * weights df['special_feature'] = my_special_feature(df) # ... Now scale this code to 1000+ columns & a growing team 😬
  • 14. Backstory: why ○ Testing / Unit testing 👎 ○ Documentation 👎 ○ Code Reviews 👎 ○ Onboarding 📈 👎 ○ Debugging 📈 👎 #sfhamilton #MLOps #machinelearning 14 df = load_dates() # load date ranges df = load_actuals(df) # load actuals, e.g. spend, signups df['holidays'] = is_holiday(df['year'], df['week']) # holidays df['avg_3wk_spend'] = df['spend'].rolling(3).mean() # moving average of spend df['spend_per_signup'] = df['spend'] / df['signups'] # spend per person signed up df['spend_shift_3weeks'] = df.spend['spend'].shift(3) # shift spend because ... df['spend_shift_3weeks_per_signup'] = df['spend_shift_3weeks'] / df['signups'] def my_special_feature(df: pd.DataFrame) -> pd.Series: return (df['A'] - df['B'] + df['C']) * weights df['special_feature'] = my_special_feature(df) # ... Scaling this type of code results in the following: - lots of heterogeneity in function definitions & behaviors - inline dataframe manipulations - code ordering is super important
  • 15. Backstory - Summary #sfhamilton #MLOps #machinelearning 15 Code for featurization == 🤯.
  • 16. Talk Outline: Backstory: who, what, & why > Hamilton The Outcome Pro tips Extensions
  • 17. Hamilton: Code → Directed Acyclic Graph → DF Code: DAG: DataFrame: #sfhamilton #MLOps #machinelearning 17 def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups User Hamilton User
  • 18. Hamilton: a new paradigm 1. Write functions! 2. Function name == output column 3. Function inputs == input columns #sfhamilton #MLOps #machinelearning 18
  • 19. Hamilton: a new paradigm 4. Use type hints for typing checking. 5. Documentation is easy and natural. #sfhamilton #MLOps #machinelearning 19
  • 20. Hamilton: code to directed acyclic graph - how? 1. Inspect module to extract function names & parameters. 2. Nodes & edges + graph theory 101. #sfhamilton #MLOps #machinelearning 20
  • 21. Hamilton: directed acyclic graph to DF - how? 1. Specify outputs & provide inputs. 2. Determine execution path. 3. Execute functions once. 4. Combine at the end. #sfhamilton #MLOps #machinelearning 21
  • 22. Hamilton: Key Point to remember (1) Hamilton requires: 1. Function names 2. & Function parameter names to match to stitch together a directed acyclic graph. #sfhamilton #MLOps #machinelearning 22
  • 23. Hamilton Hamilton: Key Point to remember (2) Hamilton users: do not have to maintain how to connect computation with the outputs required. #sfhamilton #MLOps #machinelearning 23 def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups
  • 24. Hamilton: in one sentence A user friendly dataflow paradigm. #sfhamilton #MLOps #machinelearning 24
  • 25. Hamilton: why is it called Hamilton? #sfhamilton #MLOps #machinelearning Naming things is hard... 1. Associations with “FED”: a. Alexander Hamilton is the father of the Fed. b. FED models business mechanics. 2. We’re doing some basic graph theory. apropos Hamilton 25
  • 26. Example Hamilton Code So you can get a feel for this paradigm... #sfhamilton #MLOps #machinelearning 26
  • 27. Basic code - defining “Hamilton” functions def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups #sfhamilton #MLOps #machinelearning 27 my_functions.py
  • 28. Basic code - defining “Hamilton” functions def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups #sfhamilton #MLOps #machinelearning 28 Output Column Input Column my_functions.py
  • 29. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed #sfhamilton #MLOps #machinelearning 29
  • 30. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed #sfhamilton #MLOps #machinelearning 30
  • 31. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed #sfhamilton #MLOps #machinelearning 31
  • 32. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed #sfhamilton #MLOps #machinelearning 32
  • 33. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed #sfhamilton #MLOps #machinelearning 33
  • 34. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute() time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed #sfhamilton #MLOps #machinelearning 34
  • 35. Driver code - how do you get the Dataframe? from hamilton import driver config_and_initial_data = { # pass in config, initial data (or load data via funcs) 'C': 3, # a config variable 'signups': ..., # can pass in initial data – or pass in at execute time. ... 'year': ... } module_name = 'my_functions' # e.g. my_functions.py; can instead `import my_functions` module = importlib.import_module(module_name) # The python file to crawl dr = driver.Driver(config_and_initial_data, module) # can pass in multiple modules output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature'] df = dr.execute(output_columns) # only walk DAG for what is needed dr.execute_visualization(output_columns, './dag.dot', {...render config…}) #sfhamilton #MLOps #machinelearning 35 Can visualize what we’re executing too!
  • 36. Open Source: try it for yourself! > pip install sf-hamilton #sfhamilton #MLOps #machinelearning 36 Get started in <15 minutes! Documentation - https://hamilton-docs.gitbook.io/ Example https://github.com/stitchfix/hamilton/tree/main/examples/hello_world
  • 37. Hamilton: Summary 1. A user friendly dataflow paradigm. 2. Users write functions that create a DAG through function & parameter names. 3. Hamilton handles execution of the DAG. #sfhamilton #MLOps #machinelearning 37
  • 38. Talk Outline: Backstory: who, what, & why Hamilton > Hamilton @ Stitch Fix Pro tips Extensions
  • 39. Hamilton @ SF - after 2+ years in production #sfhamilton #MLOps #machinelearning 39
  • 40. Stitch Fix FED + Hamilton: Original project goals: ● Improve ability to test ● Improve documentation ● Improve development workflow #sfhamilton #MLOps #machinelearning 40 ✅ ✅ ✅
  • 41. Why was it a home run? 41
  • 42. Testing & Documentation Output “column” → One function: 1. Single place to find logic. 2. Single function that needs to be tested. 3. Function signature makes providing inputs very easy! a. Function names & input parameters mean something! 4. Functions naturally come with a place for documentation! ⇒ Everything is naturally unit testable! ⇒ Everything is naturally documentation friendly! #sfhamilton #MLOps #machinelearning 42
  • 43. Workflow improvements #sfhamilton #MLOps #machinelearning 43 What Hamilton also easily enabled: ● Ability to visualize computation ● Faster debug cycles ● Better Onboarding / Collaboration ○ Bonus: ■ Central Feature Definition Store
  • 44. What if you have 4000+ columns to compute? Visualization #sfhamilton #MLOps #machinelearning 44
  • 45. What if you have 4000+ columns to compute? Hamilton makes this easy to visualize! (zoomed out here to obscure names) Visualization #sfhamilton #MLOps #machinelearning 45
  • 46. What if you have 4000+ columns to compute? Hamilton makes this easy to visualize! (zoomed out here to obscure names) Visualization #sfhamilton #MLOps #machinelearning 46 can create `DOT` files for export to other visualization packages →
  • 47. Debugging these functions is easy! def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups #sfhamilton #MLOps #machinelearning 47 my_functions.py Can also import functions into other contexts to help debug. e.g. in your REPL: from my_functions import spend_shift_3weeks output = spend_shift_3weeks(...)
  • 48. Collaborating on these functions is easy! def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups #sfhamilton #MLOps #machinelearning 48 my_functions.py Easy to assess impact & changes when: ● names mean something ● adding a new input ● changing the name of a function ● adding a brand new function ● deleting a function ⇒ Code reviews are much faster! ⇒ Easy to pick up where others left off!
  • 49. Stitch Fix FED’s Central Feature Definition Store A nice byproduct of using Hamilton! How they use it: 1. Function names follow team convention. a. e.g. D_ prefix indicates date feature #sfhamilton #MLOps #machinelearning 49
  • 50. Stitch Fix FED’s Central Feature Definition Store A nice byproduct of using Hamilton! How they use it: 1. Function names follow team convention. 2. It’s organized into thematic modules, e.g. date_features.py. a. Allows for working on different part of the DAG easily #sfhamilton #MLOps #machinelearning 50
  • 51. Stitch Fix FED’s Central Feature Definition Store A nice byproduct of using Hamilton! How they use it: 1. Function names follow team convention. 2. It’s organized into thematic modules, e.g. date_features.py. 3. It’s in a central repository & versioned by git: a. Can easily find/use/reuse features! b. Can recreate features from different points in time easily. #sfhamilton #MLOps #machinelearning 51
  • 52. FED Testimonials Just in case you don’t believe me #sfhamilton #MLOps #machinelearning 52
  • 53. Testimonial (1) Danielle Q. “the encapsulation of the logic in a single named function makes adding nodes/edges simple to understand, communicate, and transfer knowledge” E.g.: ● Pull Requests are easy to review. ● Onboarding is easy. #sfhamilton #MLOps #machinelearning 53
  • 54. Testimonial (2) Shelly J. “I like how easy-breezy it is to add new nodes/edges to the DAG to support evolving business needs.” E.g. ● new marketing push & we need to add a new feature: ○ this takes minutes, not hours! #sfhamilton #MLOps #machinelearning 54
  • 55. Hamilton @ Stitch Fix FED Impact Summary #sfhamilton #MLOps #machinelearning 55
  • 56. FED Impact Summary With Hamilton, the FED Team gained: ● Naturally testable code. Always. ✅ ● Naturally documentable code. ✅ ● Dataflow visualization for free. ✅ ● Faster debug cycles. ✅ ● A better onboarding & collaboration experience ✅ ○ Central Feature Definition Store as a byproduct! ✅ ----------------------------------------------------------------------- Total ⚾ Home Run! #sfhamilton #MLOps #machinelearning 56
  • 57. FED Impact Summary With Hamilton, the FED Team gained: ● Naturally testable code. Always. ✅ ● Naturally documentable code. ✅ ● Dataflow visualization for free. ✅ ● Faster debug cycles. ✅ ● A better onboarding & collaboration experience ✅ ○ Central Feature Definition Store as a byproduct! ✅ ----------------------------------------------------------------------- Total ⚾ Home Run! #sfhamilton #MLOps #machinelearning 57 By using Hamilton, the FED team can continue to scale their code base without impacting team productivity. Question: is that true of your feature code base?
  • 58. Talk Outline: Backstory: who, what, & why Hamilton Hamilton @ Stitch Fix > Pro Tips Extensions
  • 59. Pro Tips - Five things to help you use Hamilton 1. Using it within your own ETL system 2. Migrating to Hamilton 3. Three key things to grok 4. Code organization & python modules 5. Function modifiers #sfhamilton #MLOps #machinelearning 59
  • 60. 1. Using Hamilton within your ETL system ETL Framework compatibility: ● all ETL systems that run python 3.6+. E.g. Airflow ✅ Metaflow ✅ Dagster ✅ Prefect ✅ Kubeflow ✅ etc. ✅ #sfhamilton #MLOps #machinelearning 60
  • 61. 1. Using Hamilton within your ETL system ETL Recipe: 1. Write Hamilton functions & “driver” code. 2. Publish your Hamilton functions in a package, or import via other means (e.g. checkout a repository). 3. Include sf-hamilton as a python dependency 4. Have your ETL system execute your “driver” code. 5. Profit. #sfhamilton #MLOps #machinelearning 61
  • 62. 2. Migrating to Hamilton: (1) CI for comparisons Create a way to easily & frequently compare results. 1. Integrate with continuous integration (CI) system if you can. 2. 🔎🐛 Helps diagnose bugs in your old & new code early & often. #sfhamilton #MLOps #machinelearning 62
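A lightweight sketch of such a comparison harness, runnable in CI: `old_pipeline` and `new_pipeline` are hypothetical stand-ins for your legacy featurization code and the Hamilton driver's output, and the comparison uses pandas' own testing utilities.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def old_pipeline() -> pd.DataFrame:
    # Stand-in for the legacy featurization code.
    return pd.DataFrame({'spend': [10.0, 20.0], 'signups': [1.0, 2.0]})

def new_pipeline() -> pd.DataFrame:
    # Stand-in for the Hamilton driver's dr.execute(...) output.
    return pd.DataFrame({'spend': [10.0, 20.0], 'signups': [1.0, 2.0]})

def test_pipelines_match():
    old_df = old_pipeline()
    new_df = new_pipeline()
    # Sort columns so ordering differences don't mask real value mismatches.
    assert_frame_equal(old_df[sorted(old_df.columns)],
                       new_df[sorted(new_df.columns)])

test_pipelines_match()
```

Wiring this test into your CI system turns every pull request into an old-vs-new comparison, surfacing discrepancies early.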
  • 63. 2. Migrating to Hamilton: (2) Custom Wrapper Wrap Hamilton in a custom class to match your existing API. 1. When migrating, avoid making too many changes. 2. Allows you to easily insert Hamilton into your context. #sfhamilton #MLOps #machinelearning 63
  • 64. 3. Key Concepts to Grok: (1) Common Index If creating a DataFrame as output: ● Hamilton relies on the series index to join columns properly. Best practice: 1. Load data. 2. Transform/ensure indexes match. 3. Continue with transformations. At Stitch Fix – this meant a common DateTime index. #sfhamilton #MLOps #machinelearning 64
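As an illustrative sketch of the best practice above (the series names and date ranges here are hypothetical): load data, align everything onto one shared DatetimeIndex, then transform.

```python
import pandas as pd

# 1. Load data -- two inputs that arrive covering different date ranges.
spend = pd.Series([10.0, 20.0, 30.0],
                  index=pd.date_range('2022-01-02', periods=3, freq='W'))
signups = pd.Series([1.0, 2.0, 3.0, 4.0],
                    index=pd.date_range('2022-01-02', periods=4, freq='W'))

# 2. Reindex everything onto one common DatetimeIndex up front.
common_index = spend.index.union(signups.index)
spend = spend.reindex(common_index)
signups = signups.reindex(common_index)

# 3. Downstream transformations now join correctly on that index;
#    the final week is NaN because spend has no value there.
spend_per_signup = spend / signups
```

Skipping step 2 is a common source of silent NaNs or misaligned rows in the final dataframe, since pandas joins on the index during arithmetic.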
  • 65. 3. Key Concepts to Grok (2): Naming Function Naming: 1. Creates your DAG. 2. Drives collaboration & code reuse. 3. Serves as documentation itself. Key thought: ● Don’t need to get this right the first time ○ Can easily search & replace code as your thinking evolves. ● But it is something to converge thinking on! #sfhamilton #MLOps #machinelearning 65
  • 66. 3. Key Concepts to Grok: (3) Output immutability Functions are only called once: 1. To preserve “immutability” of outputs, don’t mutate passed-in data structures. e.g. if you get passed a pandas series, don’t mutate it. 2. Otherwise YMMV with debugging. Best practice: 1. Test for this in your unit tests! 2. Clearly document it if you do mutate inputs. #sfhamilton #MLOps #machinelearning 66
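A unit test for this property can be as simple as snapshotting the input and asserting it is unchanged after the call. The function below is `spend_shift_3weeks` from the earlier slides; the test itself is a sketch of the pattern.

```python
import pandas as pd

def spend_shift_3weeks(spend: pd.Series) -> pd.Series:
    """Some docs -- .shift() returns a new Series, leaving `spend` untouched."""
    return spend.shift(3)

def test_spend_shift_3weeks_does_not_mutate_input():
    spend = pd.Series([1.0, 2.0, 3.0, 4.0])
    snapshot = spend.copy(deep=True)
    spend_shift_3weeks(spend)
    # The passed-in series must be exactly what it was before the call.
    pd.testing.assert_series_equal(spend, snapshot)

test_spend_shift_3weeks_does_not_mutate_input()
```

The same snapshot-and-compare pattern works for any input type with a cheap deep copy (dicts, numpy arrays, dataframes).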
  • 67. 1. Functions are grouped into modules. 2. Modules are used as input to create a DAG. Use this to your development advantage! 1. Use modules to model team thinking, e.g. date_features.py. 2. Helps isolate what you’re working on. 3. Enables you to replace parts of your DAG easily for different contexts. dr = driver.Driver(config_and_initial_data, dates, marketing, spend) 4. Code Organization & Python Modules #sfhamilton #MLOps #machinelearning 67
  • 68. 5. Function Modifiers; a.k.a. decorators

The @(...) above a function:
● Hamilton has a bunch to modify function behavior [docs]

from hamilton.function_modifiers import extract_columns

@extract_columns(*my_list_of_column_names)
def load_spend_data(location: str) -> pd.DataFrame:
    """Some docs"""
    return pd.read_csv(location, ...)

Learn to use them:
● Functionality:
○ e.g. splitting a dataframe into columns
● Keeping code DRY
● FED favorite @config.when

#sfhamilton #MLOps #machinelearning 68
  • 69. Talk Outline: Backstory: who, what, & why Hamilton Hamilton @ Stitch Fix Pro Tips > Extensions
  • 70. Extensions - Why? Initial Hamilton shortcomings: 1. Single threaded. 2. Could not scale to “big data”. 3. Could only produce Pandas DataFrames. 4. Did not leverage all the richness of metadata in the graph. #sfhamilton #MLOps #machinelearning 70
  • 71. Extensions 1. Recent work ○ Scaling Computation ○ Removing the need for pandas ○ “Row based” execution 2. Planned extensions #sfhamilton #MLOps #machinelearning 71
  • 72. Extensions: Recent Work Covering a few things we recently released #sfhamilton #MLOps #machinelearning 72
  • 73. Extensions - Scaling Computation Hamilton grew up with a single core, in memory limitation ● Blocker to adoption for some. Goal: to not modify Hamilton code to scale. E.g. for creating Pandas DFs “it should just work” (on Spark, Dask, Ray, etc.) #sfhamilton #MLOps #machinelearning 73
  • 74. Take this code – and scale it without changing it

def holidays(year: pd.Series, week: pd.Series) -> pd.Series:
    """Some docs"""
    return some_library(year, week)

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Some docs"""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Some docs"""
    return spend / signups

def spend_shift_3weeks(spend: pd.Series) -> pd.Series:
    """Some docs"""
    return spend.shift(3)

def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series:
    """Some docs"""
    return spend_shift_3weeks / signups

my_functions.py

#sfhamilton #MLOps #machinelearning 74
  • 75. Extensions - Scaling Computation Hamilton grew up with a single core, in memory limitation ● Blocker to adoption for some. Goal: to not modify Hamilton code to scale. ⭐ Lucky for us: ● Hamilton functions are generally very amenable for distributed computation. ● Pandas has a lot of support for scaling. #sfhamilton #MLOps #machinelearning 75
  • 76. Extensions - Scaling Computation What’s in the 1.3.0 Release: ● Experimental versions of Hamilton on: ○ Dask (cores + data) ○ Koalas [Pandas API] on Spark 3.2+ (cores + data) ○ Ray (cores + data*) TL;DR: ● Can scale Pandas** out of the box! #sfhamilton #MLOps #machinelearning 76 * Cluster dependent ** Pandas use & Dask/Koalas dependent
  • 77. Just how easy it is: Example using Dask – only modify the “driver” script

from dask.distributed import Client
from hamilton import driver
from hamilton.experimental import h_dask

dag_config = {...}
bl_module = importlib.import_module('my_functions')  # business logic functions
loader_module = importlib.import_module('data_loader')  # functions to load data
client = Client(...)
adapter = h_dask.DaskGraphAdapter(client)
dr = driver.Driver(dag_config, bl_module, loader_module, adapter=adapter)
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
df = dr.execute(output_columns)  # only walk DAG for what is needed

#sfhamilton #MLOps #machinelearning 77
  • 78. Extensions - Custom Return Objects What if I don’t want a Pandas dataframe returned? What’s in the 1.3.0 Release: ● Control over what the final object is returned! E.g. ● Dictionary ● Numpy matrix ● Your custom object! #sfhamilton #MLOps #machinelearning 78
  • 79. Just how easy it is: Example Custom Object – only modify “driver” script

from hamilton import driver
from hamilton import base

dag_config = {...}
bl_module = importlib.import_module('my_functions')  # business logic functions
loader_module = importlib.import_module('data_loader')  # functions to load data
adapter = base.SimplePythonGraphAdapter(base.DictResult())  # or your custom class
dr = driver.Driver(dag_config, bl_module, loader_module, adapter=adapter)
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
# creates a dict of {col -> function result}
result_dict = dr.execute(output_columns)

#sfhamilton #MLOps #machinelearning 79
  • 80. Extensions - “Row Based” Execution What if: ● I can’t fit everything into memory? ● Want to reuse my graph and call execute within a for loop with differing input? What’s in the 1.3.0 Release: 1. Enables you to configure the DAG once, 2. Then call `.execute()` with different inputs. Enables data chunking & use cases like image processing or NLP. #sfhamilton #MLOps #machinelearning 80
  • 81. Just how easy it is: Example Row Execution – only modify “driver” script

from hamilton import driver

config_and_initial_data = {...}
module_name = 'my_functions'  # e.g. my_functions.py; can instead `import my_functions`
module = importlib.import_module(module_name)  # The python file to crawl
dr = driver.Driver(config_and_initial_data, module)  # instantiate driver once
output_columns = ['year','week',...,'spend_shift_3weeks_per_signup','special_feature']
dataset = load_dataset()
for data_chunk in dataset:
    df = dr.execute(output_columns, inputs=data_chunk)  # rerun execute on data chunks
    print(df)

#sfhamilton #MLOps #machinelearning 81
  • 82. Extensions - Recent Work Summary Available as of 1.3.0 release: ● Distributed execution ○ Experimental versions of: Dask, Koalas on Spark, Ray ● General purpose framework with custom return objects: ○ Can return [numpy, pytorch, dicts, etc] ● Row based execution! ○ Chunk over large data sets ○ Process things one at a time, e.g. images, text. #sfhamilton #MLOps #machinelearning 82
  • 83. Extensions: Planned Work What we’re thinking about next #sfhamilton #MLOps #machinelearning 83
  • 84. In no particular order: ● Numba integration (github issue) ● Data quality ala pandera (github issue) ● Lineage surfacing tools (github issue) Extensions - Planned Work #sfhamilton #MLOps #machinelearning 84
  • 85. Extensions - Planned Work Numba: - Numba makes your code run much faster. Task: wrap Hamilton functions with numba.jit and compile the graph for speedy execution! E.g. Scale your numpy & simple python code to: ● GPUs ● C/Fortran like speeds! #sfhamilton #MLOps #machinelearning 85
  • 86. Extensions - Planned Work Data Quality: - Runtime inspection of data is a possibility. Task: incorporate expectations, à la Pandera, on functions. e.g.

@check_output({'type': float, 'range': (0.0, 10000.0)})
def SOME_IMPORTANT_OUTPUT(input1: pd.Series, input2: pd.Series) -> pd.Series:
    """Does some complex logic"""

#sfhamilton #MLOps #machinelearning 86
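The `@check_output` decorator on this slide is proposed, not shipped. Purely as an illustration of the idea, a minimal runtime-check decorator (hypothetical, not Hamilton's API) could look like:

```python
import functools
import pandas as pd

def check_output(expectations: dict):
    """Illustrative only: validate element type and value range of a Series output."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if 'type' in expectations:
                if not all(isinstance(v, expectations['type']) for v in result):
                    raise TypeError(f'{fn.__name__}: unexpected element type')
            if 'range' in expectations:
                low, high = expectations['range']
                if not result.between(low, high).all():
                    raise ValueError(f'{fn.__name__}: output outside [{low}, {high}]')
            return result
        return wrapper
    return decorator

@check_output({'type': float, 'range': (0.0, 10000.0)})
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Some docs"""
    return spend / signups
```

A library like Pandera would add richer schemas (dtypes, nullability, statistical checks); the decorator shape is the point here.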
  • 87. Extensions - Planned Work Lineage surfacing tools: - Want to ask questions of the metadata we have Task: provide classes/functions to expose this information. E.g. GDPR/PII questions: ● Where is this PII used and how? Development questions: ● What happens if I change X, what impacts could it have?, etc. #sfhamilton #MLOps #machinelearning 87
  • 88. Extensions - Planned Work Please vote (❤, 👍, etc) for what extensions we should prioritize! #sfhamilton #MLOps #machinelearning 88 https://github.com/stitchfix/hamilton/issues
  • 89. To Conclude Some TL;DRs #sfhamilton #MLOps #machinelearning 89
  • 90. To Conclude 1. Hamilton is a new paradigm to describe data flows. 2. It grew out of a need to tame a feature code base. 3. The Hamilton paradigm can provide teams with multiple productivity improvements & scales with code bases. 4. With the 1.3.0 release it’s now a scalable general purpose framework. #sfhamilton #MLOps #machinelearning 90
  • 91. Thanks for listening – would love your feedback! #sfhamilton #MLOps #machinelearning 91 > pip install sf-hamilton ⭐ on github ☑ create & vote on issues on github 📣 join us on discord (https://discord.gg/wCqxqBqn73)
  • 92. Thank you! Questions? Try out Stitch Fix → goo.gl/Q3tCQ3 @stefkrawczyk linkedin.com/in/skrawczyk