
Data Con LA 2022-Pre-recorded - Hamilton, General Purpose framework for Scalable Feature Engineering

Elijah ben Izzy, Data Platform Engineer at Stitch Fix
Data Engineering
At Stitch Fix, data is integral to every facet of our business. We run a plethora of dataflows to transform raw data into features that models use to serve customers. We need to scale these dataflows both in code complexity, as we add new capabilities, and in data size, as we acquire more customers. To ensure that these workflows don't devolve into an unmaintainable mess of spaghetti code (chock-full of in-place pandas operations), we built and open-sourced Hamilton, a pluggable microframework that makes scaling and managing complex dataflows easy. To use Hamilton, one creates a dataflow by writing simple Python functions in a declarative manner. The framework stitches them together, introducing an abstraction to configure and execute these dataflows. In this talk, we'll present the basic concepts of Hamilton, discuss its impact at Stitch Fix, and share recent extensions to the project, including integrations with Dask, Spark, and Ray, and why we're excited for its future.


  1. 1. Hamilton A General Purpose Micro-Framework for Scalable Feature Engineering (in python!) Elijah ben Izzy Data Platform Engineer, Stitch Fix
  2. 2. 2 Data Con 2021 Stitch Fix Shop at your personal curated store. Check out what you like.
  3. 3. 3 Data Con 2021 Data Science @ Stitch Fix algorithms-tour.stitchfix.com - Data Science powers the user experience - 145+ data scientists and data platform engineers
  4. 4. Hamilton is Open Source code > pip install sf-hamilton Get started in <15 minutes! Documentation https://hamilton-docs.gitbook.io/ Lots of examples: https://github.com/stitchfix/hamilton/tree/main/examples 4
  5. 5. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future The Agenda
  6. 6. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  7. 7. What is Hamilton? A declarative dataflow paradigm. 7
  8. 8. What is Hamilton? declarative: naturally express as readable code dataflow: represent transformation of any data paradigm: a novel approach to writing python 8
  9. 9. What is Hamilton? declarative: naturally express as readable code dataflow: represent transformation of any data paradigm: a novel approach to writing python 9
  10. 10. What is Hamilton? declarative: naturally express as readable code dataflow: represent transformation of any data paradigm: a novel approach to writing python 10
  11. 11. Code: 11 def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups User DAG: Result (e.g. DataFrame): Hamilton User Code → Directed Acyclic Graph → Result
  12. 12. Old Way vs Hamilton Way: Instead of: You declare: 12 df['c'] = df['a'] + df['b'] df['d'] = transform(df['c']) def c(a: pd.Series, b: pd.Series) -> pd.Series: """Sums a with b""" return a + b def d(c: pd.Series) -> pd.Series: """Transforms C to ...""" new_column = _transform_logic(c) return new_column
  13. 13. Instead of: You declare: Inputs == Function Arguments Old Way vs Hamilton Way: 13 df['c'] = df['a'] + df['b'] df['d'] = transform(df['c']) def c(a: pd.Series, b: pd.Series) -> pd.Series: """Sums a with b""" return a + b def d(c: pd.Series) -> pd.Series: """Transforms C to ...""" new_column = _transform_logic(c) return new_column Outputs == Function Name
  14. 14. Full Hello World 14 # feature_logic.py def c(a: pd.Series, b: pd.Series) -> pd.Series: """Sums a with b""" return a + b def d(c: pd.Series) -> pd.Series: """Transforms C to ...""" new_column = _transform_logic(c) return new_column # run.py from hamilton import driver import feature_logic dr = driver.Driver({'a': ..., 'b': ...}, feature_logic) df_result = dr.execute(['c', 'd']) print(df_result) Functions: “Driver” - this actually says what and when to execute:
  15. 15. Hamilton TL;DR: 1. For each transform (=), you write a function(s). 2. Functions declare a DAG. 3. Hamilton handles DAG execution. 15 c d a b # feature_logic.py def c(a: pd.Series, b: pd.Series) -> pd.Series: """Replaces c = a + b""" return a + b def d(c: pd.Series) -> pd.Series: """Replaces d = transform(c)""" new_column = _transform_logic(c) return new_column # run.py from hamilton import driver import feature_logic dr = driver.Driver({'a': ..., 'b': ...}, feature_logic) df_result = dr.execute(['c', 'd']) print(df_result)
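To make the TL;DR concrete, here is a stdlib-only sketch of the core idea (this is an illustrative toy, not Hamilton's actual implementation): function names become node names, parameter names become edges, and execution recursively resolves each requested output. Plain floats stand in for the `pd.Series` values on the slide.

```python
import inspect

def build_dag(funcs):
    """Map each function's name to the function itself.

    Parameter names are the edges: a parameter either matches
    another function's name (a dependency) or must be supplied
    as an input -- this is how "functions declare a DAG".
    """
    return {f.__name__: f for f in funcs}

def execute(dag, name, inputs, cache=None):
    """Resolve `name` by recursively resolving its parameters."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    if name in inputs:
        return inputs[name]
    func = dag[name]
    kwargs = {p: execute(dag, p, inputs, cache)
              for p in inspect.signature(func).parameters}
    cache[name] = func(**kwargs)
    return cache[name]

# The two functions from the slide, with scalar stand-ins.
def c(a, b):
    """Replaces c = a + b"""
    return a + b

def d(c):
    """Replaces d = transform(c)"""
    return c * 2

dag = build_dag([c, d])
result = execute(dag, 'd', inputs={'a': 1, 'b': 2})  # (1 + 2) * 2 == 6
```

The real `driver.Driver` plays the role of `build_dag` plus `execute` here, with type checking, visualization, and pluggable execution layered on top.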
  16. 16. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  17. 17. Backstory: Time Series Forecasting 17 What Hamilton helped solve! Biggest problems here
  18. 18. Backstory: TS → Dataframe Creation 18 Columns are functions of other columns
  19. 19. Backstory: TS → Dataframe Creation 19 g(f(A,B), …) h(g(f(A,B), …), …) etc🔥
  20. 20. Backstory: TS → DF → 🍝 Code 20 df = load_dates() # load date ranges df = load_actuals(df) # load actuals, e.g. spend, signups df['holidays'] = is_holiday(df['year'], df['week']) # holidays df['avg_3wk_spend'] = df['spend'].rolling(3).mean() # moving average of spend df['spend_per_signup'] = df['spend'] / df['signups'] # spend per person signed up df['spend_shift_3weeks'] = df['spend'].shift(3) # shift spend because ... df['spend_shift_3weeks_per_signup'] = df['spend_shift_3weeks'] / df['signups'] def my_special_feature(df: pd.DataFrame) -> pd.Series: return (df['A'] - df['B'] + df['C']) * weights df['special_feature'] = my_special_feature(df) # ... Now scale this code to 1000+ columns & a growing team 😬 Human scaling 😔: ○ Testing / Unit testing 👎 ○ Documentation 👎 ○ Code Reviews 👎 ○ Onboarding 📈 👎 ○ Debugging 📈 👎 Underrated problem!
  21. 21. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  22. 22. Hamilton @ Stitch Fix ● Running in production for 2.5+ years ● Initial use-case manages 4000+ feature definitions ● All feature definitions are: ○ Unit testable ○ Documentation friendly ○ Centrally curated, stored, and versioned in git ● Data science teams ❤ it: ○ Enabled a monthly task to be completed 4x faster ○ Easy to onboard new team members ○ Code reviews are simpler 22
  23. 23. Hamilton @ Stitch Fix 23
  24. 24. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  25. 25. Hamilton + Feature Engineering: Overview featurization training inference ● Can model this all in Hamilton (if you wanted to) ● We’ll just focus on featurization ○ FYI: Hamilton works for any object type. ■ Here we’ll assume pandas for simplicity. ○ E.g. use Hamilton within Airflow, Dagster, Prefect, Flyte, Metaflow, Kubeflow, etc. 25 Load Data Transform into Features Fit Model(s) Use Model(s)
  26. 26. data loading & feature code: 26 def holidays(year: pd.Series, week: pd.Series) -> pd.Series: """Some docs""" return some_library(year, week) def avg_3wk_spend(spend: pd.Series) -> pd.Series: """Some docs""" return spend.rolling(3).mean() def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend / signups def spend_shift_3weeks(spend: pd.Series) -> pd.Series: """Some docs""" return spend.shift(3) def spend_shift_3weeks_per_signup(spend_shift_3weeks: pd.Series, signups: pd.Series) -> pd.Series: """Some docs""" return spend_shift_3weeks / signups via driver: feature dataframe: Modeling Featurization features.py run.py
  27. 27. Code that needs to be written: 1. Functions to load data a. normalize/create common index to join on 2. Transform/feature functions a. Optional: model functions 3. Drivers materialize data a. DAG is walked for only what’s needed Modeling Featurization 27 Data Loaders Feature Functions Drivers
  28. 28. Code that needs to be written: 1. Functions to load data a. normalize/create common index to join on 2. Transform/feature functions a. Optional: model functions 3. Drivers materialize data a. DAG is walked for only what’s needed Modeling Featurization 28 Data Loaders Drivers Feature Functions
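"DAG is walked for only what's needed" can be sketched with a short, stdlib-only traversal (again an illustrative toy, not Hamilton's code): starting from the requested outputs, collect only their ancestors; any other feature function is never visited or computed. The feature-function names below come from the slides; their bodies are stubs.

```python
import inspect

def required_nodes(dag, outputs, inputs):
    """Collect only the functions needed for the requested outputs.

    `dag` maps name -> function; anything named in `inputs` is a
    leaf supplied by a data loader or the driver.
    """
    needed = set()
    stack = list(outputs)
    while stack:
        name = stack.pop()
        if name in needed or name in inputs:
            continue
        needed.add(name)
        # Each parameter is an upstream dependency to visit.
        stack.extend(inspect.signature(dag[name]).parameters)
    return needed

def avg_3wk_spend(spend):
    return spend  # stub

def spend_per_signup(spend, signups):
    return spend / signups

dag = {f.__name__: f for f in (avg_3wk_spend, spend_per_signup)}
# Requesting only spend_per_signup: avg_3wk_spend is never touched.
needed = required_nodes(dag, ['spend_per_signup'],
                        inputs={'spend', 'signups'})
```

This is why a driver can materialize a small feature subset cheaply even when the module defines thousands of transforms.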
  29. 29. Q: Why is feature engineering hard?
  30. 30. Q: Why is feature engineering hard? A: Scaling!
  31. 31. Scaling: Served Two Ways } } > Human/team: ● Highly coupled code ● Difficulty reusing/understand work ● Broken/unhealthy production pipelines > Machine: ● Data is too big to fit in memory ● Cannot easily parallelize computation 31 Hamilton helps here! Hamilton has integrations here!
  32. 32. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  33. 33. How Hamilton Helps with Human/Team Scaling 33 Highly coupled code Decouples “functions” from use (driver code).
  34. 34. How Hamilton Helps with Human/Team Scaling 34 Highly coupled code Decouples “functions” from use (driver code). Difficulty reusing/understand work Functions are curated into modules. Everything is unit testable. Documentation is natural. Forced to align on naming.
  35. 35. How Hamilton Helps with Human/Team Scaling 35 Highly coupled code Decouples “functions” from use (driver code). Difficulty reusing/understand work Functions are curated into modules. Everything is unit testable. Documentation is natural. Forced to align on naming. Broken/unhealthy production pipelines Debugging is straightforward. Easy to version features via git/packaging. Runtime data quality checks.
  36. 36. Hamilton functions: Scaling Humans/Teams 36 # client_features.py @tag(owner='Data-Science', pii='False') @check_output(data_type=np.float64, range=(-5.0, 5.0), allow_nans=False) def height_zero_mean_unit_variance(height_zero_mean: pd.Series, height_std_dev: pd.Series) -> pd.Series: """Zero mean unit variance value of height""" return height_zero_mean / height_std_dev Hamilton Features: ● Unit testing ✅ always possible ● Documentation ✅ tags, visualization, function doc ● Modularity/reuse ✅ module curation & drivers ● Central feature definition store ✅ naming, curation, versioning ● Data quality ✅ runtime checks
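The "unit testing ✅ always possible" claim follows directly from the shape of the function on the slide: a transform is an ordinary Python function of its inputs, so a test needs no driver, no DAG, and no real dataset. A minimal sketch, using scalar stand-ins for the `pd.Series` arguments (type hints and decorators omitted for brevity):

```python
def height_zero_mean_unit_variance(height_zero_mean, height_std_dev):
    """Zero mean unit variance value of height."""
    return height_zero_mean / height_std_dev

def test_height_zero_mean_unit_variance():
    # Tiny hand-checkable inputs; a small pd.Series works the same way.
    assert height_zero_mean_unit_variance(4.0, 2.0) == 2.0

test_height_zero_mean_unit_variance()
```

In a real code base this test would live next to `client_features.py` and run under pytest like any other unit test.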
  37. 37. Code base implications: 1. Functions are always in modules. 2. Driver script, i.e execution script, is decoupled from functions. Scaling Humans/Teams 37 Module spend_features.py Module marketing_features.py Module customer_features.py Driver script 1 > Code reuse from day one! > Low maintenance to support many driver scripts Driver script 2 Driver script 3
  38. 38. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  39. 39. Scaling Compute/Data with Hamilton Hamilton has the following integrations out of the box: ● Ray ○ Single process -> Multiprocessing -> Cluster ● Dask ○ Single process -> Multiprocessing -> Cluster ● Pandas on Spark ○ Enables using the Pandas API on Spark with your existing Pandas code easily ● Switching to run on Ray/Dask/Pandas on Spark requires: > Only changing driver.py code* > Pandas on Spark also needs changing how data is loaded. 39
  40. 40. Scaling Hamilton: Driver Only Change 40 # run.py from hamilton import driver import data_loaders import date_features import spend_features config = {...} # config, e.g. data_location dr = driver.Driver(config, data_loaders, date_features, spend_features) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted) save(feature_df, 'prod.features')
  41. 41. Hamilton + Ray: Driver Only Change 41 # run.py from hamilton import driver import data_loaders import date_features import spend_features config = {...} # config, e.g. data_location dr = driver.Driver(config, data_loaders, date_features, spend_features) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted) save(feature_df, 'prod.features') # run_on_ray.py … from hamilton import base, driver from hamilton.experimental import h_ray … ray.init() config = {...} rga = h_ray.RayGraphAdapter( result_builder=base.PandasDataFrameResult()) dr = driver.Driver(config, data_loaders, date_features, spend_features, adapter=rga) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted, inputs=date_features) save(feature_df, 'prod.features') ray.shutdown()
  42. 42. Hamilton + Dask: Driver Only change 42 # run.py from hamilton import driver import data_loaders import date_features import spend_features config = {...} # config, e.g. data_location dr = driver.Driver(config, data_loaders, date_features, spend_features) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted) save(feature_df, 'prod.features') # run_on_dask.py … from hamilton import base, driver from hamilton.experimental import h_dask … client = Client(Cluster(...)) # dask cluster/client config = {...} dga = h_dask.DaskGraphAdapter(client, result_builder=base.PandasDataFrameResult()) dr = driver.Driver(config, data_loaders, date_features, spend_features, adapter=dga) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted, inputs=date_features) save(feature_df, 'prod.features') client.shutdown()
  43. 43. Hamilton + Spark: Driver Change + loader 43 # run.py from hamilton import driver import data_loaders import date_features import spend_features config = {...} # config, e.g. data_location dr = driver.Driver(config, data_loaders, date_features, spend_features) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted) save(feature_df, 'prod.features') # run_on_pandas_on_spark.py … import pyspark.pandas as ps from hamilton import base, driver from hamilton.experimental import h_spark … spark = SparkSession.builder.getOrCreate() ps.set_option(...) config = {...} skga = h_spark.SparkKoalasGraphAdapter(spark, spine='COLUMN_NAME', result_builder=base.PandasDataFrameResult()) dr = driver.Driver(config, spark_data_loaders, date_features, spend_features, adapter=skga) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted, inputs=date_features) save(feature_df, 'prod.features') spark.stop()
  44. 44. Hamilton + Ray/Dask: How Does it Work? 44 # FUNCTIONS def c(a: pd.Series, b: pd.Series) -> pd.Series: """Sums a with b""" return a + b def d(c: pd.Series) -> pd.Series: """Transforms C to ...""" new_column = _transform_logic(c) return new_column # DRIVER from hamilton import base, driver from hamilton.experimental import h_ray … ray.init() config = {...} rga = h_ray.RayGraphAdapter( result_builder=base.PandasDataFrameResult()) dr = driver.Driver(config, data_loaders, date_features, spend_features, adapter=rga) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted, inputs=date_features) save(feature_df, 'prod.features') ray.shutdown() # DAG
  45. 45. Hamilton + Ray/Dask: How Does it Work? 45 # FUNCTIONS def c(a: pd.Series, b: pd.Series) -> pd.Series: """Sums a with b""" return a + b def d(c: pd.Series) -> pd.Series: """Transforms C to ...""" new_column = _transform_logic(c) return new_column # DRIVER from hamilton import base, driver from hamilton.experimental import h_ray … ray.init() config = {...} rga = h_ray.RayGraphAdapter( result_builder=base.PandasDataFrameResult()) dr = driver.Driver(config, data_loaders, date_features, spend_features, adapter=rga) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted, inputs=date_features) save(feature_df, 'prod.features') ray.shutdown() # Delegate to Ray/Dask … ray.remote( node.callable).remote(**kwargs) —---—---—---—---—--- dask.delayed(node.callable)(**kwargs) # DAG
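The delegation shown on the slide — `ray.remote(node.callable).remote(**kwargs)` vs `dask.delayed(node.callable)(**kwargs)` — is an adapter pattern: execution of each node is handed to a pluggable backend. A stdlib-only sketch of that pattern, using `concurrent.futures` as a stand-in for Ray/Dask (the adapter class names and `run` helper here are hypothetical, not Hamilton's API):

```python
import inspect
from concurrent.futures import ThreadPoolExecutor

class LocalAdapter:
    """Default behavior: call the node's function directly."""
    def execute_node(self, func, kwargs):
        return func(**kwargs)

class PoolAdapter:
    """Stand-in for a Ray/Dask adapter: submit the node's callable
    to an executor, mirroring ray.remote(func).remote(**kwargs)
    or dask.delayed(func)(**kwargs)."""
    def __init__(self, pool):
        self.pool = pool
    def execute_node(self, func, kwargs):
        return self.pool.submit(func, **kwargs).result()

def run(funcs, name, inputs, adapter):
    """Walk the function DAG, delegating each node to the adapter."""
    dag = {f.__name__: f for f in funcs}
    def resolve(n):
        if n in inputs:
            return inputs[n]
        func = dag[n]
        kwargs = {p: resolve(p) for p in inspect.signature(func).parameters}
        return adapter.execute_node(func, kwargs)
    return resolve(name)

def c(a, b):
    """Sums a with b"""
    return a + b

with ThreadPoolExecutor() as pool:
    out = run([c], 'c', {'a': 1, 'b': 2}, PoolAdapter(pool))  # 3
```

Because the function definitions never change, swapping the adapter is the only difference between local and distributed execution — which is exactly the "driver only change" shown on the previous slides.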
  46. 46. Hamilton + Spark: How Does it Work? 46 # FUNCTIONS def c(a: pd.Series, b: pd.Series) -> pd.Series: """Sums a with b""" return a + b def d(c: pd.Series) -> pd.Series: """Transforms C to ...""" new_column = _transform_logic(c) return new_column # DRIVER from hamilton import base, driver from hamilton.experimental import h_ray … ray.init() config = {...} rga = h_ray.RayGraphAdapter( result_builder=base.PandasDataFrameResult()) dr = driver.Driver(config, data_loaders, date_features, spend_features, adapter=rga) features_wanted = [...] # choose subset wanted feature_df = dr.execute(features_wanted, inputs=date_features) save(feature_df, 'prod.features') ray.shutdown() # With Spark … Change these to load Spark “Pandas” equivalent object instead. Spark will take care of the rest. # DAG
  47. 47. Hamilton + Ray/Dask/Pandas on Spark: Caveats Things to think about: 1. Serialization: a. Hamilton defaults to serialization methodology of these frameworks. 2. Memory: a. Defaults should work. But fine tuning memory on a “function” basis is not exposed. 3. Python dependencies: a. You need to manage them. 4. Looking to graduate these APIs from experimental status >> Looking for contributions here to extend support in Hamilton! << 47
  48. 48. What is Hamilton? Why was Hamilton created? Hamilton @ Stitch Fix Hamilton for feature engineering ↳ Scaling humans/teams ↳ Scaling compute/data Plans for the future
  49. 49. Hamilton: Plans for the Future 49 ● Async support for an online context ● Improvements to/configurability around runtime data checks ● Integration with external data sources ● Integrations with more OS compute frameworks Modin · Ibis · FlyteKit · Metaflow · Dagster · Prefect ● <your idea here>
  50. 50. Summary: Hamilton for Feature/Data Engineering ● Hamilton is a declarative paradigm to describe data/feature transformations. ○ Embeddable anywhere that runs python. ● It grew out of a need to tame a feature code base. ○ It’ll make yours better too! ● The Hamilton paradigm scales humans/teams through software engineering best practices. ● Hamilton + Ray/Dask/Pandas on Spark enables one to: scale humans/teams and scale data/compute. ● Hamilton is going exciting places! 50
  51. 51. Give Hamilton a Try! We’d love your feedback > pip install sf-hamilton ⭐ on github (https://github.com/stitchfix/hamilton) ☑ create & vote on issues on github 📣 join us on Slack (https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) 51
  52. 52. Thank you. Questions? https://twitter.com/elijahbenizzy https://www.linkedin.com/in/elijahbenizzy/ https://github.com/stitchfix/hamilton elijah.benizzy@stitchfix.com
