Modularized ETL
Writing with Spark
Neelesh Salian
Software Engineer - Stitch Fix
May 27, 2021
Apache Spark has been an integral part of Stitch Fix's compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data-processing needs and has expanded our capabilities in the Data Warehouse.

Since all our writes to the Data Warehouse go through Apache Spark, we took advantage of that to add modules that supplement ETL writing. Config-driven and purposeful, these modules perform tasks on a Spark DataFrame that is destined for a Hive table. They are organized as a sequence of transformations applied to the DataFrame prior to writing.

These include journalizing: a process that maintains a non-duplicated historical record of mutable data associated with different parts of our business.

Data quality, another such module, is enabled on the fly. Using Apache Spark we calculate metrics, and an adjacent service runs quality tests for a table on the incoming data.

Finally, we cleanse data based on provided configurations, then validate and write the data into the warehouse. An internal versioning strategy in the Data Warehouse lets us distinguish new data from old for a table.

Having these modules at write time allows cleaning, validation, and testing of data before it enters the Data Warehouse, programmatically relieving us of most data problems. This talk focuses on ETL writing at Stitch Fix and describes the modules that help our Data Scientists on a daily basis.

Modularized ETL Writing with Spark
Neelesh Salian
Software Engineer - Stitch Fix
May 27, 2021

whoami
Neelesh Salian
Software Engineer - Data Platform

Agenda
▪ What is Stitch Fix?
▪ Apache Spark @ Stitch Fix
▪ Spark Writer Modules
▪ Learnings & Future Work

What is Stitch Fix?
What does the company do?

Stitch Fix is a personalized styling service
Shop at your personally curated store. Check out what you like.

Data Science is behind everything we do
algorithms-tour.stitchfix.com
• Algorithms org
• 145+ Data Scientists and Platform engineers
• 3 main verticals + platform

Apache Spark @ Stitch Fix
How do we use Spark in our teams?

Spark @ Stitch Fix - History and Current State
How it started:
▪ Spark was introduced to enhance and scale ETL capabilities (circa 2016)
▪ Starting version: 1.2.x
▪ Spark SQL was the dominant use case
▪ Used for reading and writing data into the warehouse as Hive tables
How it's going:
▪ Current version: 2.4.x, with 3.1.x in prototyping
▪ Used for all ETL reads and writes, production and test
▪ Spark serves regular PySpark, SQL, and Scala jobs, notebooks, and pandas-based readers-writers
▪ Controls all writing, with more functionality [this talk]

Spark @ Stitch Fix - Spark Tooling
All the tooling available to Data Scientists to use Spark to read and write data:
• Spark SQL + PySpark + Scala
  • Containerized Spark driver + AWS EMR (for compute)
  • Used for production and staging ETL by Data Scientists
• Notebooks
  • JupyterHub setup with Stitch Fix libraries and Python packages pre-installed
  • Used by Data Scientists to test and prototype
• Pandas-based readers-writers
  • Reads and writes data using pandas DataFrames
  • No bootstrap time for Spark jobs - uses Apache Livy for execution
  • Used for test + production

Spark @ Stitch Fix - Writing data to the warehouse

Spark @ Stitch Fix - Steps while writing data
At the start, and even today, writing data through the writer library has these steps:
1. Validation - check the DataFrame for type matches, schema matches to the Hive table, and overflow type checks
2. Writing the data into files in S3 - Parquet or text format, based on the Hive table's configuration
3. Updating the Hive Metastore - with the versioning scheme for the data

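The validation step can be sketched in miniature. This is an illustrative Python sketch of a schema comparison, not the actual writer library code; `validate_schema` and `SchemaMismatchError` are hypothetical names.

```python
# Illustrative sketch of the validation step: check that an incoming
# DataFrame's columns and types match the target Hive table's schema.
# `validate_schema` and `SchemaMismatchError` are hypothetical names.
class SchemaMismatchError(Exception):
    pass

def validate_schema(df_schema, table_schema):
    """Both arguments map column name -> type name (e.g. {"id": "bigint"})."""
    missing = set(table_schema) - set(df_schema)
    if missing:
        raise SchemaMismatchError(f"missing columns: {sorted(missing)}")
    mismatched = {col: (df_schema[col], table_schema[col])
                  for col in table_schema
                  if df_schema[col] != table_schema[col]}
    if mismatched:
        raise SchemaMismatchError(f"type mismatches: {mismatched}")

validate_schema({"id": "bigint", "color": "string"},
                {"id": "bigint", "color": "string"})  # passes silently
```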
Spark @ Stitch Fix - Data Versioning
Writing data into the Data Warehouse with versioning to distinguish old vs. new data. We add the epoch timestamp of the write time to indicate the freshness of the data.
• Writing into a partitioned table (e.g. partitioned by a date_column, for a date value of 20210527):
  s3://<bucket>/<hive_db_name>/<hive_table_name>/date_column=20210527/batch_id=epoch_ts
• Writing into a non-partitioned table:
  s3://<bucket>/<hive_db_name>/<hive_table_name>/batch_id=epoch_ts
We also add the latest write_timestamp to the Hive table metadata, to indicate when the last write was done to the table.

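The versioned path layout above can be sketched as a small helper. This is an illustrative sketch of the path scheme only, not Stitch Fix's actual code; `versioned_path` is a hypothetical name.

```python
import time

def versioned_path(bucket, db, table, partition=None, epoch_ts=None):
    """Build the S3 prefix for a write, versioned by a batch_id epoch timestamp.

    partition is an optional (column, value) pair, e.g. ("date_column", "20210527").
    """
    ts = epoch_ts if epoch_ts is not None else int(time.time())
    base = f"s3://{bucket}/{db}/{table}"
    if partition:
        col, val = partition
        base += f"/{col}={val}"
    return f"{base}/batch_id={ts}"

versioned_path("bucket", "db", "clients", ("date_column", "20210527"), 1622073600)
# -> "s3://bucket/db/clients/date_column=20210527/batch_id=1622073600"
```

Because every write lands under a fresh batch_id prefix, readers can always tell new data from old for a table.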
Since we have a single path to validate and write to the Data Warehouse, what other common functionality could we add to provide more value to our Data Scientists?

Spark Writer Modules
Config-driven transformations while writing data to the Data Warehouse

Spark Writer Modules - Adding modules
How do we add additional functionality to the writing pipeline behind the scenes?
Adding them as transformations in the writer library was straightforward. In addition, we had to:
• Make each module configurable via Spark properties
• Make each module behave the same for every write pipeline
• Make them configurable to either block writing data or not in the event of failure
• Add documentation for each module to help steer Data Scientists

Spark Writer Modules - 3 Modules
The 3 modules we built:
• Journalizer
• Data Cleanser
• Data Quality Checker

Journalizer

Journalizing - Data can change
Example: Data about a client has the potential to change, and we need to capture it.
Note: These are Slowly Changing Dimensions (Type 2) - where we preserve the old values.

Current on 2021-05-21: client_id 10 -> favorite_color: blue, dress_style: formal
Current on 2021-05-22: client_id 10 -> favorite_color: black, dress_style: formal
Current on 2021-05-23: client_id 10 -> favorite_color: purple, dress_style: formal
Current on 2021-07-23: client_id 10 -> favorite_color: green, dress_style: formal

Journalizing - 2 ways of capturing historical information
2 types of Hive tables to store this information:
History Tables
▪ Record of all data - written daily and partitioned by date
▪ Contains all records - duplicated across partitions
▪ Difficult to find nuanced information or track changes in data by date, since all the data is included
▪ Harder to access the data because of the size of the table
Journal Tables
▪ Compressed, de-duped information
▪ Two partitions: is_current = 1 (latest data) & is_current = 0 (old data)
▪ Tracks changing values by timestamp, e.g. sets a start and end date on a value to show its duration of validity
▪ Sorted for easy access by primary key

History Table:
client_id | favorite_color | dress_style | date (partition column)
10        | blue           | formal      | 2021-05-20
10        | blue           | formal      | 2021-05-21
10        | black          | formal      | 2021-05-21
10        | blue           | formal      | 2021-05-22
10        | black          | formal      | 2021-05-22
10        | purple         | formal      | 2021-05-22
...       | ...            | ...         | ...
10        | blue           | formal      | 2021-07-23
10        | black          | formal      | 2021-07-23
10        | purple         | formal      | 2021-07-23
10        | green          | formal      | 2021-07-23

Journal Table:
client_id | favorite_color | start_date                        | end_date                      | is_current (partition column)
10        | blue           | 2021-01-01 (first time recorded)  | 2021-05-20                    | 0
10        | black          | 2021-05-21                        | 2021-05-21                    | 0
10        | purple         | 2021-05-22                        | 2021-07-22                    | 0
10        | green          | 2021-07-23                        | 2999-01-01 (default end time) | 1

Note: Tracking changes to favorite_color across time.

Given the compressed nature of Journal tables, we moved historical data into them. A Journal table is meant to be a ledger of the change in values and a pointer to the current values. Let's now look at how Journal tables are created.

Journalizing - How do we create a journal table?
What do we need to get to the table structure? Some questions we asked ourselves:
1. How could we get easy access to the latest information about a particular key?
2. How can information be compressed and de-duplicated?
3. Can we determine: how long was favorite_color set to <value>?
4. How do we update the table each time to maintain this ordering?
5. Where and when do we run this process of conversion?

Compression / de-dupe:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21
10        | blue           | 2021-05-22
10        | purple         | 2021-05-23

becomes:

client_id | favorite_color | start_date                       | end_date
10        | blue           | 2021-01-01 (first time recorded) | 2021-05-22
10        | purple         | 2021-05-23                       | 2999-01-01 (default end time)

start_date marks when the value became valid; end_date marks when it stopped being valid. The default end time symbolizes the latest value, without a specified end.

Current pointer partition:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21

client_id | favorite_color | start_date                       | end_date                      | is_current
10        | blue           | 2021-01-01 (first time recorded) | 2999-01-01 (default end time) | 1

In a history table, we don't know the changed value, since it's not marked.

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21
10        | blue           | 2021-05-22
10        | purple         | 2021-05-22

client_id | favorite_color | start_date                       | end_date                      | is_current
10        | blue           | 2021-01-01 (first time recorded) | 2021-05-21                    | 0
10        | purple         | 2021-05-22                       | 2999-01-01 (default end time) | 1

purple is now marked as the current value, and blue is moved to the older partition.

Journalizing - Process of Journalizing
1. User creates a Journal table and sets a field to track using metadata, e.g. client_id is set as the primary key
2. When data is written to this table, the table is reloaded in its entirety and we perform:
   a. Deduplication and compression
   b. Setting the current values in partitions - if there are changes
   c. Sorting the table based on the date
3. Rewrite this new DataFrame into the table

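The dedupe/compress in step 2 can be sketched on plain Python rows (the real Journalizer operates on Spark DataFrames). The names, the default end date, and closing a changed value's range at its last observed date are assumptions based on the examples above.

```python
# Illustrative sketch of journalizing: compress daily (key, value, date)
# history rows into ranges with start/end dates and an is_current flag.
# Not the actual Journalizer; names and range-closing rules are assumptions.
DEFAULT_END = "2999-01-01"  # default end time for the current value

def journalize(rows):
    """rows: (client_id, value, date) tuples for one client, sorted by date."""
    journal = []
    for client_id, value, date in rows:
        if journal and journal[-1]["value"] == value:
            journal[-1]["last_seen"] = date      # unchanged value: drop duplicate
            continue
        if journal:                              # value changed: close old range
            journal[-1]["end_date"] = journal[-1]["last_seen"]
            journal[-1]["is_current"] = 0        # move it to the old partition
        journal.append({"client_id": client_id, "value": value,
                        "start_date": date, "last_seen": date,
                        "end_date": DEFAULT_END, "is_current": 1})
    for record in journal:
        del record["last_seen"]
    return journal
```

Running it over the favorite_color history above yields one closed "blue" range and one open "purple" range marked current.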
Journalizing - The workflow

Journalizing - Journal Table Pros & Cons
Pros:
▪ De-duped data
▪ Two partitions for easy querying - is_current = 1 (latest data) & is_current = 0 (old data); a data pipeline needs to access only 1 partition for all the latest values
▪ Compressed, with timestamps to indicate a field value's lifespan and track changes
▪ Sorted for easy access by primary key
Cons:
▪ Complicated process with multiple steps prior to writing
▪ Rewriting the table is a must, to maintain the rules of compression and deduplication

Data Cleanser

Data Cleanser - What and why?
Data can be old, un-referenced, or meant to be excluded. Can we cleanse data based on a configuration?
• How do we make sure some record values don't continue to persist in a table?
• How do we delete records or nullify them consistently throughout the warehouse?
• Can this be configured by the Data Scientists to apply to their tables?

Data Cleanser - What does cleansing mean?
Let's say we wish to nullify/delete some column values in a table:

id | column_a | column_b         | color | style
9  | value_a  | "string_field_1" | blue  | formal
10 | value_a1 | "string_field_2" | red   | casual
11 | value_a2 | "string_field_3" | white | formal

Nullified:
id | column_a | column_b | color | style
9  | null     | null     | blue  | formal
10 | null     | null     | red   | casual
11 | null     | null     | white | formal

OR Deleted:
id | column_a | column_b | color | style
9  | <empty>  | <empty>  | blue  | formal
10 | <empty>  | <empty>  | red   | casual
11 | <empty>  | <empty>  | white | formal

Data Cleanser - Criteria
What does the cleanser have to do?
1. Has to be configurable
2. Users should be able to specify the key to be monitored and the columns for cleansing
3. At least two treatments should be available:
   a. nullify
   b. delete
4. Should happen to data at write time and/or at rest

Data Cleanser - How?
How do we cleanse data?
• How?
  • Perform cleansing at write time, to ensure all future records are cleansed even if the source included them
  • Separately, cleanse the entire Hive table if the data is not used - to make sure older partitions don't retain the un-referenced data
• What do we need?
  • A mechanism to configure what to cleanse - nullify/delete, per table
  • This mechanism needs to be accessible at write time and at rest, to run the cleansing on the data

Data Cleanser - Implementation
Cleansing table configuration: we have a metadata infrastructure that allows users to add metadata to their owned tables.
▪ Hive tables have metadata fields that can be used to store auxiliary information about them
▪ The cleanser could simply access the table's metadata and perform cleansing accordingly. Each table could have a configuration naming the columns to be cleansed, like [column_a, column_b], along with the treatment
▪ Reacting to the specified metadata meant the cleanser module could work as configured at all times
▪ The same module could perform cleansing for data while writing and/or at rest

Data Cleanser - The workflow
How does it come together?
1. User specifies the metadata configuration for cleansing in a Hive table:
   metadata = {"key": "id", "treatment": "nullify", "columns": ["column_a", "column_b"]}
2. The cleanser reads the table and checks all the columns that match
3. It performs nullify/delete on the DataFrame, then proceeds to the next transformation or writes the cleansed DataFrame to the Data Warehouse

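The cleansing step can be sketched on plain Python records (the real module operates on Spark DataFrames). `cleanse` is a hypothetical helper, and treating "delete" as dropping the configured values is an assumption based on the slide's examples.

```python
# Illustrative sketch of config-driven cleansing on plain Python records.
# The metadata shape mirrors the slide; `cleanse` is a hypothetical helper,
# and the "delete" semantics here are an assumption.
def cleanse(rows, config):
    """rows: list of dicts; config: {"key": ..., "treatment": ..., "columns": [...]}"""
    treatment = config["treatment"]
    out = []
    for row in rows:
        row = dict(row)  # copy, so the source records stay untouched
        for col in config["columns"]:
            if col in row:
                if treatment == "nullify":
                    row[col] = None   # keep the column, blank the value
                elif treatment == "delete":
                    del row[col]      # drop the value entirely
        out.append(row)
    return out

config = {"key": "id", "treatment": "nullify", "columns": ["column_a", "column_b"]}
cleanse([{"id": 9, "column_a": "value_a", "column_b": "string_field_1",
          "color": "blue", "style": "formal"}], config)
```

Because the config lives in the table's metadata, the same logic can run at write time or over data at rest.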
Data Cleanser - The workflow

Data Quality Checker

Data Quality - Background
What motivated the data quality initiative?
• How do we detect errors or skews in data?
• When do we check for data problems?
• How do Data Scientists set up Data Quality checks?

Data Quality - What do we need to check data?
What components were needed for running data quality checks?
• A service to initialize and run tests on Hive tables
• A mechanism that calculates metrics, based on the configured tests, on the data prior to writing it to the warehouse
• An interface that allows users to autonomously set up Data Quality and run tests on their pipelines

Data Quality - What would a Test look like?
Some examples of tests that we started off with:
• NullCount(column_name) - Is the null count on this column higher than "value"?
• Average(column_name) - Is the average below what is expected?
• Max(column_name) - Is the max value for this column exceeding a certain limit?
• RowCount(table) - Are we suddenly writing more rows than anticipated?

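A couple of these tests can be sketched over a plain Python column of values (the real checker computes metrics with Spark on the input DataFrame). The helper names and threshold semantics are illustrative assumptions.

```python
# Illustrative sketch of two of the tests above over a plain list of
# column values; helper names and threshold semantics are assumptions.
def null_count(values):
    return sum(v is None for v in values)

def run_tests(values, max_nulls, max_value):
    """Return pass/fail per test, like the checker posts back to the user."""
    return {
        "NullCount": null_count(values) <= max_nulls,                 # too many nulls?
        "Max": max(v for v in values if v is not None) <= max_value,  # outlier values?
    }

run_tests([1, 2, None, 4], max_nulls=1, max_value=10)
# -> {'NullCount': True, 'Max': True}
```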
Data Quality - How we built it
Putting the components together:
• Built a service that:
  • Enabled CRUD operations on tests for Hive tables
  • Had the ability to run tests on metrics when triggered
• At the same time, we built the ability to calculate metrics into a module in the Spark writer library:
  • This module interacted with the data quality service to find the metrics that needed to be calculated
  • Ran these calculations in Spark on the input DataFrame - e.g. average(column_name)
  • Triggered tests on these metrics and posted the results to the user

Data Quality - Surfacing Data Quality to users
What did the interface look like?
1. The data quality service had a Python client that helped users run CRUD operations on tests
2. The writer module could be configured to run on a write operation for a table:
   a. Setting spark.enable.data.quality.checks=true in the Spark properties helped run these tests at write time
3. Separately, we created an offline mode to run tests on already-written data, if the user doesn't wish to block writes to the table

Spark Writer Modules - Transformations in code

def writeDataFrame(inputDataframe: DataFrame,
                   databaseName: String,
                   tableName: String) = {
  // Validation
  val validatedDataframe = sfWriter.validateDataframe(inputDataframe, databaseName, tableName)
  // Journalizing
  val journalizedDataframe = sfWriter.journalizeDataframe(validatedDataframe, databaseName, tableName)
  // Data Cleanser
  val cleansedDataframe = sfWriter.dataCleanser(journalizedDataframe, databaseName, tableName)
  // Data Quality Checker
  sfWriter.dataQualityChecker(cleansedDataframe, databaseName, tableName)
  // Write to the Data Warehouse + update the Metastore
  sfWriter.writeToS3(cleansedDataframe, databaseName, tableName)
}

Learnings & Future Work
What did we learn, and where are we headed?

Learnings & Future Work - Lessons learnt
By adding modularized transformations to data, what changed and how did we adapt?
• Adding new modules meant more complexity in the write pipeline, but each step was doing a valuable transformation
• Making each transformation performant and efficient was a top priority when each module was being created
• Testing - unit & integration - was key in rolling out without mishaps
• Introducing these modules to Data Scientists meant we needed better communication and more documentation
• Getting data quality checks to run efficiently was a challenge, since we had to programmatically calculate the partitions of the DataFrame and run tests against each potential Hive partition; this took some effort to get running smoothly

Learnings & Future Work - Future Work
Now, additional modules can easily be added in a similar fashion.
• Data Quality is being enhanced with support for customized testing, rather than simple thresholds or values; the goal is to have Data Quality ingrained in the ETL process of our Data Science workflows
• The Journalizer and Data Cleanser are mostly static, but we are exploring alternate solutions to help augment and delete records more efficiently

Summary
TL;DR

Summary
Writing data with Spark @ Stitch Fix:
• We have a singular write path to input data into the warehouse, driven by Spark
• 3 config-driven modules perform transformations at write time:
  • Journalizer: writes a non-duplicated historical record of data, for quick access and compression
  • Data Cleanser: deletes or nullifies values based on the table's configuration
  • Data Quality: calculates metrics and runs tests on data coming into the warehouse

Thank you.
Questions?