
Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

Abstract:

Data engineering at Dollar Shave Club has grown significantly over the last year. In that time, it has expanded in scope from conventional web analytics and business intelligence to include real-time, big data, and machine learning applications. We bootstrapped a dedicated data engineering team in parallel with developing a new category of capabilities, and the business value we delivered early on has allowed us to forge new roles for our data products and services in developing and carrying out business strategy. This progress was made possible, in large part, by adopting Apache Spark as an application framework. This talk describes what we have been able to accomplish using Spark at Dollar Shave Club.

Bio:

Brett Bevers, Ph.D. Brett is a backend engineer and leads the data engineering team at Dollar Shave Club. More importantly, he is an ex-academic who is driven to understand and tackle hard problems. His latest challenge has been to develop tools powerful enough to support data-driven decision making in high-value projects.


  1. Joining The Club: Accelerating Big Data with Apache Spark. Dollar Shave Club
  2. Outline: Background on DSC • Engineering at DSC • Growth of Data Team • Show & Tell: Machine Learning Pipeline
  3. Introduction: A David and Goliath Story
  4. Goliath: … of new members
  5. Engineering at DSC • Frontend: Ember.js web apps, iOS and Android apps, HTML email • Backend: Ruby on Rails web backends, internal services (Ruby, Node.js, Golang, Python, Elixir), data and analytics (Python, SQL, Spark) • QA: CircleCI, SauceLabs, Jenkins, TestUnit, Selenium • IT: office and warehouse IT
  6. Engineering at DSC: highscalability.com
  7. Data Engineering at DSC: A David and Big Data Story
  8. Big Data: What is the barrier to entry?
  9. Big Data: What is the barrier to entry? • Requires a different set of capabilities
  10. Big Data: What is the barrier to entry? • Requires a different set of capabilities • Investing resources without an obvious ROI
  11. Big Data: What is the barrier to entry? • Requires a different set of capabilities • Investing resources without an obvious ROI • Knowing where to start
  12. Good Foundations
  13. Data Engineering • Machine learning pipeline • Models served in production • Exploratory analysis • Customer segmentation (clustering) • Hypothesis testing • Data mining • NLP (topic modeling)
  14. Data Engineering • Maxwell + Kafka + Spark Streaming • Streaming data replication • Streaming metrics directly from the data layer
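
A minimal sketch of what a Maxwell + Kafka + Spark Streaming hookup can look like: it reads Maxwell's JSON change events from a Kafka topic and counts row changes per table in each micro-batch. The topic name, broker address, and field names are assumptions for illustration, not details from the talk.

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="maxwell-metrics-sketch")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Maxwell publishes one JSON document per MySQL row change; topic and broker are hypothetical.
    stream = KafkaUtils.createDirectStream(
        ssc, ["maxwell"], {"metadata.broker.list": "kafka:9092"})

    changes = stream.map(lambda kv: json.loads(kv[1]))

    # Example of a metric taken straight from the data layer: row changes per table per batch.
    counts = changes.map(lambda c: (c["table"], 1)).reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()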
  15. Anatomy of a Machine Learning Pipeline
  16. Box Manager Email
  17. Box Manager Email. Problem: order the product tiles in the “Box Manager Email” to maximize profit. Constraints: • Every customer sees some ordered set of products • Do not show products already added to the box
  18. Box Manager Email. Problem: order the product tiles in the “Box Manager Email” to maximize profit. Constraints: • Every customer sees some ordered set of products • Do not show products already added to the box. Result: +25% revenue per email open
  19. Strategy: For each product, model the behavior that best distinguishes someone who buys that product from someone who buys other products; rank a product by the strength of the indicative behavior when present, and rank it randomly otherwise. Model: • Logistic regression • Learns the “tipping point” between success and failure • Success = “buys product X”
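
A hedged sketch of that ranking idea, not DSC's actual code: one binary logistic regression per product ("buys product X" vs. buys other products), with tiles ordered by each model's predicted probability for a customer; a product without a confident signal falls back to a random rank. The column names and the models dict are assumptions.

    import random
    from pyspark.ml.classification import LogisticRegression
    from pyspark.sql import functions as F

    def train_product_model(training_df, product_id):
        # Label is 1.0 when the customer bought this product, 0.0 when they bought others.
        labeled = training_df.withColumn(
            "label", (F.col("purchased_product_id") == product_id).cast("double"))
        lr = LogisticRegression(featuresCol="features", labelCol="label")
        return lr.fit(labeled)

    def rank_products(models, customer_features_df):
        # Score the customer with every product's model; higher P(buy) ranks earlier.
        scores = []
        for product_id, model in models.items():
            row = model.transform(customer_features_df).select("probability").first()
            p_buy = float(row["probability"][1]) if row else random.random()
            scores.append((product_id, p_buy))
        return [pid for pid, _ in sorted(scores, key=lambda t: t[1], reverse=True)]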
  20. Design • Extract data from the data warehouse (Redshift) • Join that data with hand-curated metadata (knowledge base) • Aggregate and pivot events by customer and discretized time • Generate a training set of feature vectors • Select features to include in the final model • Train and productionize the final model
  21. Extract

      def performExtraction(extractorClass, exportName, join_table=None,
                            join_key_col=None, start_col=None,
                            include_start_col=True, event_start_date=None):
          # Build the extractor for one event source and export its rows from Redshift.
          customer_id_col = extractorClass.customer_id_col
          timestamp_col = extractorClass.timestamp_col
          extr_args = extractorArgs(
              customer_id_col, timestamp_col, join_table, join_key_col,
              start_col, include_start_col, event_start_date
          )
          extractor = extractorClass(**extr_args)
          export_path = redshiftExportPath(exportName)
          return extractor.exportFromRedshift(export_path)  # writes to Parquet
  22. Extract

      def exportFromRedshift(self, path):
          # Run the export query, write Parquet, then read it back cached for reuse.
          export = self.exportDataFrame()
          writeParquetWithRetry(export, path)
          return sqlContext.read.parquet(path) \
              .persist(StorageLevel.MEMORY_AND_DISK)

      def exportDataFrame(self):
          # Push the generated SQL down to Redshift via the spark-redshift connector.
          query = self.generateQuery()
          return sqlContext.read \
              .format("com.databricks.spark.redshift") \
              .option("url", urlOption) \
              .option("query", query) \
              .option("tempdir", tempdir) \
              .load()
  23. Domain Knowledge is Critical: the way that an expert organizes and represents facts in their domain. • Guides feature extraction • Prevents overfitting • Vastly superior to unsupervised feature extraction (e.g., PCA)
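
As an illustration of the "hand-curated metadata (knowledge base)" step from the design slide, here is a small sketch of joining expert-maintained product facts onto raw events before featurization. The S3 paths and column names are hypothetical, not from the talk.

    from pyspark.sql import functions as F

    # Hypothetical hand-curated knowledge base: one row of expert-maintained facts per product.
    knowledge_base = sqlContext.read.parquet("s3://dsc-data/knowledge_base/products")

    # Raw purchase events extracted from Redshift (see the Extract slides).
    events = sqlContext.read.parquet("s3://dsc-data/exports/purchases")

    # Domain knowledge shapes the features: curated fields like category become grouping
    # keys, instead of letting the model learn from thousands of raw SKUs.
    enriched = (events
                .join(knowledge_base, on="product_id", how="left")
                .withColumn("month", F.trunc("purchased_at", "month")))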
  24. Aggregate (Shard, Compress, Join) and Pivot! This dance is hard to choreograph
  25. Aggregate (Shard, Compress, Join) and Pivot! This dance is hard to choreograph • 8,736 columns • 2.6 million rows. The DataFrames API is not optimized for extremely wide datasets
  26. Aggregate (Shard, Compress, Join) and Pivot!

      def generateQuery(self):
          # One aggregation query per shard: group events by customer, bucket, and discretized time.
          return """
            {0}
            FROM {1}
            GROUP BY customer_id, {2}, {3}, {4}
          """.format(
              self.selectClause(),
              self._tempTableName,
              self.bucketingExpr(),
              self.timestampCol,
              self.startDateExpr
          )

      def perform(self):
          self.preprocessedDataFrame().registerTempTable(self._tempTableName)
          return sqlContext.sql(self.generateQuery())
  27. Aggregate (Shard, Compress, Join) and Pivot!
  28. Aggregate (Shard, Compress, Join) and Pivot! The dense vector (0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0) becomes the compressed sparse form (18, (6, 16), (1, 2))
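
That compressed form is exactly Spark's sparse vector encoding (size, nonzero indices, nonzero values); a quick illustration:

    from pyspark.mllib.linalg import Vectors

    dense = Vectors.dense([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0])

    # The same 18-element vector stored as (size, nonzero indices, nonzero values).
    sparse = Vectors.sparse(18, [6, 16], [1.0, 2.0])

    assert dense.toArray().tolist() == sparse.toArray().tolist()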
  29. Aggregate (Shard, Compress, Join) and Pivot!

      def perform(self):
          # Pivot each customer's monthly events into a single row of sparse counts.
          keyedMonthlyEvents = self.dataFrame.map(self.keyRow())
          pivotRDD = keyedMonthlyEvents \
              .combineByKey(
                  self.initPivot(),     # create the per-customer accumulator
                  self.pivotEvent(),    # fold one event into the accumulator
                  self.combineDicts()   # merge accumulators across partitions
              ) \
              .map(self.convertToRow()) \
              .persist(StorageLevel.MEMORY_AND_DISK)
          return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema())
  30. Aggregate (Compress, Shard, Join) and Pivot!
  31. Featurize • "Explode" each customer's history into several "windows" of time
  32. Featurize • "Explode" each customer's history into several "windows" of time • Define one or more prediction targets
  33. Featurize • "Explode" each customer's history into several "windows" of time • Define one or more prediction targets • Standardize each historical feature
  34. Featurize • "Explode" each customer's history into several "windows" of time • Define one or more prediction targets • Standardize each historical feature • Persist on S3 as text files of compressed sparse vectors
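
A rough sketch of the last two featurize steps, standardizing each feature and persisting sparse vectors as text on S3. The featureRDD of (label, vector) pairs, the output path, and the LibSVM output format are assumptions about how this could look, not the pipeline's actual code.

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    # featureRDD: (label, SparseVector) pairs produced by the windowing/exploding step (assumed).
    labels = featureRDD.map(lambda lv: lv[0])
    vectors = featureRDD.map(lambda lv: lv[1])

    # Standardize each historical feature (unit variance only; mean-centering would densify the vectors).
    scaler = StandardScaler(withMean=False, withStd=True).fit(vectors)
    scaled = labels.zip(scaler.transform(vectors))

    # Persist on S3 as text files of compressed sparse vectors (LibSVM keeps the sparse encoding).
    points = scaled.map(lambda lv: LabeledPoint(lv[0], lv[1]))
    MLUtils.saveAsLibSVMFile(points, "s3://dsc-data/training/box-manager-email")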
  35. Select Features
  36. Select Features 1. Randomly select a set of new features to test
  37. Select Features 1. Randomly select a set of new features to test 2. Derive training set for new features + previously selected features
  38. Select Features 1. Randomly select a set of new features to test 2. Derive training set for new features + previously selected features 3. Train model
  39. Select Features 1. Randomly select a set of new features to test 2. Derive training set for new features + previously selected features 3. Train model 4. Calculate the p-value for each feature
  40. Select Features 1. Randomly select a set of new features to test 2. Derive training set for new features + previously selected features 3. Train model 4. Calculate the p-value for each feature
  41. Select Features 1. Randomly select a set of new features to test 2. Derive training set for new features + previously selected features 3. Train model 4. Calculate the p-value for each feature 5. Retain significant features 6. Repeat
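
One way that selection loop could be written in PySpark, as a sketch only: it assumes Spark's GeneralizedLinearRegression with a binomial family is used to obtain per-coefficient p-values (plain LogisticRegression does not expose them), and candidate_features, buildTrainingSet, and the 0.05 threshold are assumptions.

    import random
    from pyspark.ml.regression import GeneralizedLinearRegression

    selected = []                      # features retained so far
    remaining = list(candidate_features)
    ALPHA = 0.05                       # significance threshold (assumed)

    while remaining:
        # 1. Randomly select a set of new features to test.
        batch = random.sample(remaining, min(20, len(remaining)))

        # 2. Derive a training set for the new features + previously selected features.
        train_df = buildTrainingSet(selected + batch)  # hypothetical helper

        # 3. Train the model (binomial GLM so the training summary exposes p-values).
        glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                          featuresCol="features", labelCol="label")
        model = glr.fit(train_df)

        # 4. Calculate the p-value for each feature (the last entry is the intercept).
        p_values = model.summary.pValues[:-1]

        # 5. Retain significant features; 6. repeat with the rest.
        for name, p in zip(selected + batch, p_values):
            if name in batch and p < ALPHA:
                selected.append(name)
        remaining = [f for f in remaining if f not in batch]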
  42. Production Model • Spark ML makes parameter tuning easy • Reusable modules!
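
For the parameter-tuning point, a minimal sketch using Spark ML's ParamGridBuilder and CrossValidator around a logistic regression; the grid values and the evaluator are illustrative choices, not the ones used at DSC.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # Grid of regularization settings to search; values are illustrative.
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
            .build())

    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                        numFolds=3)

    best_model = cv.fit(train_df).bestModel   # train_df: the featurized training set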
  43. brett.bevers@dollarshaveclub.com http://app.jobvite.com/m?33KSgiwI
