Building A Feature Factory

This talk covers building, managing, and maintaining thousands of features across thousands of models. Building features can be repetitive, tedious, and extremely challenging to scale. We will explore the ‘Feature Factory’ built at Databricks and implemented at several clients, along with the processes that are imperative for the democratization of feature development and deployment. The feature factory lets consumers ensure repeatable feature creation, simplifies scoring, and enables massive scalability through feature multiplication.

  2. Building A Feature Factory. Daniel Tomes, Databricks. #UnifiedDataAnalytics #SparkAISummit
  3. Me: Norman, OK (undergrad at OU, a Sooner; master's at OK State); ConocoPhillips; Raleigh, NC; Cloudera; Databricks.
  4. Retail: Channel, Sales, Costs, Revenue, Employee, Vendor, Coupon, Campaign. Farming/Oil: Geospatial, IoT, Weather, Customer, Equipment, Service, Gov. Fraud: Pattern, Manipulation, Market, Purchases, Contract, Social, Interaction. Ads: Campaign, Bids, Social. Marketing: Partners, Sales, Customer, Region, Competitor, Loyalty. InfoSec: Intrusions, Octets, ISPs. General: Weather, Geo, Municipal, Market, Competitor.
  5. Implement Through Metrics. Metrics are easy: fewer, better defined, better documented, and likely to have been describing the business for years. Grouping & multiplying concepts = feature engineering.
  6. Measure vs Metric vs Feature
     • Measure: numbers or values that can be summed and/or averaged, such as sales, leads, distances, durations, temperatures, and weight.
     • Metric: a quantifiable measure that is used to track and assess the status of a specific process.
     • Feature: an individual measurable property or characteristic of an observation; a raw, aggregated, or altered metric that can provide predictive power in pattern recognition, classification, and regression.
  7. Measure vs Metric vs Feature (same definitions), with an example. Measure: 31.
  8. Measure vs Metric vs Feature (same definitions), with examples. Measure: 31. Metric: +31 Country Code.
  9. Measure vs Metric vs Feature (same definitions), with examples. Measure: 31. Metric: +31 Country Code. Feature: .002428571.
  10. How It Goes (roles shown: Data Scientist, Data Engineer)
      • Identify metrics & features; write code that writes code; join, union, agg; optimize.
      • Identify data scope and scale; understand target if applicable; scope down to relevant data; scope up to include more data; explore available data; understand data models; understand business rules.
      • Modeling Data: Filtering; Twisting (Sales × Time Ranges); Tweaking (Scaling/Binning); Clustering/PCA/Correlation; Pearson/Outlier; Model Stacking; Data Leaks; Model Tuning; Evaluation.
  11. Feature Factory (roles shown: Data Scientist, Data Engineer)
      • Identify data scope and scale; understand target if applicable; scope down to relevant data; scope up to include more data; explore available data; understand data models; understand business rules.
      • Modeling Data: Filtering; Twisting (Sales × Time Ranges); Tweaking (Scaling/Binning); Clustering/PCA/Correlation; Pearson/Outlier; Model Stacking; Data Leaks; Model Tuning; Evaluation.
      • Identify metrics & features; write code that writes code; join, union, agg; optimize.
  12. End Result (diagram labels): Concept, Feature Set, Feature Family, Feature Factory, Base DF, Magic Sauce.
  13. Why A Feature Factory: rapidly prototype and deliver 1000s of features. Build them all and let science decide: univariate selection algorithms, feature importance models (XGBoost), correlation matrices, high-dimensional PCA.
  14. Why A Feature Factory: feature reusability; consistent logic (joins and formulas); optimized feature generation; process documentation (finally!); scalable (10K+ features).
  15. What Is A Feature Factory: a code base (APIs); an accelerator (configurable, not OEM); extensible & customizable… incomplete.
  16. How It Works: land the scaffolding; gut the demo; structure, configure your concepts; initialize your data and your metrics.
  17. Abstract Architecture
  18. Concrete Example (TPC-DS): Store Sales, Catalog Sales, Web Sales.
  19. Concrete Example (TPC-DS): Store Returns, Catalog Returns, Web Returns.
  20. TPC-DS Architecture: Store, Web, Catalog.
  21. Implemented Architecture: Master Concept, Implemented Concept, Feature Family, Feature.
  22. Implementation. Process: (1) define the concept (channel); (2) implement the concepts; (3) build feature families; (4) implement features. Callouts on the slide: 1. Rename & Define Channel; 2. Implement Store; 3. Implement Feature Family; 4. Implement Features.
  23. Feature - Definition: Store -> Sales -> Feature.
  24. Highlights - Multipliers
      • Feature Families: Sales, Customer, Weather, Geo.
      • Multipliers: Time, Categorical, Trends.
      • Example feature names: Total_Sales_6m_Sunny_Category-MensShoes and Total_Customers_3m_GeoRange_CheckoutMethod-Self, with the name parts labeled Base Metrics (Sales/Customer), Time Window (Multiplier), Base Metric (Weather/Geo), and Categorical (Category).
  25. Highlights - Multipliers
      • Sales Metrics (8); Customer Metrics (8).
      • Time Windows: 1m, 3m, 6m, 12m, 1w, 2w, 3w, 4w.
      • Categorical Dims: Item Category (8), Demographics (12).
      • 8 * 9 * 8 * 8 * 12 = 55,296 possible features in fewer than 20 lines of code.
      • Common example: 8 sales metrics * 4 time windows * 5 dims with an average of 12 distinct values each: 8 * 4 * 5 * 12 = 1,920 features. Send to a feature importance/selection process and pick the top n.
  26. Highlights - Joiners/Groupers: automatic aggs/joins when features need them; accurate & optimized; once & only once.
  27. Highlights - Canned Data: Where is the relevant data? Just browse the data related to the concept.
  28. Highlights - Canned Data: Where is the relevant data? Just browse the data related to the concept.
  29. Highlights - Date/Time Manager: unified time definition (define it once and be done with it); simplified filtering; time-based splits (ML).
  30. Highlights - Configs: Time Helpers, Canned Data, Joiners. Mutable; automatic; shareable (save & load); enables repeatability; transparent; extensible.
  31. Highlights - Easy Docs: add the docs to the metrics; add the docs to the multipliers; features are now self-documenting.
  32. DEMO
  33. OPEN SOURCE
  34. Questions
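
The multiplier idea on slides 24 and 25 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Feature Factory code base: `multiply_features`, the constant names, and the sample metric and dimension values are all hypothetical, and the naming pattern only loosely follows the slide's `Total_Sales_6m_..._Category-MensShoes` example (the weather/geo part is left out). The two products at the end recompute the counts quoted on slide 25.

```python
from itertools import product
from math import prod

# All names below are illustrative assumptions, not the Feature Factory API.
BASE_METRICS = ["Total_Sales", "Total_Customers"]
TIME_WINDOWS = ["1m", "3m", "6m", "12m"]
CATEGORICAL_DIMS = {
    "Category": ["MensShoes", "WomensShoes"],
    "CheckoutMethod": ["Self", "Cashier"],
}


def multiply_features(metrics, windows, dims):
    """Return one feature name per (metric, window, dimension value) combination."""
    names = []
    for metric, window in product(metrics, windows):
        for dim, values in dims.items():
            for value in values:
                names.append(f"{metric}_{window}_{dim}-{value}")
    return names


feature_names = multiply_features(BASE_METRICS, TIME_WINDOWS, CATEGORICAL_DIMS)

# Counts quoted on slide 25, recomputed as plain products.
all_multipliers = prod([8, 9, 8, 8, 12])
common_example = prod([8, 4, 5, 12])

print(len(feature_names), all_multipliers, common_example)
```

With the sample inputs, 2 metrics, 4 windows, and 4 dimension values give 32 names; adding one more time window or dimension value multiplies the total rather than adding to it, which is why a short definition can fan out into thousands of candidate features.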
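
Slide 29 lists a unified time definition, simplified filtering, and time-based splits. The sketch below shows one way those three ideas could fit together; it is an assumption-laden illustration rather than the talk's Date/Time Manager. `TimeWindow`, `time_based_split`, the window lengths (a month approximated as 30 days), and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TimeWindow:
    """A named lookback window measured back from a reference date (hypothetical helper)."""
    name: str
    days: int

    def contains(self, day: date, as_of: date) -> bool:
        # Covers the `days` days up to and including `as_of`.
        return as_of - timedelta(days=self.days) < day <= as_of


# Defined once and reused everywhere; a month is approximated as 30 days (assumption).
WINDOWS = [TimeWindow("1w", 7), TimeWindow("1m", 30), TimeWindow("3m", 90)]


def time_based_split(rows, cutoff):
    """Split rows by date: strictly before `cutoff` is train, the rest is test."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


# Made-up sample rows for illustration only.
rows = [{"date": date(2019, 10, d), "sales": 10 * d} for d in (1, 8, 15, 22)]
as_of = date(2019, 10, 22)

in_last_week = [r for r in rows if WINDOWS[0].contains(r["date"], as_of)]
train, test = time_based_split(rows, cutoff=date(2019, 10, 15))
```

Keeping the window definitions in one place means every feature filtered by "1w" or "3m" uses the same boundaries, and splitting by date rather than at random keeps later observations out of the training set.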
