Frame - Feature Management for Productive Machine Learning

Presented at the ML Platforms Meetup at Pinterest HQ in San Francisco on August 16, 2018.

Abstract: At LinkedIn we observed that much of the complexity in our machine learning applications was in their feature preparation workflows. To address this problem, we built Frame, a shared virtual feature store that provides a unified abstraction layer for accessing features by name. Frame removes the need for feature consumers to deal directly with underlying data sources, which are often different across computing environments. By simplifying feature preparation, Frame has made ML applications at LinkedIn easier to build, modify, and understand.

  1. Frame – Feature Management for Productive Machine Learning • ML Platforms Meetup – 16 August 2018 • David Stein – LinkedIn
  2. Agenda • Problem and Motivation • Solution Overview • Impact and Challenges
  3. Frame – Virtual Feature Store • What • Abstraction layer for feature access • Unified across environments and data sources • Applications get features “by name” in a global namespace • Why • Removes application’s need to deal with differences across environments and across data sources • Facilitates sharing features across applications • Allows applications to be easier to build, modify, and understand
  4. Problem Overview
  5. Machine Learning Productivity • NYT article estimates 50-80% of data scientists’ time spent on menial tasks, a.k.a. data munging [1] • Why? • Can we automate data munging? • We explored our systems at LinkedIn to find ways to address this problem. [1] https://nyti.ms/2kDSVci
  6. Much of the complexity in ML systems is in feature preparation workflows
  7. Problem: Feature Preparation Is Tedious • Many data sources for features • Applications require custom code (Spark/Pig/etc.) to glue them together • Data sources are different across environments (online/offline/search) • Different APIs (fetching vs joining) make it hard to share code • Applications must coordinate their online/offline flows • Much of the complexity in ML systems is in feature preparation workflows
  8. Feature Preparation • Many data sources for features • Applications JOIN and massage the data using Spark/Pig/etc.
  9. Feature Preparation • Many data sources for features • Applications JOIN and massage the data using Spark/Pig/etc. • Data sources are different across environments (online/offline)
  10. Feature Preparation • Many data sources for features • Applications JOIN and massage the data using Spark/Pig/etc. • Data sources are different across environments (online/offline) • Each application needs to repeat work to adapt to the data sources
  11. Complexity is Costly • Complexity makes systems difficult to modify • Example: in some applications, adding a feature to a model required significant effort • Adding new features should be simple. • How do we make it easy in practice?
  12. Solution Overview
  13. Virtual Feature Store • Goal: Hide the underlying storage from the ML application
  14. Virtual Feature Store • Goal: Hide the underlying storage from the ML application • No need to materialize all features into a "true" feature store • Use existing data sources • Define name mappings via configuration
  15. Virtual Feature Store • Goal: Hide the underlying storage from the ML application • No need to materialize all features into a "true" feature store • Use existing data sources • Define name mappings via configuration • Provide abstraction over: • Online/offline differences • Differences across data sets
  16. Key Idea • Enable users to access features by name • Users should be able to specify WHAT features they need, without specifying HOW / FROM WHERE • Goal: Access should be as simple as an import statement, like: import com.linkedin.member.profile.Title
  17. Analogy: Software Package Management
  18. Solution Overview • Every feature has a name in a global namespace • Applications (consumers) reference/access features by name. • Platform sets up the join/fetch automatically • Feature owners (producers) define anchors for their features, in each environment where they are needed • Offline anchors point to HDFS data sets • Online anchors point to REST services and DB tables • All of this via simple configs
  19. An anchor defines HOW and FROM WHERE a feature gets accessed in a given environment
  20. Anchors • Defined via configuration • Offline anchors point to Hadoop HDFS paths • Online anchors point to databases and RPC services
  21. Separate Feature Definition From Usage
      Feature owner specifies HOW / FROM WHERE:
      • FeatureA • comes from hdfs://foo/bar/baz • extract field1
      • FeatureB • comes from hdfs://abc/def/xyz • extract log(field2 * field3)
      • FeatureC • …
      Feature consumer specifies WHAT feature names:
      • “I need FeatureA, FeatureC, FeatureZ, ...”
      • “I need FeatureC, FeatureD, FeatureX, ...”
      • “I need FeatureX, FeatureY, FeatureZ, ...”
  22. Different Anchors for Different Environments (Different Anchors for Online Environment)
      Feature owner specifies HOW / FROM WHERE:
      • FeatureA • comes from DB table “abcd” • extract fieldX
      • FeatureB • comes from REST service “foo” • extract log(fieldY * fieldZ)
      • FeatureC • …
      Feature consumer specifies WHAT feature names:
      • “I need FeatureA, FeatureC, FeatureZ, ...”
      • “I need FeatureC, FeatureD, FeatureX, ...”
      • “I need FeatureX, FeatureY, FeatureZ, ...”
  23. Architecture Overview
      Abstract Layer • Common feature namespace • Feature type system
      Environment-specific Engines • Tools for accessing features by name in each environment • Anchors – configs that define how a feature gets loaded in a given environment
  24. Common Feature Namespace • Features are defined to have the following properties: • Feature Name: e.g. member_profile_skills • Entity Domain: e.g. MEMBER • Feature Type: e.g. CATEGORICAL
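      For illustration only: a minimal sketch of how a feature with these properties might be declared, written in the same config style as the anchor examples on slides 30-31; the field names and layout are assumptions, not Frame's actual metadata schema.
      member_profile_skills: {
        domain: "MEMBER"        # entity the feature is keyed on
        type: "CATEGORICAL"     # feature type from Frame's type system
      }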
  25. Environment-Specific Engines • Join • Fetch
  26. Frame-Offline • Clients add FeatureJoinJob (Spark-based) as a stage in their Hadoop workflows • FeatureJoinJob is configured by a “join config,” which lists the names of required features • Executes LEFT OUTER JOIN against client’s observation data
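      The deck does not show a join config itself; the sketch below is a hypothetical illustration in the same config style as the anchor examples on slides 30-31. The path, key, and feature names are assumptions, not Frame's actual syntax.
      joinConfig: {
        observationData: "hdfs://path/to/observations"   # left side of the LEFT OUTER JOIN (client's observation data)
        key: "memberId"                                   # join key present in the observation data
        features: ["featureA", "featureB", "featureC"]    # features requested by name from the global namespace
      }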
  27. Feature Join
  28. Frame-Online • “Thick client” library executes fetch requests against configured data sources/services (Databases, REST services) • Client code requests features by name, e.g. “Fetch features ‘featureA, featureB, featureC’ for memberId:123.”
  29. Feature Modules • Feature configs managed like code modules • Client module/service gets an assembled composite of dependency configs plus optional local config (via a build script plugin) • Enables sharing across projects
  30. Anchors Offline – HDFS
      {
        source: "hdfs://path/to/my/data/#LATEST"
        key: "memberId"
        features: {
          featureA: "(field1 > 0.5)"
          featureB: "(field1 >= 1.0) && (field3 == 'foo')"
          featureC: "toCategorical(field3)"
          featureD: "toCategorical(field3 + '.' + field7)"
        }
      }
      Multi-key features also supported
  31. Anchors Online
      {
        source: "jobPosting_inferred_data"
        features: {
          jobPosting_inferred_topSkills: "([$.skill_id : $.score] in topSkills)"
        }
      }
      "jobPosting_inferred_data": {
        dbName: "some_db"
        tableName: "some_table"
        keyExpr: "createUrn('job', key[0])"
      }
  32. Impact & Challenges
  33. Early Victories • Multiple LinkedIn ML projects onboarded • Users report that Frame reduces feature experimentation time by more than one half
  34. Observations • Frame’s auto-planned JOINs often outperform custom workflows • Simplified workflows lead to fewer bugs, metrics lift
  35. Challenges • How to help users discover useful features • How to scale when users ask for too many features: “Import all of LinkedIn’s features into my model, please!” • How to estimate costs of JOINs and fetch queries, and make costs transparent to users • How to automatically decide what features should be materialized for efficiency, based on usage trends
  36. Improve machine learning productivity by simplifying feature management
  37. Thank you
