The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Production

The ‘feature store’ is an emerging concept in data architecture, motivated by the challenge of productionizing ML applications. Rapid iteration in experimental, data-driven research applications creates new challenges for data management and application deployment.

  1. KILLER FEATURE STORE: USING SPARKML PIPELINES AND MLFLOW
     Nathan Buesgens, Accenture Applied Intelligence
  2. Agenda
     • Definitions of a Feature Store: A clear need, many approaches.
     • The Feature Flow Algorithm: ML pipeline orchestration.
     • The ML Pipeline Mesh: Governance and automation.
  3. DEFINITIONS OF A FEATURE STORE
  4. ML LIFECYCLE SUCCESS CRITERIA
     • VALIDATE BUSINESS HYPOTHESIS: A positive experimental result creates KPI lift in production.
     • NEW BUSINESS INSIGHT: Regardless of production results, new business insights are captured and made discoverable (with a feature store). This accelerates future experimentation.
  5. featurestore.org
  6. FEATURE STORES: THREE APPROACHES TO AUTOMATION
     Feature Store Approaches
     • Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
     • Feature “Modelling”: Automating ETL/Feature Engineering
     • Feature “Orchestration”: Automating ML Pipeline Construction
  7. FEATURE STORES: THREE APPROACHES TO AUTOMATION
     Feature Store Approaches
     • Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
     • Feature “Modelling”: Automating ETL/Feature Engineering
     • Feature “Orchestration”: Automating ML Pipeline Construction
     Callout:
     • Most common approach.
     • Data access pattern for ML pipelines.
     • Generally, post “feat. engineering”.
     • Supplement Data Governance with DS semantics.
  8. TRAIN/TEST Data Science Semantics
     Extending the Data Governance Framework: An Example
  9. TRAIN/TEST Data Science Semantics
     Extending the Data Governance Framework: An Example
     [Diagram: “preprocessed” sales data → Customer Segmentation (Train/Test Split, training data, test data, … ML …) → customer segment features]
  10. TRAIN/TEST Data Semantics
      Extending the Data Governance Framework: An Example
      [Diagram: “preprocessed” sales data → Sales Prospect Segmentation (Train/Test Split, training data, test data, … ML …) → prospect segment features → Next Best Action (Assemble Features, Train/Test Split, training data, test data)]
  11. TRAIN/TEST Data Semantics
      Extending the Data Governance Framework: An Example
      [Diagram: “preprocessed” sales data → Sales Prospect Segmentation (Train/Test Split, training data, test data, … ML …) → prospect segment features → Next Best Action (Assemble Features, Train/Test Split, training data, test data)]
      WHAT’S WRONG WITH THIS PICTURE?
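
The slide leaves its question open. One plausible reading, which is an assumption here rather than the speaker's stated answer, is that the two pipelines in the diagram each perform their own train/test split, so records used to train the segmentation model can reappear in the downstream test set. The sketch below illustrates that overlap with plain Python sets; the id range, split sizes, seeds, and the `train_test_split` helper are all hypothetical.

```python
import random


def train_test_split(ids, test_size, seed):
    """Shuffle ids with a fixed seed and return (train, test) as sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[test_size:]), set(shuffled[:test_size])


prospect_ids = range(1000)  # hypothetical record ids

# Each pipeline splits the same records independently (as drawn on the slide).
seg_train, seg_test = train_test_split(prospect_ids, test_size=200, seed=1)
nba_train, nba_test = train_test_split(prospect_ids, test_size=300, seed=2)

# Records the segmentation model was trained on that the downstream
# model is later evaluated on.
leaked = seg_train & nba_test
print(f"{len(leaked)} downstream test records were seen during upstream training")

# Recording the split once and reusing it in both pipelines avoids the overlap.
shared_train, shared_test = train_test_split(prospect_ids, test_size=200, seed=1)
print(f"overlap with a shared split: {len(shared_train & shared_test)}")
```

Treating the split assignment as governed metadata, rather than something each pipeline recomputes, is one way to read the slide's point about extending data governance with data science semantics.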
  12. FEATURE STORES: THREE APPROACHES TO AUTOMATION
      Feature Store Approaches
      • Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
      • Feature “Modelling”: Automating ETL/Feature Engineering
      • Feature “Orchestration”: Automating ML Pipeline Construction
  13. FEATURE STORES: THREE APPROACHES TO AUTOMATION
      Feature Store Approaches
      • Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
      • Feature “Modelling”: Automating ETL/Feature Engineering. AutoML. Key Stakeholder: Citizen Scientist
      • Feature “Orchestration”: Automating ML Pipeline Construction
  14. FEATURE STORES: THREE APPROACHES TO AUTOMATION
      Feature Store Approaches
      • Feature “Ops”: Automating Feature Data Delivery to ML Pipelines
      • Feature “Modelling”: Automating ETL/Feature Engineering. AutoML. Key Stakeholder: Citizen Scientist
      • Feature “Orchestration”: Automating ML Pipeline Construction. “Feature Flow”. Key Stakeholder: ML Engineer
  15. THE FEATURE FLOW ALGORITHM
  16. MANAGE ML PIPELINES (not just models)
  17. ML Pipeline Review
      source: https://spark.apache.org/docs/latest/ml-pipeline.html

          from pyspark.ml import Pipeline
          from pyspark.ml.classification import LogisticRegression
          from pyspark.ml.feature import HashingTF, Tokenizer

          # Prepare training documents from a list of (id, text, label) tuples.
          training = spark.createDataFrame([…])

          # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
          tokenizer = Tokenizer(inputCol="text", outputCol="words")
          hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
          lr = LogisticRegression(maxIter=10, regParam=0.001)
          pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

          # Fit the pipeline to training documents.
          model = pipeline.fit(training)

          # Prepare test documents, which are unlabeled (id, text) tuples.
          test = spark.createDataFrame([…], ["id", "text"])

          # Make predictions on test documents.
          prediction = model.transform(test)

  18. ML Pipeline Review
      source: https://spark.apache.org/docs/latest/ml-pipeline.html

          from pyspark.ml import Pipeline
          from pyspark.ml.classification import LogisticRegression
          from pyspark.ml.feature import HashingTF, Tokenizer

          # Prepare training documents from a list of (id, text, label) tuples.
          training = spark.createDataFrame([…])

          # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
          tokenizer = Tokenizer(inputCol="text", outputCol="words")
          hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
          lr = LogisticRegression(maxIter=10, regParam=0.001)
          pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

          # Fit the pipeline to training documents.
          model = pipeline.fit(training)

          # Prepare test documents, which are unlabeled (id, text) tuples.
          test = spark.createDataFrame([…], ["id", "text"])

          # Make predictions on test documents.
          prediction = model.transform(test)

      What does this line do for me (as an engineer)?
  19. FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE INFERENCE
      Feature Flow takes pipeline stages as input, builds a graph, then sorts the stages topologically. The graph is built by iteratively inferring the stages that must be added to the pipeline to produce the necessary features.

      THE “MONOLITHIC” PIPELINE (THE OLD WAY)
      [Diagram: Tokenize → TFIDF → Sentiment Est.]

          tokenize = ...
          tfidf = ...
          sentiment = ...
          pipeline = Pipeline(stages=[tokenize, tfidf, sentiment])

      [Diagram: Tokenize → TFIDF → Toxicity Est.]

          tokenize = ...
          tfidf = ...
          toxicity = ...
          pipeline = Pipeline(stages=[tokenize, tfidf, toxicity])

      FEATURE STAGE DEPLOYMENTS (THE NEW WAY)
      [Diagram, built up over several animation steps: each stage is deployed separately with its input and output features, and the stages are then joined wherever one stage's output feature matches another's input.]
      • Tokenize: produces tokens
      • TFIDF: tokens → vectors
      • Sentiment Est.: vectors → sentiment
      • Toxicity Est.: vectors → toxicity
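
The inference-then-sort idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the `STAGES` registry, the `text` source column, and the `infer_stages` and `build_pipeline` helpers are hypothetical, and stage names stand in for real Spark ML stages. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical registry of deployed stages: name -> (input features, output feature).
STAGES = {
    "Tokenize": (["text"], "tokens"),
    "TFIDF": (["tokens"], "vectors"),
    "Sentiment Est.": (["vectors"], "sentiment"),
    "Toxicity Est.": (["vectors"], "toxicity"),
}


def infer_stages(targets, available):
    """Walk back from the requested features, collecting the stages that produce them.

    Returns a graph mapping each required stage to the stages it depends on.
    """
    producers = {output: name for name, (_, output) in STAGES.items()}
    graph = {}
    pending = list(targets)
    while pending:
        feature = pending.pop()
        if feature in available:
            continue  # already present in the source data
        stage = producers[feature]  # KeyError if no deployed stage produces it
        if stage in graph:
            continue  # stage already added
        graph[stage] = set()
        for needed in STAGES[stage][0]:
            if needed not in available:
                graph[stage].add(producers[needed])
                pending.append(needed)
    return graph


def build_pipeline(targets, available=("text",)):
    """Return the inferred stage names in dependency order."""
    graph = infer_stages(targets, set(available))
    return list(TopologicalSorter(graph).static_order())


print(build_pipeline(["sentiment"]))
print(build_pipeline(["sentiment", "toxicity"]))
```

Requesting both `sentiment` and `toxicity` yields a single pipeline in which `Tokenize` and `TFIDF` appear once, which is the sharing that the separately deployed stages are meant to enable. In a real setup the ordered names would be mapped to stage objects and passed to `Pipeline(stages=[...])`.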
  20. FEATURE FLOW ORCHESTRATION ALGORITHM: FEATURE LINEAGE
      Feature Flow gives us the tools to experiment with subsets of our pipeline. The graph gets more complex when we are evaluating multiple strategies that create the same features. To manage multiple possible traversals of the graph, we maintain a lineage of each feature.

      AN EXAMPLE STAGE WITH MULTIPLE STRATEGIES
      • Tokenize: produces tokens
      • TFIDF: tokens → vectors
      • Word2Vec: tokens → vectors
      • Sentiment Est.: vectors → sentiment
      • Toxicity Est.: vectors → toxicity

      FIRST, BUILD THE GRAPH
      [Diagram nodes: Tokenize, TFIDF, Word2Vec, Sentiment Est., Toxicity Est.]

      THEN, ELIMINATE ALL NODES WITH MULTIPLE INCOMING EDGES PER FEATURE.
      And, replace with nodes for the product of all incoming features.
      Feature: vectors
      [Diagram nodes: Tokenize, TFIDF, Word2Vec, Sentiment Est. (TFIDF), Sentiment Est. (Word2Vec), Toxicity Est. (TFIDF), Toxicity Est. (Word2Vec)]
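
One way to picture the lineage step is to enumerate one pipeline variant per way of producing a feature, so that a stage fed by two strategies becomes two nodes, as in the "Sentiment Est. (TFIDF)" and "Sentiment Est. (Word2Vec)" nodes above. The sketch below is an assumption-laden illustration rather than the Feature Flow implementation: the `STAGES` registry and the `lineages` helper are hypothetical, the graph is assumed to be acyclic, and it does not reconcile stages whose separate inputs reach the same feature through different strategies.

```python
from itertools import product

# Hypothetical registry: name -> (input features, output feature).
# TFIDF and Word2Vec are two strategies for the same "vectors" feature.
STAGES = {
    "Tokenize": (["text"], "tokens"),
    "TFIDF": (["tokens"], "vectors"),
    "Word2Vec": (["tokens"], "vectors"),
    "Sentiment Est.": (["vectors"], "sentiment"),
    "Toxicity Est.": (["vectors"], "toxicity"),
}


def lineages(feature, sources=frozenset({"text"})):
    """Return every ordered list of stages that can produce `feature`."""
    if feature in sources:
        return [[]]  # source data needs no stages
    results = []
    for stage, (inputs, output) in STAGES.items():
        if output != feature:
            continue
        # Choose one upstream lineage per input; product() covers all combinations.
        for combo in product(*(lineages(i, sources) for i in inputs)):
            path = []
            for upstream in combo:
                path += [s for s in upstream if s not in path]
            results.append(path + [stage])
    return results


for variant in lineages("sentiment"):
    print(" -> ".join(variant))
```

Each printed variant is a candidate pipeline that can be fitted and compared independently, which matches the slide's goal of experimenting with subsets of the pipeline.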
  21. THE ML PIPELINE MESH
  22. SEPARATE CONCERNS OF ALGORITHMIC DESIGN FROM OPERATIONS
      • Deployment Automation and Runtime Management
      • Metadata Management and Discovery
      • ML Pipeline Governance
  23. Demo
  24. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
