
Dataiku productive application to production - pap is may 2015



Beyond Predictive Analytics: Deploying apps to production and keeping them improving

Some smart companies have been putting predictive applications in production for decades. Yet, whether for lack of sharing or lack of generality, there is still no single, obvious way to put a predictive application in production today.

As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.

Behind the single word “production” lies a great number of questions. What exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve and check that it delivers the promised business value over time? What are the best practices for maintenance and updates? Will my data scientists keep working after the first development, or should I lay half of them off? Etc.

Let’s make a small analogy with the development of websites in the 90s and early 00s:
Back then, the winners were not necessarily the websites with an amazing design; a winner had clearly made the necessary effort and had a robust way to put its website reliably in production.

Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap … and so we forget. How long before we get the same comfort in the predictive world?



  1. Our Goal Today: Imagine how predictive applications will be put in production 5 years from now. How are we doing today? What is difficult? What should be simpler?
  2. What is a predictive application? Churn Prevention, Fraud Detection, Demand Forecast, Targeting, Maintenance, Match Making, Ad Bidding, Drug Studies, Pricing, Ranking
  3. This discussion is not relevant to all. Churn Maintenance Drug Studies Multi-Years Multi-Years Multi-Years Weekly Weekly Yearly Bidding Two Weeks Sub-Second Data Span Retrain every … Score every … Yearly Day Monthly Monthly Production = Dev Online Learning
  4. Not just a “model”: Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision, Data Collection. Let’s call this a Predictive Service Specification
  5. How much effort? Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision; 20% 30% 25% 5% 5% 15%; Data Collection
  6. Who Does What? Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision. Data Domain Engineers Data Analysts Data Scientists Business Intelligence Engineers
  7. Huge Variety of Tech: Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision, Data Collection. ETL? Ad-Hoc? ETL? Ad-Hoc? ETL? SQL? R? Python? Matlab? R? Python? R? Python? SAS? Java / Python, Business Rules Management System
  8. From Build to Run: Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision. ? Input Data, Decision. Build Time, Run Time
  9. How Do People Do That Today? Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision. PMML ETL WebService Script/SQL. Data Collection. A Predictive Service = Up to 4 different “Applications” that can run out-of-sync
  10. Some Integrated Per-Platform Approaches: in Database, in SAS, in Hadoop/Spark. SQL Commercial Warehouse + Scoring UDF, End-to-end integration script, Ad-hoc development
  11. Top Companies invested a lot: each probably >5M$ in their ML production platform
  12. Reason 1: Prohibitive Costs kill projects. Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision: R SQL Python R. Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision: SQL ETL WebService SQL PMML. 300K$ 50K$ 200K$ 100K$ 50K$ 650K$
  13. Reason 2: Distribution Drift. New behaviour, new product, new competitor: the model stops working as planned. You need to be able to do a same-week update
  14. Reason 3: Mitigate with Data Hazards. You need to be able to do a same-week update. Most interesting “Big Data” sources are fragile
  15. Reason 4: Decide is beyond Predict. Most interesting problems require combining Models + Heuristics + Non-local Optimization
  16. Reason 5: “Suits ready” for scalability. Data Prep, Domain Specific Feature Eng., Feature Eng., Model(s), Scoring / Decision. Your CTO could certainly keep it up and running all by himself
  17. Imagine the Dream Platform That Would Solve All This. Let’s call it Blue Box. New Data, Decision
  18. Feature: Cleanse, Enrich and Merge. Blue Box must be the perfect Data Blending runtime
  19. Feature: Aggregating Data. Raw Events Stream, Aggregate State. Consolidating History Must be part of Blue Box. 1TB-100TB+, 100MB-10GB
  20. Feature: External Data Compliant. Main data, enriched main data, additional data (e.g. Census, Map, etc.). Third Data Data Must Be “In” the Blue Box
  21. Feature: Update Data Service. Smart Lazy Human. A/B Test Support in Blue Box. Decision Ver. A, Decision Ver. B. P D F M S, New Model
  22. Feature: Programmatic Decision. Need for Business Compliant “Real-Time” Rules in Blue Box. model 1, model 2, model 3: if combine with. if proba > 0.63 decision A else decision B. if proba > 0.79 decision A else decision B
  23. Feature: Audit and Logs. Smart Lazy Human? Blue Box needs to keep track of its decisions and why. Decision, Cause, Log
  24. What does Blue Box look like? External Data Advanced Join / Matching Ad-Hoc Transformation Python / R / Spark DataFrame transformations SQL Like Transformations Scoring Causes / Audit A/B Test Support Model Rollback / Versioning Prediction Log. Stats / Audit Ad-hoc scoring/decision code/scoring Open Source
  25. Interesting / Potential Open Source Projects: Real-Time Entity Update, Management, Scoring; Open Source PMML Scoring in Java; Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning
  26. How will we create the “blue box”? Specification? PMML Extension? Open Source Framework? Hadoop / Spark Specific?
  27. Thank you! is blue. Convince decision makers to make data their competitive advantage. florian.douetteau@dataiku.com, jobs@dataiku.com. Wanna work on this topic? Wanna share your dream features?
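
The Blue Box features on slides 21 to 23 (A/B test support, programmatic decision rules, audit and logs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform the talk has in mind: the class and function names (`BlueBox`, `DecisionRecord`, `combine`, `pick_version`) are hypothetical, the slides do not say how the three model outputs are combined (a plain mean is assumed here), and the hash-based A/B assignment is one common choice among many. Only the thresholds 0.63 and 0.79 and the "decision A else decision B" rule come from the slides.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class DecisionRecord:
    """One audit-log entry: what was decided, by which rule version, and why."""
    entity_id: str
    rule_version: str
    proba: float
    decision: str
    cause: str


def combine(probas: Sequence[float]) -> float:
    """Combine several model outputs. ASSUMPTION: a plain mean."""
    return sum(probas) / len(probas)


def pick_version(entity_id: str, versions: Sequence[str]) -> str:
    """Stable A/B assignment: the same entity always gets the same version.

    ASSUMPTION: hash-based bucketing; the slides only ask for A/B support.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    return versions[int(digest, 16) % len(versions)]


class BlueBox:
    """Hypothetical decision service: threshold rules plus a decision log."""

    def __init__(self, thresholds: Dict[str, float]) -> None:
        self.thresholds = thresholds          # rule version -> threshold
        self.log: List[DecisionRecord] = []   # in-memory audit trail

    def decide(self, entity_id: str, probas: Sequence[float],
               version: Optional[str] = None) -> DecisionRecord:
        if version is None:
            version = pick_version(entity_id, sorted(self.thresholds))
        threshold = self.thresholds[version]
        proba = combine(probas)
        # Rule from slide 22: if proba > threshold, decision A, else decision B.
        if proba > threshold:
            decision, cause = "A", f"proba {proba:.2f} > {threshold:.2f}"
        else:
            decision, cause = "B", f"proba {proba:.2f} <= {threshold:.2f}"
        record = DecisionRecord(entity_id, version, proba, decision, cause)
        self.log.append(record)               # keep track of decisions and why
        return record


if __name__ == "__main__":
    box = BlueBox({"ver_a": 0.63, "ver_b": 0.79})
    print(box.decide("customer-42", [0.6, 0.7, 0.8]))
    print(len(box.log), "decision(s) logged")
```

Keeping the rule version and the cause next to each decision is what would make a same-week rule update or a rollback auditable; a real system would presumably persist the log rather than hold it in memory.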
