Data Science, Delivered Continuously @ GOTO Berlin 2017

A talk by Dr. Arif Wider (ThoughtWorks) and Christian Deger (AutoScout24)

AutoScout24 is the largest Europe-wide online marketplace for new and used cars. With more than 2.4 million listings across Europe, AutoScout24 has access to large amounts of data about historic and current market prices and wants to use this data to help its users make informed decisions about selling and buying cars. We created a live price estimation service for used vehicles, based on a Random Forest prediction model, that is continuously delivered to the end user.

Predictive analytics of this sort is often used only to guide company-internal decision making. Delivering a predictive analytics product straight to the end user poses an entirely different set of requirements with respect to (1) performance and (2) automated quality control.

To avoid the effort of handcrafting a high-performance implementation of a complex prediction model, many companies fall back on primitive prediction models in such situations. Learn how we achieved superb performance and scalability without manual optimization and without sacrificing prediction accuracy.

For quality control, Continuous Delivery is already an established approach in modern web application development: it allows for much shorter product release cycles and therefore makes it possible to innovate rapidly and adapt to user needs. In predictive analytics, however, Continuous Delivery has rarely been applied so far. Learn how automated verification using live test data sets in a continuous delivery pipeline allows us to release model improvements with confidence at any given time. This way our users benefit immediately from the work of our data scientists.
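
The automated verification step described above can be pictured as a release gate in the pipeline. The following is only a minimal sketch under stated assumptions: the error metric (median absolute percentage error), the 10% threshold, the function names `release_gate` and `toy_model`, and the toy listings are all hypothetical and not taken from the talk.

```python
from statistics import median


def median_abs_pct_error(predicted, actual):
    """Median of |predicted - actual| / actual over paired prices."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return median(abs(p - a) / a for p, a in zip(predicted, actual))


def release_gate(candidate_predict, test_listings, max_error=0.10):
    """Return (ok, error); ok is True when the candidate model may be released.

    candidate_predict: hypothetical callable mapping a listing dict to a price.
    test_listings: list of (listing, observed_price) pairs.
    max_error: assumed threshold, not a figure from the talk.
    """
    predicted = [candidate_predict(listing) for listing, _ in test_listings]
    actual = [price for _, price in test_listings]
    error = median_abs_pct_error(predicted, actual)
    return error <= max_error, error


# Toy test set: (listing, observed price in EUR). Values are made up.
TEST_LISTINGS = [
    ({"make": "Volkswagen", "model": "Golf", "age_years": 3}, 15000.0),
    ({"make": "Volkswagen", "model": "Golf", "age_years": 6}, 10000.0),
    ({"make": "Volkswagen", "model": "Golf", "age_years": 9}, 6000.0),
]


def toy_model(listing):
    """Stand-in for a trained model: linear depreciation with age."""
    return 20000.0 - 1500.0 * listing["age_years"]


if __name__ == "__main__":
    ok, error = release_gate(toy_model, TEST_LISTINGS)
    print(f"median abs pct error = {error:.3f}, release allowed = {ok}")
```

In a pipeline, a step like this would typically fail the build when the gate returns `False`, so that a model that regresses on the test set never reaches production.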

Published in: Software

  1. DATA SCIENCE, DELIVERED CONTINUOUSLY. Arif Wider & Christian Deger, @arifwider @cdeger
  2. Christian Deger, Chief Architect, cdeger@autoscout24.com, @cdeger
  3. Dr. Arif Wider, Senior Consultant/Developer, awider@thoughtworks.com, @arifwider
  4. PL, S, RUS, UA, RO, CZ, D, NL, B, F, A, HR, I, E, BG, TR. 18 countries, 2.4m+ cars & motos, 10m+ users per month
  5-7. The task: A consumer-facing data product (slide footer: GOTO Berlin 2017 Data Science, Delivered Continuously – A. Wider & C. Deger)
  8. The prediction model: Random forest. Car listings of the last two years. Volkswagen Golf
  9-11. How to turn an R-based prediction model into a high-performance web application?
  12. How to turn an R-based prediction model into a high-performance web application? Continuous Delivery!
  13. Typical delivery pipeline. Application code in one repository per service.
  14. Typical delivery pipeline. Application code in one repository per service. CI: Deployment package as artifact.
  15. Typical delivery pipeline. Application code in one repository per service. CI: Deployment package as artifact. CD: Deliver package to servers.
  16. Continuous delivery pipelines: Prediction Model Pipeline
  17. Continuous delivery pipelines: Prediction Model Pipeline, Web Application Pipeline
  18-19. The price for CD: Extensive model validation
  20. Lessons learned: Form a cross-functional team of data scientists & software engineers! Software engineers learn how data scientists work and understand the quirks of a prediction model. Data scientists learn about unit testing, stable interfaces, git, etc., and get quick feedback about the impact of their work. Model and product iterations become much faster!
  21. Lessons learned: Generating gigabytes of Java code is a challenge for the JVM. Use the G1 garbage collector. Turn off Tiered Compilation. Do extensive warm-ups.
  22. Lessons learned: Warm up
  23. Lessons learned: The approach of applying Continuous Delivery to Data Science is useful independently of the tech. Successfully applied similarly to a Python- and Spark-based project. Even more useful when quick model evolution is required because of rapidly changing inputs (e.g. user interaction).
  24. Conclusions: Continuous Delivery allows us to bring prediction model changes live very quickly. Only extensive automated end-to-end tests provide confidence to deploy to production automatically. Java code generation allows for very low response times and excellent scalability for high loads but requires plenty of memory.
  25-28. Conclusions: Price evaluation everywhere
  29. THANK YOU. QUESTIONS? Arif Wider & Christian Deger, @arifwider @cdeger
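
The slides mention Java code generation for the prediction model but do not show the generator. The following is a rough sketch of the general idea, assuming a hypothetical nested-dict tree representation and using Python as the generator language. None of the names (`tree_to_java`, `forest_to_java`, `GeneratedForest`) or tree values come from the talk.

```python
def tree_to_java(node, indent=2):
    """Render one decision tree as nested Java if/else statements."""
    pad = " " * indent
    if "value" in node:  # leaf: return the predicted price
        return f"{pad}return {node['value']};\n"
    return (
        f"{pad}if ({node['feature']} <= {node['threshold']}) {{\n"
        + tree_to_java(node["left"], indent + 2)
        + f"{pad}}} else {{\n"
        + tree_to_java(node["right"], indent + 2)
        + f"{pad}}}\n"
    )


def forest_to_java(trees, features, class_name="GeneratedForest"):
    """Render a forest as a Java class whose predict() averages all trees."""
    if not trees:
        raise ValueError("forest must contain at least one tree")
    params = ", ".join(f"double {name}" for name in features)
    args = ", ".join(features)
    parts = [f"public final class {class_name} {{\n"]
    for i, tree in enumerate(trees):
        parts.append(f"  static double tree{i}({params}) {{\n")
        parts.append(tree_to_java(tree, indent=4))
        parts.append("  }\n")
    calls = " + ".join(f"tree{i}({args})" for i in range(len(trees)))
    parts.append(f"  public static double predict({params}) {{\n")
    parts.append(f"    return ({calls}) / {len(trees)}.0;\n")
    parts.append("  }\n}\n")
    return "".join(parts)


def predict(trees, row):
    """Reference evaluation in Python, mirroring the generated Java logic."""
    def walk(node):
        while "value" not in node:
            go_left = row[node["feature"]] <= node["threshold"]
            node = node["left"] if go_left else node["right"]
        return node["value"]
    return sum(walk(tree) for tree in trees) / len(trees)


# Toy forest with made-up thresholds and prices, for illustration only.
FEATURES = ["ageYears", "mileageKm"]
TREES = [
    {"feature": "ageYears", "threshold": 4.0,
     "left": {"value": 14000.0}, "right": {"value": 8000.0}},
    {"feature": "mileageKm", "threshold": 60000.0,
     "left": {"value": 13000.0},
     "right": {"feature": "ageYears", "threshold": 8.0,
               "left": {"value": 9000.0}, "right": {"value": 5000.0}}},
]

if __name__ == "__main__":
    print(forest_to_java(TREES, FEATURES))
    print(predict(TREES, {"ageYears": 3.0, "mileageKm": 80000.0}))
```

Generated source of this shape grows with the number and depth of the trees, which is consistent with the slides' remark that gigabytes of Java code are a challenge for the JVM.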
