idalab seminar #4 A journey from scikit-learn to Spark

Zalando is Europe’s leading online fashion retailer and is on its way to becoming the platform for all fashion-related business: designers, producers, logistics solutions, and online and offline retailers. The new platform architecture challenges the company’s current in-house solutions, including the fraud detection infrastructure, to become more scalable, dependable and versatile.

This talk is a travelogue describing the journey of rewriting an in-production classification system from scratch, using Scala and Spark, to run on AWS. Dragiev and Krueger start by looking at the drawbacks inherent to the old sklearn-based solution running on a static cluster, most prominently hard maintenance, data bottlenecks and too coarse-grained parallelisation. They then outline the design principles underlying the new Scala/Spark-based solution. Often neglected aspects such as model specification, data organisation, evaluation and keeping track of multiple models are also discussed.

Dragiev and Krueger’s new solution mitigates the identified pain points by leveraging features that Scala and Spark bring into play, in particular strong typing, data parallelisation and easy scale-out. The talk ends with a comparison of both solutions, with measurements that highlight the performance gains experienced with Spark for both learning and prediction times.
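
To make "data parallelisation" concrete, here is a minimal sketch in plain Python, not the authors' Scala/Spark code. It assumes a logistic-regression gradient as the learning step; the helper names (`partial_gradient`, `split`, `distributed_gradient`) and the toy data are hypothetical. Each partition computes a partial gradient independently (the part a cluster would spread across workers), and the partial results are summed in a single reduce step.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partial_gradient(weights, partition):
    """Sum of per-example logistic-loss gradients over one partition."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
        error = prediction - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def add_vectors(a, b):
    return [x + y for x, y in zip(a, b)]


def split(data, n_partitions):
    """Round-robin split; a stand-in for a cluster's data partitioning."""
    return [data[i::n_partitions] for i in range(n_partitions)]


def distributed_gradient(weights, data, n_partitions):
    partitions = split(data, n_partitions)
    # "Map" step: every partition is processed independently of the others.
    # Here this runs sequentially; on a cluster each worker would take one.
    partials = [partial_gradient(weights, p) for p in partitions]
    # "Reduce" step: combine the partial sums and average over all examples.
    total = reduce(add_vectors, partials)
    return [g / len(data) for g in total]


# Hypothetical toy data: (features, label) with a constant bias feature.
data = [
    ([1.0, 0.5], 1),
    ([1.0, -1.5], 0),
    ([1.0, 2.0], 1),
    ([1.0, -0.3], 0),
    ([1.0, 0.1], 1),
]
weights = [0.0, 0.0]

single = distributed_gradient(weights, data, 1)
parallel = distributed_gradient(weights, data, 3)
```

Because the gradient is a sum over examples, the partitioned result matches the single-partition result up to floating-point rounding. This is why adding workers can shrink learning time without changing the model, whereas parallelising only across whole models is limited by the largest single training job.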

Published in: Data & Analytics

  1. 1. Agency for Data Science: Machine learning & AI, Mathematical modelling, Data strategy. Stanimir Dragiev, Tammo Krueger: Towards scalable fraud detection. A journey from scikit-learn to Spark. idalab talks | January 20th 2017
  2. 2. A Journey from Scikit-learn to Spark. 20.1.2017. Stanimir Dragiev & Tammo Krueger (tammok, @ssdpd)
  3. 3. EUROPE’S LEADING FASHION PLATFORM • 15 countries • 3 fulfillment centers • +19 million active customers • +€3.0 billion revenue • +160 million visits per month • +1,600 employees in tech • tech.zalando.com • radicalagility.org
  4. 4. HOW IT LOOKS
  5. 5. TECHNOLOGY HISTORY (figure: growth from ~30 to 1600+)
  6. 6. OUTLINE • WHAT WE DO • PREVIOUS SETUP • REDESIGN • SPARK ODDITIES • FRAMEWORK CANDY
  7. 7. WHAT WE DO
  8. 8. TEAM PAYMENT ANALYTICS Goal: estimate fraud risk for incoming orders. This requires us to master: • Machine learning • Production code • A diverse tech stack (Spark, Scala, R, SQL, AWS, Jenkins, Docker, Python, sh, ...)
  9. 9. MODELING FRAUD PROBABILITIES Fraud probability p_fraud: beyond business rules ... but can be modeled via machine learning: • data-driven • unbiased • reproducible. Challenges: • multiple application domains • large variety of models • frequent updates
  10. 10. PREVIOUS SETUP
  11. 11. ML ARCHITECTURE
  12. 12. DEPLOYMENT OF MODELS
  13. 13. SCALABILITY • Online shop -> fashion platform [1]: connect all stakeholders in fashion (online and offline retailers, advertising agencies, logistic services, ...) • Number of orders increases [2]: 41.4m (2014), 55.3m (2015), 68.7m (2016, projected) • Scalability of our services is now a major concern. [1] https://blog.zalando.com/en/blog/how-zalando-becoming-online-fashion-platform-europe [2] https://corporate.zalando.com/en/zalando-continues-high-growth-path
  14. 14. PAIN POINTS OF PREVIOUS SETUP • No horizontal scaling of learning (memory bottleneck) • Python: fast initial development, but hard to maintain • Too many deployment units for prediction service • Coupled learning and data fetching (bottleneck: EXAsol) • Learning on in-house cluster ⇝ bottleneck
  15. 15. REDESIGN
  16. 16. SPARK IN SCALA ON AWS • AWS promises: (nearly) unlimited computing power; cheap data storage for unifying input data • Scala promises: type safety; real multithreading • Spark promises: seamless horizontal scaling; high-level APIs for machine learning (MLlib)
  17. 17. HORIZONTAL SCALING
  18. 18. LEARNING TIME • Scenario 1, old solution: Python-based learning framework; in-house cluster on a single machine with 10 cores • Scenario 2, new solution: Spark-based learning framework; run on AWS with 1 master and 5 workers → Overall learning time drops by a factor of two
  19. 19. DATA SIZE • Py/slurm: limited by a single node’s memory • Spark/AWS: horizontal scaling; 10x more data points in the same time → Able to train with more data; higher accuracy
  20. 20. SCALA RUNTIME Real multithreading on the JVM
  21. 21. PREDICTION TIME
  22. 22. LINES OF CODE (counted with http://cloc.sourceforge.net)
      Language  Comment  Code
      Scala     1192     3322
      Python    411      6314
  23. 23. SPARK ODDITIES
  24. 24. SPARSE DATA Submodels took too long to learn; their data is sparse and of high dimensionality. Spark supports sparse data points ...
  25. 25. SPARSE DATA Submodels took too long to learn; their data is sparse and of high dimensionality. Spark supports sparse data points ... but what does the optimizer do?
  26. 26. SPARSE DATA: THE CONDENSER
  27. 27. OTHER ODDITIES • Performance of random forest different • Double saving of random forests • A lot of (undocumented) settings • ml (dataframe, pipeline) versus mllib (RDD) • Distributed learning hard to debug • Balancing of data sets just for linear methods
  28. 28. FRAMEWORK CANDY
  29. 29. UNIFY DATA SOURCES ON AWS
  30. 30. HIERARCHICAL MODELS
  31. 31. TYPE SAFETY FOR CONFIGURATION
  32. 32. LIFECYCLE OF A MODEL
  33. 33. SUMMARY
  34. 34. REDESIGN CONCLUSION • Steep learning curve for Scala • Hard to debug distributed execution • Maturity level of MLlib (as of version 2.0) • Spark & AWS eliminated all pain points • Speedup in learning and execution • Better support for model lifecycle
  35. 35. THANK YOU! Stanimir Dragiev & Tammo Krueger https://tech.zalando.com/blog/scalable-fraud-detection-fashion-platform
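
The figure on the "SPARSE DATA: THE CONDENSER" slide is not preserved in this transcript, so what the condenser actually does is not known from the text. One plausible reading, offered only as an assumption, is feature index compaction: map the feature indices that actually occur in the training data onto a small contiguous range, so that anything the optimizer allocates per dimension stays small. The sketch below is plain Python with hypothetical names (`build_index_map`, `condense`, `expand_weights`) and hypothetical data; it is not Spark code and not the authors' implementation.

```python
def build_index_map(rows):
    """Map every feature index that occurs at least once to a compact index."""
    active = sorted({idx for row in rows for idx in row})
    return {old: new for new, old in enumerate(active)}


def condense(rows, index_map):
    """Rewrite sparse rows into the compact index space.

    Indices missing from the map (never seen during training) are dropped.
    """
    return [
        {index_map[idx]: value for idx, value in row.items() if idx in index_map}
        for row in rows
    ]


def expand_weights(condensed_weights, index_map, original_dim):
    """Scatter learned weights back to the original feature indices."""
    full = [0.0] * original_dim
    for old, new in index_map.items():
        full[old] = condensed_weights[new]
    return full


# Hypothetical sparse rows in a 1,000,000-dimensional space: {index: value}.
rows = [
    {3: 1.0, 900_000: 2.0},
    {3: 0.5, 42: 1.0},
    {900_000: 1.5},
]

index_map = build_index_map(rows)
condensed = condense(rows, index_map)
condensed_dim = len(index_map)
```

In this toy case a learner working on `condensed` needs a weight vector of length `condensed_dim` rather than 1,000,000, and `expand_weights` restores the original indexing when the model has to be inspected or exported.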
