
How mobile.de brings Data Science to Production for a Personalized Web Experience

As Germany's biggest online car marketplace, mobile.de provides a personalized web experience. Our Data Team leverages the interactions of our users to infer their preferences. For these tasks we often use Python and Spark to wrangle massive amounts of data. In this talk, we present our personalization use cases as well as the application of PySpark in production.

Published in: Technology


  1. How mobile.de brings Data Science to Production for a Personalized Web Experience. Dr. Markus Schüler & Dr. Florian Wilhelm, 2018-07-08, PyData 2018, Berlin
  2. Introduction: Dr. Florian Wilhelm, Data Scientist, inovex GmbH (@FlorianWilhelm, FlorianWilhelm, florianwilhelm.info); Dr. Markus Schüler, Data Scientist & Team Lead, mobile.de GmbH
  3. Agenda • General Introduction • Personalization Use Cases at mobile.de • Predicting Car Buying Intent • Python for Big Data Processing • Optimizing Performance
  4. [Slide without text]
  5. mobile.de: German market leader · 13.5 million unique users per month · 1.6 million vehicles · 290 employees · Headquarters in Dreilinden / Friedrichshain, Berlin · Part of eBay Tech
  6. IT-project house for digital transformation: ‣ Agile Development & Management ‣ Web · UI/UX · Replatforming · Microservices ‣ Mobile · Apps · Smart Devices · Robotics ‣ Big Data & Business Intelligence Platforms ‣ Data Science · Data Products · Search · Deep Learning ‣ Data Center Automation · DevOps · Cloud · Hosting ‣ Training & Coaching. Using technology to inspire our clients. And ourselves. inovex offices in Karlsruhe · Cologne · Munich · Pforzheim · Hamburg · Stuttgart. www.inovex.de
  7. Why Recommendations? Why Personalization? Inspiration · Engagement · Memory of past interactions · You are unique!
  8. Why Personalization? Data-Driven Personalization Improves: User Experience, User Engagement. Source: https://www.kleinerperkins.com/perspectives/internet-trends-report-2018
  9. Personalization at mobile.de. Components shown: User Event Tracking & Storage; User Interactions; Daily activity profiles; Daily preference profiles; User Car Preferences; Recommendations; Segmentation. Example user profile (User 12345): Looking For: Used Car (100%); Prefers (Make): BMW (50%), Audi (50%); Prefers (Model): Audi A3 (25%), Audi A4 (25%), BMW 318 (50%); Searching In: lat 52.5206, lon 13.409; Search Radius: 300 km; Preferred Price: 20 000 € ± 1500 €; Preferred Mileage: 10 000 km ± 5000 km; Buyer; Last Action: Yesterday; Frequent User; Likelihood to buy: 88 %
  10. User Car Preference Example: user preferences based on the user’s interactions (anonymous). User 12345: Looking For: Used Car (100%); Prefers (Make): BMW (50%), Audi (50%); Prefers (Model): Audi A3 (25%), Audi A4 (25%), BMW 318 (50%); Searching In: lat 52.5206, lon 13.409; Search Radius: 300 km; Preferred Price: 20 000 € ± 1500 €; Preferred Mileage: 10 000 km ± 5000 km; Buyer; Marketing; Last Action: Yesterday; Frequent User
  11. Uncertainty Quantification: Bayesian Approach. Posterior probability ∝ Likelihood × Prior probability. User Preferences + Prior based on all users (30% Volkswagen, 25% gray, 50% automatic, 8% SUV, 10,000 €) → Posterior User Preferences. [Figure: impact of prior (avg. user) vs. number of user events]
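The Bayesian update on this slide can be illustrated with a minimal sketch. It assumes a Dirichlet prior over car makes with a multinomial likelihood, which gives a closed-form posterior mean; the slides do not say which prior or likelihood mobile.de actually uses. The function name `posterior_preferences`, the parameter `prior_strength` and the prior shares are hypothetical.

```python
from collections import Counter


def posterior_preferences(events, prior, prior_strength=10.0):
    """Posterior mean of make preferences for one user.

    Assumption: Dirichlet prior with pseudo-counts prior_strength * prior[make]
    and a multinomial likelihood over the makes listed in `prior`.
    """
    counts = Counter(events)
    # Only events whose make appears in the prior are counted.
    observed = sum(counts[make] for make in prior)
    total = observed + prior_strength
    return {
        make: (counts[make] + prior_strength * share) / total
        for make, share in prior.items()
    }


# Hypothetical prior standing in for the "average user".
prior = {"Volkswagen": 0.3, "BMW": 0.2, "Audi": 0.2, "Other": 0.3}

# One event: the posterior stays close to the prior.
few = posterior_preferences(["BMW"], prior)
# One hundred events: the user's own behaviour dominates.
many = posterior_preferences(["BMW"] * 50 + ["Audi"] * 50, prior)
```

In this model the weight of the prior is `prior_strength / (n + prior_strength)` for a user with `n` events, so it shrinks as more events are observed.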
  12. Recommendation: Mobile.de Recommendation Engine. Inputs shown: All Listings; Features of vehicle; Content-based Information (User Preferences); Collaborative Information. Example user profile as on slide 9 (User 12345, Likelihood to buy: 88 %)
  13. Personalization at mobile.de (same overview as slide 9). Components shown: User Event Tracking & Storage; User Interactions; Daily activity profiles; Daily preference profiles; User Car Preferences; Recommendations; Segmentation. Example user profile as on slide 9 (User 12345, Likelihood to buy: 88 %)
  14. Different User Intents: “I have no idea about cars. I need basic information and guidance.” “I’m a car expert. Lead me to the best deals in the fastest way.” “I love to browse expensive cars, yet I have no buying intent.” “As a dealer, I need detailed data to compare my own listings with my competitors’.”
  15. Events of a Car Buying Journey: contacts, parkings, views
  16. Analysing events of car buyers (control vs. buyers). Events total: 72,621,069 vs. 2,500,771. Median events: 153 vs. 188. Median days active: 22 vs. 15
  17. User Events: Event counts. [Figure: four panels (contact, parking, search, view) showing average event count vs. position in user journey for Buyer and Control, with local mean, linear model and lowess fits]
     contact: Buyer slope p = 1.815e−22 ***; Control intercept diff p = 9.823e−02 .; Control slope diff p = 9.956e−04 ***
     parking: Buyer slope p = 7.999e−06 ***; Control intercept diff p = 1.399e−21 ***; Control slope diff p = 6.702e−06 ***
     search: Buyer slope p = 6.694e−51 ***; Control intercept diff p = 1.141e−01; Control slope diff p = 9.044e−07 ***
     view: Buyer slope p = 1.824e−08 ***; Control intercept diff p = 2.506e−45 ***; Control slope diff p = 2.824e−02 *
  18. User Events: Duplicated views • Buyers look at cars they have already seen more often than the control group does, and their ratio of duplicated views increases faster (both significant). [Figure: amount of duplicated views vs. position in user journey, Buyer and Control]
  19. When did buyers interact with the car they bought? § Buyers view “their” car most often about four fifths of the way through their user journey. [Figure: “When do buyers view the car they buy?”, % of users vs. position in user journey (10% to 100%)]
  20. ML Model: How close to buy? § Aim: predict how likely a user is to make his buying decision today § Personalization § Highlight dealer contact details § Provide car buying assistance
  21. Feature Generation. Features: § Event counts (view, search, contact, parking) § % of each event type among all events (e.g. % views among all events) § a = number of active days, b = max-diff active days, ratio a/b § Additional features: § Views/(Search+View) § % of duplicated views among all views. Time windows over the 30 days before the buying date (= 0): 0-2 days, 3-9 days, 10-30 days
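The features listed on this slide could be computed per user roughly as follows. This is a sketch under assumptions, not the authors' code: the event layout `(event_type, listing_id, day)` and all feature names are hypothetical, and "max-diff active days" is interpreted here as the span between the first and the last active day.

```python
def journey_features(events):
    """Build features from one user's events.

    events: list of (event_type, listing_id, day) tuples, where day is an
    integer day index (hypothetical layout).
    """
    types = ("view", "search", "contact", "parking")
    counts = {t: sum(1 for e in events if e[0] == t) for t in types}
    total = sum(counts.values())

    features = {}
    for t in types:
        features[f"n_{t}"] = counts[t]
        features[f"pct_{t}"] = counts[t] / total if total else 0.0

    # a = number of active days, b = span between first and last active day
    days = sorted({e[2] for e in events})
    a = len(days)
    b = days[-1] - days[0] if days else 0
    features["active_days"] = a
    features["max_diff_days"] = b
    features["active_ratio"] = a / b if b else 0.0

    # Views / (Search + View)
    denom = counts["search"] + counts["view"]
    features["view_share"] = counts["view"] / denom if denom else 0.0

    # Share of views that hit a listing the user had already viewed
    viewed = [e[1] for e in events if e[0] == "view"]
    duplicated = len(viewed) - len(set(viewed))
    features["pct_duplicated_views"] = duplicated / len(viewed) if viewed else 0.0
    return features


events = [
    ("search", None, 0),
    ("view", "A", 0),
    ("view", "B", 1),
    ("view", "A", 4),
    ("parking", "A", 4),
    ("contact", "A", 4),
]
features_example = journey_features(events)
```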
  22. Modelling § Logistic Regression § Automatic Feature Selection § start from different sub-selections of features (like “all”, “no ratios”, etc.) § allow addition and subtraction of features based on minimizing AIC § needed to prevent overfitting § Window optimization
  23. Window size optimization § Window size and number of windows used as optimization criteria. Window layouts over the 30 days before the buying date (= 0): 0-2 / 3-9 / 10-30 days; 0 / 1-9 / 10-30 days; 0 / 1-7 / 8-30 days; 0-9 / 10-19 / 20-30 days; 0 / 1-4 / 5-9 / 10-30 days
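The window layouts on this slide can be written as inclusive day ranges, and event counts can then be bucketed per layout. The sketch below only illustrates that bucketing; the function name `window_counts` and the sample data are hypothetical, and how the resulting features feed into the optimization is not shown in the slides.

```python
def window_counts(days_before, windows):
    """Count events per time window.

    days_before: days between each event and the reference date (0 = same day)
    windows: list of inclusive (first_day, last_day) pairs
    """
    counts = {w: 0 for w in windows}
    for d in days_before:
        for first, last in windows:
            if first <= d <= last:
                counts[(first, last)] += 1
                break  # windows do not overlap, so stop at the first match
    return counts


# Candidate layouts as listed on the slide.
layouts = [
    [(0, 2), (3, 9), (10, 30)],
    [(0, 0), (1, 9), (10, 30)],
    [(0, 0), (1, 7), (8, 30)],
    [(0, 9), (10, 19), (20, 30)],
    [(0, 0), (1, 4), (5, 9), (10, 30)],
]

# Hypothetical events of one user; the event 45 days back falls outside
# every window and is ignored.
days = [0, 0, 1, 3, 8, 12, 29, 45]
per_layout = [window_counts(days, layout) for layout in layouts]
```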
  24. Modelling § Logistic Regression § Automatic Feature Selection § start from different sub-selections of features (like “all”, “no ratios”, etc.) § allow addition and subtraction of features based on minimizing AIC § needed to prevent overfitting § Window optimization § Cross-Validation (15 fold, 70/30 train/test split)
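The feature selection described here resembles greedy stepwise selection, sketched below under assumptions. The `aic` argument is a hypothetical callable that would fit a logistic regression on the given features and return AIC = 2k − 2 ln L (lower is better); a toy stand-in is used so the sketch runs without any modelling library.

```python
def stepwise_select(candidates, start, aic):
    """Greedy forward/backward selection that lowers AIC at every step.

    candidates: all available feature names
    start: initial subset (e.g. all features, or all without ratios)
    aic: callable mapping a frozenset of features to the model's AIC
    """
    current = frozenset(start)
    best = aic(current)
    improved = True
    while improved:
        improved = False
        best_subset = current
        # Each neighbour adds or removes exactly one feature.
        for feature in sorted(candidates):
            subset = current ^ {feature}
            score = aic(subset)
            if score < best:
                best, best_subset, improved = score, subset, True
        current = best_subset
    return current, best


def toy_aic(features):
    # Hypothetical log-likelihood: only "views" and "contacts" are informative.
    log_lik = -100.0 + 10.0 * ("views" in features) + 5.0 * ("contacts" in features)
    return 2 * len(features) - 2 * log_lik


all_features = ["views", "contacts", "noise"]
from_all = stepwise_select(all_features, all_features, toy_aic)
from_none = stepwise_select(all_features, [], toy_aic)
```

Starting from several initial subsets, as the slide suggests, reduces the chance that the greedy search stops in a poor local optimum.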
  25. Results. Prediction: The user made his buying decision today. Best Model: 72% Accuracy / 68% Sensitivity / 76% Specificity. [Figure: modelling statistics for closeToBuy_now_cid (Accuracy, Sensitivity, Specificity, axis range 0.65 to 0.80) for five models, Model1 to Model5: closeToBuy_now_0−1−10−30_cid, closeToBuy_now_0−1−7−30_cid, closeToBuy_now_0−10−20−30_cid, closeToBuy_now_0−3−10−30_cid, closeToBuy_now_0−5−10−30_cid]
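For reference, the three reported statistics follow from the confusion matrix of a binary classifier. The sketch below uses made-up labels, not the data behind the slide, and assumes both classes occur in `y_true`.

```python
def classification_stats(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = buys today)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of actual buyers detected
        "specificity": tn / (tn + fp),  # share of non-buyers correctly rejected
    }


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
stats = classification_stats(y_true, y_pred)
```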
  26. Buys tomorrow, next week, next two weeks: considerably lower predictive power when predicting more distant future events. Still room for improvement. [Figure: Accuracy, Sensitivity and Specificity (0% to 80%) for Buy Today, Buy Tomorrow, Buy in a Week, Buy in two Weeks]
  27. Python & Big Data
  28. Hive for heavy lifting • Apache project • built on top of Hadoop • SQL interface to your data • basically a map & reduce abstraction layer • robust and mature • but slow and surely not “interactive”. Data Team: • used for batch processing of user preferences, user segmentation etc. • PyHive by Dropbox for Python support • usage of Python-based UD(A)Fs
  29. User Defined Functions (UDFs). User defined (aggregation) functions: § needed when native functions aren’t sufficient § are usually much slower than native functions § work on a column or multiple (grouped) columns § are vector-valued operations and/or aggregations (transform, aggregate, apply)
  30. PySpark for fast analysis and machine learning: Spark (“fast and general engine for large-scale data processing”) + Python = pyspark
  31. Conversion Example of User Preferences. Hive: • 2483 lines of code • Jinja2 to generate SQL queries • Temporary tables for performance • Runtime 5-10 h • Logic hard to understand at times. Spark: • 1745 lines of code • programmatic definition of queries • No temporary tables needed • Runtime 1-2 h • Quite easy to understand. Example user profile as on slide 9 (User 12345, Likelihood to buy: 88 %)
  32. How Spark works (e.g. Jupyter lab). Source: Spark documentation
  33. How do Python UD(A)Fs work? Source: Spark documentation
  34. Apache Arrow. Source: Arrow documentation
  35. PySpark & Pandas. Vectorized UDFs for Spark 2.3: § built on top of Apache Arrow, § avoid high serialization and invocation overhead, § allow row-at-a-time UDFs and cumulative UDAFs, § as flexible as Pandas’ apply. Source: databricks blog
  36. Performance gains. Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
  37. But what if Spark < 2.3? It’s possible to write flexible UD(A)Fs by • using RDD functionality, df.rdd.mapPartitions(my_func) • converting low-level Row objects to a Pandas dataframe • wrapping everything into a nice decorator. Detailed information: https://www.inovex.de/blog/efficient-udafs-with-pyspark/
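The `mapPartitions` approach boils down to a function that consumes an iterator over the rows of one partition and yields results. The sketch below shows such a function for a per-user mean. It is an assumption-laden illustration, not the decorator from the linked blog post: it aggregates in plain Python instead of converting to a Pandas dataframe so that it runs without dependencies, and the column names `user_id` and `price` are hypothetical.

```python
from collections import defaultdict, namedtuple


def mean_price_per_user(rows):
    """Partition-level aggregation: yields (user_id, mean price) pairs.

    rows: iterator over the records of one partition. The result is a full
    aggregation only if all rows of a user sit in the same partition.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = sums[row.user_id]
        acc[0] += row.price
        acc[1] += 1
    for user_id, (total, n) in sums.items():
        yield user_id, total / n


# Stand-in for pyspark.sql.Row so the sketch runs without a Spark cluster.
Row = namedtuple("Row", ["user_id", "price"])
partition = [Row("u1", 10000.0), Row("u1", 20000.0), Row("u2", 5000.0)]
result = dict(mean_price_per_user(iter(partition)))
```

With a Spark DataFrame `df` holding these columns, the same function could be applied as `df.repartition("user_id").rdd.mapPartitions(mean_price_per_user)`, where the repartitioning places all rows of a user in one partition.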
  38. Isolated environments with PySpark
  39. Concept § create a local environment based on wheels, § upload the unpacked wheels to HDFS, § read and distribute these Python packages from the Spark driver to the executors with sc.addFile, § use the packages on the executors, e.g. in a UDF. Detailed information: https://www.inovex.de/blog/managing-isolated-environments-with-pyspark/
  40. Architecture
  41. Summary: PyData Stack · Interesting & Challenging Use Cases · Data Science · Data Engineering · Business Impact
  42. Any Questions?
