
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark (MAC303)



Customers are adopting Apache Spark ‒ an open-source distributed processing framework ‒ on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, a set of machine learning algorithms included with Spark, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts, and can access data from Amazon S3, Amazon Kinesis, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin and Jupyter notebooks, and use the Spark ML pipelines API to manage your training workflow. In addition, Jasjeet Thind, Senior Director of Data Science and Engineering at Zillow Group, will discuss his organization's development of personalization algorithms and platforms at scale using Spark on Amazon EMR.

Published in: Technology


  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Sr. Product Manager, Amazon EMR Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group November 29, 2016 MAC303 Zillow Group: Developing Classification and Recommendation Engines With Amazon EMR and Apache Spark
  2. 2. What to Expect from the Session • Apache Spark and Spark ML overview • Running Spark ML on Amazon EMR • Interactive notebook options • Building recommendation engines at Zillow Group
  3. 3. Spark for fast processing [Diagram: a DAG of RDDs A to F across Stages 1 to 3, using map, filter, join, and groupBy, with cached partitions marked] • Massively parallel • Uses DAGs instead of map-reduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffle
  4. 4. Spark components to match your use case
  5. 5. Spark ML addresses the full ML pipeline - Built on top of DataFrame API - Extract, transform, and select features - Distributed algorithms - Classification and Regression - Clustering - Collaborative Filtering - Model selection tools - Pipelines Process Data Feature Extraction Model Training Model Testing Model Validation
  6. 6. Extracting features in DataFrames - Feature Extractors - CountVectorizer - Feature Transformers - Tokenizer - Binarizer - StandardScaler - Feature Selectors - VectorSlicer
  7. 7. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR
  8. 8. Training data Bank loan write-off predictions
  9. 9. Classification algorithms in Spark ML - Logistic regression - Decision tree classifier - Random forest classifier - Gradient-boosted tree classifier - Multilayer perceptron classifier - One-vs-Rest classifier - Naive Bayes
  10. 10. What is logistic regression?
  11. 11. What are decision trees? Weather predictors for Golf
  12. 12. Decision trees: tree induction
  13. 13. Decision trees: partition data with hyperplanes
  14. 14. Spark ML pipelines - training
  15. 15. Spark ML pipelines - testing
  16. 16. Creating a Spark ML pipeline val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt)) val model = pipeline.fit(df) val predictions = model.transform(df) Save and load machine learning models and full Pipelines
  17. 17. Tools to pick the right model - CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters - Split the input data into separate training and test datasets - For each (training, test) pair, iterate through the set of ParamMaps - Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator
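The selection loop on slide 17 can be sketched without Spark. The following is a plain-Python stand-in, not the Spark `CrossValidator` API: the toy threshold estimator, the `offset` parameter, and the helper names are all hypothetical and exist only to show the loop of splitting the data, fitting each parameter map, and keeping the best average score.

```python
from statistics import mean


def k_fold_splits(rows, k):
    """Yield (train, test) pairs; fold i holds every k-th row starting at i."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


def fit(train, params):
    """Toy estimator (assumption): predict 1 when x exceeds mean(x) + offset."""
    threshold = mean(x for x, _ in train) + params["offset"]
    return lambda x: 1 if x > threshold else 0


def accuracy(model, test):
    """Evaluator: fraction of test rows whose label is predicted correctly."""
    return mean(1.0 if model(x) == y else 0.0 for x, y in test)


def cross_validate(rows, param_maps, k=2):
    """Return the parameter map with the best average score across folds."""
    best_params, best_score = None, float("-inf")
    for params in param_maps:
        score = mean(
            accuracy(fit(train, params), test)
            for train, test in k_fold_splits(rows, k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Made-up data: label is 1 when x >= 5.
data = [(x, 1 if x >= 5 else 0) for x in range(10)]
grid = [{"offset": -2}, {"offset": 0}, {"offset": 2}]
best_params, best_score = cross_validate(data, grid, k=2)
```

In Spark itself the same roles are played by an Estimator, a set of ParamMaps, and an Evaluator, as the slide lists.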
  18. 18. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Open-Source Variety Latest versions of software Managed Spend less time monitoring Secure Easy to manage options Flexible Customize the cluster
  19. 19. Develop fast using notebooks and IDEs
  20. 20. • Run Spark Driver in Client or Cluster mode • Spark application runs as a YARN application • SparkContext runs as a library in your program, one instance per Spark application. • Spark Executors run in YARN Containers on NodeManagers in your cluster • Access Spark UI through the Resource Manager or Spark History Server Spark on YARN Spark UI
  21. 21. Monitor your Spark jobs
  22. 22. Auto Scaling for data science on-demand YARN metrics
  23. 23. Coming soon: advanced Spot provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal AZ based on capacity/price • Spot Block support
  24. 24. Productionizing your pipeline Amazon EMR Step API Submit a Spark application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
  25. 25. Recommendation Systems @ Zillow Group Jasjeet Thind Sr Director, Data Science & Engineering
  26. 26. Agenda Intro to Zillow Group Recommendation Use Cases Architecture Algorithms Training & Scoring Pipeline Metrics
  27. 27. Zillow Group Build the world's largest, most trusted, and vibrant home-related marketplace.
  28. 28. Recommendation use cases Email - homes for sale / for rent Home Details - homes for sale / homes like this Personalized Search Mobile - smart SMS and push notifications Home owner / pre-seller predictions Lender selection algorithm Similar photos / video
  29. 29. Architecture RECOMMENDATION API (Python, R, Flask) Zillow Group Data Lake (S3 / Kinesis) Property Featurization (Spark EMR) User Profiles (Spark EMR) Ranking (Spark EMR) Wedge Counting Collaborative Filtering (Spark EMR) Property Aggregate Features (Spark EMR) Data Collection Systems (Java/Python/SQL)
  30. 30. Like vs. dislike • Predict homes per user using behavior of similar users • Like = user actively engaged with property • Dislike = user viewed property but weak engagement [Diagram: Spencer and Stan linked to homes priced $22M, $19M, and $664K by like (+) and dislike (-) edges, with one edge unknown (?)] • Features: uid = unique id of user; pid = property id; first_visit = timestamp or 0; num_views = sigmoid(#views); time_spent = time on page; num_contacts = # leads sent; num_saves = # saves on zpid; num_shares = # shares on zpid; num_photos = # photos viewed
  31. 31. Wedge count • For all user & property pairs to form a prediction, perform wedge count • Example: does Stan like $19M? [Diagram, signs listed for Spencer then Stan: wedge #3 (wedge03_cnt) = $22M (+, -) and $19M (+, ?); wedge #5 (wedge05_cnt) = $664K (-, +) and $19M (+, ?)]
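A minimal sketch of the wedge count for one (user, property) pair, under stated assumptions. It treats a wedge as a three-edge path: the target user's sign on a shared property, another user's sign on that same property, and that other user's sign on the target property. The eight types are indexed by those three like/dislike bits. The bit ordering, the function name, and which user owns each sign in the diagram are assumptions, not taken from the talk.

```python
from collections import defaultdict

# Signed interactions: +1 = like, -1 = dislike (signs assumed from the diagram).
ratings = {
    ("Spencer", "$22M"): +1,
    ("Spencer", "$19M"): +1,
    ("Spencer", "$664K"): -1,
    ("Stan", "$22M"): -1,
    ("Stan", "$664K"): +1,
}


def wedge_counts(ratings, user, prop):
    """Count the 8 signed wedge types linking `user` to `prop`."""
    by_user = defaultdict(dict)
    by_prop = defaultdict(dict)
    for (u, p), sign in ratings.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = [0] * 8
    for other, c in by_prop[prop].items():  # c: other user's sign on target
        if other == user:
            continue
        for shared, a in by_user[user].items():  # a: target user's sign
            if shared == prop:
                continue
            b = by_user[other].get(shared)  # b: other user's sign on shared
            if b is None:
                continue
            # Assumed encoding: bit is 1 for a like, ordered (a, b, c).
            counts[4 * (a > 0) + 2 * (b > 0) + (c > 0)] += 1
    return counts


stan_19m = wedge_counts(ratings, "Stan", "$19M")
# → [0, 0, 0, 1, 0, 1, 0, 0]: one wedge of type 3 and one of type 5
```

With this encoding the toy data yields one wedge each for types 3 and 5, which lines up with the `wedge03_cnt` and `wedge05_cnt` labels on the slide, though the real numbering scheme may differ.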
  32. 32. Classifier • Gradient Boosting Classifier (sklearn) • Popular users / properties: divide wedge counts by degree product j_u * k_i • Prediction for all user / property pairs; limit candidate set by top 10 zip codes and 300 properties per user • Features: wedge00_cnt through wedge07_cnt and wedge00_norm_cnt through wedge07_norm_cnt • Example: does Stan like the $19M home? Features are computed for (uid: Stan, pid: $19M)
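The normalization step for popular users and properties might look like the sketch below. It assumes the "degree product" is the number of properties the user has interacted with times the number of users who have interacted with the property. That reading, and the helper names, are assumptions. The feature names follow the slide.

```python
def normalize_wedges(counts, user_degree, prop_degree):
    """Divide each wedge count by the user-degree * property-degree product."""
    denom = user_degree * prop_degree
    if denom == 0:
        return [0.0] * len(counts)
    return [c / denom for c in counts]


def feature_row(counts, user_degree, prop_degree):
    """Build the 16 classifier features: raw and normalized wedge counts."""
    norm = normalize_wedges(counts, user_degree, prop_degree)
    row = {}
    for i, (raw, scaled) in enumerate(zip(counts, norm)):
        row[f"wedge{i:02d}_cnt"] = raw
        row[f"wedge{i:02d}_norm_cnt"] = scaled
    return row


# Hypothetical input: wedge counts for one pair, user degree 2, property degree 1.
row = feature_row([0, 0, 0, 1, 0, 1, 0, 0], user_degree=2, prop_degree=1)
```

A row like this would then be one training or scoring example for the gradient boosting classifier named on the slide.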
  33. 33. User profile • Signals: website, mobile app, and search queries • Binary classification: labels (like/dislike) same as collaborative filtering model • User profile model determines preference scores • Features (categorical variables): Bath = 0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath; Bed = 0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed; Price = 100_125_price, 125_150_price, 150_175_price; Use Code = condo, single_family, farm_land; Zipcode = zip_98109 • Training rows: pid, uid, features (each 0 or 1), label (0 or 1) • Example preference scores: 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6
  34. 34. Ranking • Property matrix: feature space same as user profile • Dot product of property matrix with user profile vector • Age decay for older listings • Output is a score per (uid, pid), e.g. {"uId":"10307499", "pId":"1044183744"} scores 0.3364 • Example over features 0_bed, 1_bed, 2_bed, 3_bed: pid_0 = [1, 0, 0, 0], pid_1 = [0, 0, 1, 0], pid_2 = [1, 0, 0, 0], pid_3 = [0, 0, 0, 1]; user profile uid_0 = [0, 0.01, 0.8, 0.6]; resulting scores = [0, 0.8, 0, 0.6]
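The ranking step can be reproduced with the numbers on slide 34. The sketch below computes the dot product of each property's one-hot feature row with the user profile vector. The age decay is an assumption: the slide only says older listings are decayed, so the exponential form and the 30-day half-life are illustrative placeholders.

```python
FEATURES = ["0_bed", "1_bed", "2_bed", "3_bed"]

# Values from the slide: user preference scores and one-hot property rows.
user_profile = [0.0, 0.01, 0.8, 0.6]
property_matrix = {
    "pid_0": [1, 0, 0, 0],
    "pid_1": [0, 0, 1, 0],
    "pid_2": [1, 0, 0, 0],
    "pid_3": [0, 0, 0, 1],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score(profile, features, age_days=0.0, half_life_days=30.0):
    """Preference score, halved every `half_life_days` (assumed decay form)."""
    return dot(features, profile) * 0.5 ** (age_days / half_life_days)


def rank(profile, properties, ages=None):
    """Return (pid, score) pairs sorted from best to worst."""
    ages = ages or {}
    scored = [
        (pid, score(profile, feats, ages.get(pid, 0.0)))
        for pid, feats in properties.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


raw_scores = [score(user_profile, f) for f in property_matrix.values()]
# → [0.0, 0.8, 0.0, 0.6], matching the slide's result vector
aged = rank(user_profile, property_matrix, ages={"pid_1": 30.0})
```

Without decay, pid_1 ranks first. Once pid_1 is one half-life old, its score drops to 0.4 and pid_3 overtakes it.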
  35. 35. Training & scoring Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions. User Behavior (Kinesis /S3) Public Record (Kinesis / S3) Event API (Java) Producer (Python) Filter (Spark) User Store (Hive / S3) Spark job creates Hive table with user events (uid, pid) partitioned by date Active Listings (Kinesis / S3) Producer (Python) Training Data (Spark) Training Set (Hive / S3) pid -> uid reverse index Past and current user events Models (Python) Train Models (Spark) Score (Spark) Recommendations Property Data Collaborative Filtering / User Profile Models Hashmap (Redis) Wedge features or property features (user profile)
  36. 36. Offline evaluation • Hyperparameter tuning with validation set • Training/test data sets for model evaluation • Offline metrics: Precision (r_k = # recommended properties in test set in top k); Recall (n = total properties in the test set); Freshness (# listings recommended w/ modified date < y days old in top k); Coverage (# unique listings recommended across all users / total # unique listings)
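The offline metrics could be computed roughly as follows. The transcript keeps only the definitions of r_k and n, so this sketch assumes the standard forms precision = r_k / k and recall = r_k / n. Function names and the sample data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """r_k = recommended items in the top k that appear in the test set."""
    r_k = sum(1 for pid in recommended[:k] if pid in relevant)
    precision = r_k / k if k else 0.0
    recall = r_k / len(relevant) if relevant else 0.0
    return precision, recall


def freshness(recommended, age_days, k, max_age_days):
    """Number of top-k listings modified less than `max_age_days` ago."""
    return sum(1 for pid in recommended[:k] if age_days[pid] < max_age_days)


def coverage(recs_by_user, all_listings):
    """Unique listings recommended to anyone / total unique listings."""
    catalog = set(all_listings)
    if not catalog:
        return 0.0
    recommended = set()
    for recs in recs_by_user.values():
        recommended.update(recs)
    return len(recommended & catalog) / len(catalog)


# Made-up example data.
recs = ["a", "b", "c", "d"]
test_set = {"b", "d", "e"}
precision, recall = precision_recall_at_k(recs, test_set, k=2)
fresh = freshness(recs, {"a": 1, "b": 10, "c": 2, "d": 3}, k=2, max_age_days=7)
cov = coverage({"u1": ["a", "b"], "u2": ["b", "c"]}, ["a", "b", "c", "d"])
```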
  37. 37. Future work Classifiers for listing descriptions Deep learning on listing images Structured streaming on Spark 2.0 Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy Real-time scoring
  38. 38. Thank you! Come join us @ Zillow Group! Hiring: - SDE, ML, Data Scientist - Big Data Engineer - Analytic Engineer - Product Management
  39. 39. Remember to complete your evaluations!
  40. 40. Related Sessions