DutchMLSchool. Introduction to Machine Learning with the BigML Platform

Introduction to Machine Learning with the BigML Platform - ML for Executives Course.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.

Published in: Data & Analytics

  1. 1. 1st edition | July 8-11, 2019
  2. 2. BigML, Inc #DutchMLSchool Introduction to BigML: Making Machine Learning Beautifully Simple. Poul Petersen, CIO, BigML, Inc
  3. 3. Sampling the Audience
      • Expert: Published papers at KDD, ICML, NIPS, etc., or developed own ML algorithms used at large scale
      • Aficionado: Understands the pros and cons of different techniques and/or can tweak algorithms as needed
      • Practitioner: Very familiar with ML packages (Weka, Scikit, BigML, etc.)
      • Newbie: Just taking a Coursera ML class or reading an introductory book on ML
      • Absolute beginner: ML sounds like science fiction
  4. 4. A Present for You
  5. 5. Free 1-Month PRO Subscription: https://bigml.com/accounts/register/ dutchmlschool
  6. 6. A Brief History of BigML
      • BigML Mission: To make Machine Learning Beautifully Simple
      • BigML was founded in Corvallis, Oregon in 2011, long before ML was "cool"
      • You’ve never heard of it?
      • Most innovative city in the United States!
  7. 7. A Brief History of BigML
  8. 8. BigML Platform (architecture diagram)
      • Web-based Frontend
      • Visualizations
      • Distributed Machine Learning Backend: SOURCE SERVER, DATASET SERVER, MODEL SERVER, PREDICTION SERVER, EVALUATION SERVER, SAMPLE SERVER, WHIZZML SERVER
      • Tools - https://bigml.com/tools
      • REST API - https://bigml.com/api
      • Smart Infrastructure (auto-deployable, auto-scalable): SERVERS, EVENTS, GEARMAN QUEUE, DESIRED TOPOLOGY, AWS COSTS, RUNQUEUE SCALER, BUSY SCALER, AUTO TOPOLOGY, ACTUAL TOPOLOGY, MESSAGE QUEUE
  9. 9. BigML Platform: the same architecture diagram as slide 8, with one added label: On-Premises
  10. 10. Machine Learning Motivation
      Imagine:
      • You are looking to buy a house
      • You recently found a house you like
      • Is the asking price fair?
      What Next?
  11. 11. Machine Learning Motivation
      Why not ask an expert?
      • Experts can be rare / expensive
      • Hard to validate experience:
          • Experience with similar properties?
          • Do they consider all relevant variables?
          • Knowledge of market up to date?
      • Hard to validate answer:
          • How many times was the expert right / wrong?
          • Probably can’t explain the decision in detail
      • Humans are not good at intuitive statistics
  12. 12. Data vs Expert
      Replace the expert with data?
      • Intuition: square footage relates to price.
      • Collect data from past sales

      SQFT    SOLD      PREDICT
      2424    360000    400262
      1785    307500    320195
      1003    185000    222211
      4135    600000    614651
      1676    328500    306538
      1012    247000    223339
      3352    420000    516541
      2825    435350    450508

      PRICE = 125.3*SQFT + 96535
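The PREDICT column can be reproduced by applying the slide's line to each SQFT value. The sketch below does that and also refits a least-squares line to the eight rows shown. The helper names (`slide_price`, `fit_line`, `sse`) are made up for illustration, and the refit on only these eight rows need not reproduce the slide's coefficients, which may have been fitted on a larger set of sales.

```python
# Past sales from the slide: (square feet, sold price in USD).
SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]


def slide_price(sqft):
    """Line quoted on the slide: PRICE = 125.3*SQFT + 96535."""
    return 125.3 * sqft + 96535


def fit_line(points):
    """Ordinary least squares for y = slope*x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sse(predict, points):
    """Sum of squared errors of a predictor over the points."""
    return sum((predict(x) - y) ** 2 for x, y in points)


slope, intercept = fit_line(SALES)


def refit_price(sqft):
    """Line refitted on the eight rows above only."""
    return slope * sqft + intercept


if __name__ == "__main__":
    for sqft, sold in SALES:
        print(sqft, sold, round(slide_price(sqft)), round(refit_price(sqft)))
```

Comparing `sse(slide_price, SALES)` with `sse(refit_price, SALES)` gives a rough sense of how well each line describes these particular sales.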
  13. 13. Data vs Expert
      Replace the expert scorecard
      • Experts can be rare / expensive
      • Hard to validate experience:
          • Experience with similar properties?
          • Do they consider all relevant variables?
          • Knowledge of market up to date?
      • Hard to validate answer:
          • How many times was the expert right / wrong?
          • Probably can’t explain the decision in detail
      • Humans are not good at intuitive statistics
  14. 14. Data vs Expert
      Replace the expert with data
      • Intuition: square footage relates to price.
      • Collect data from past sales

      SQFT    SOLD
      2424    360000
      1785    307500
      1003    185000
      4135    600000
      1676    328500
      1012    247000
      3352    420000
      2825    435350

      PRICE = 125.3*SQFT + 96535
  15. 15. More Data!

      SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
      2424  4     3      1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44,594828   -123,269328  360000
      1785  3     2      7360 NW Valley Vw   Country Estates    25700     1979        2              44,643876   -123,238189  307500
      1003  2     1      2620 NW Chinaberry  Tamarack Village   4792      1978        2              44,593704   -123,295424  185000
      4135  5     3,5    4748 NW Veronica    Suncrest           6098      2004        3              44,5929659  -123,306916  600000
      1676  3     2      2842 NW Monterey    Corvallis          8712      1975        2              44,5945279  -123,291523  328500
      1012  3     1      2320 NW Highland    Corvallis          9583      1959        2              44,591476   -123,262841  247000
      3352  4     3      1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44,579439   -123,333888  420000
      2825  3            411 NW 16th         Wilkins Addition   4792      1938        1              44,570883   -123,272113  435350

      Uhhhh……..
      • Can we still fit a line to 10 variables? (well, yes)
      • Will fitting a line give good results? (unlikely)
      • What about those text fields and categorical values?
  16. 16. Modeling Home Prices
  17. 17. What just happened? Home Data (Square Feet? Location?) -> Model -> Prediction: Price=418K
  18. 18. Some Terminology…
      Diagram: Home Data -> Model -> Prediction: Price=418K
      • Labels: Training Data, ML Resource, ML Platform
      • Tasks: Modeling, Clustering, Anomaly Detection, Association Discovery
      • “Consume” the model or “put into production”: Dashboard, Custom Application, Wearable / Edge device, Batch Process
  19. 19. Model Choices
      • A single Decision Tree was easy to understand, but could we build something stronger?
      • There are actually hundreds of algorithms…
  20. 20. Model Choices
  21. 21. Model Choices
      • A single Decision Tree was easy to understand, but could we build something stronger?
      • There are actually hundreds of algorithms…
      • BigML carefully implements the best in terms of interpretability and the ability to work with real-world data:
          • Linear Regression
          • Logistic Regression
          • Single Decision Trees
          • Decision Forest / Random Decision Forest
          • Boosted Trees
          • Deepnets (wait - those are hard, right?)
  22. 22. Deepnets are Hard, Right?
      Diagram: a neural network with 4 input features (x1 to x4), several hidden layers of nodes (h1, h2, h3, …), and 3 output classes (y1 to y3)
      Open question on the slide: h1 = activation?(wx, x) ?
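To make the diagram concrete, here is a minimal sketch of one forward pass through a network with 4 input features, one hidden layer of 5 nodes, and 3 output classes. The choice of ReLU and softmax activations, the single hidden layer, and the random untrained weights are all assumptions for illustration; the slides do not say how BigML Deepnets are configured internally.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the weights into node j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Assumed hidden-layer activation: max(0, v) per node."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw output scores into class probabilities."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Hypothetical untrained weights, only to make the sketch runnable."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
hidden_w, hidden_b = random_layer(4, 5, rng)  # 4 features -> 5 hidden nodes
output_w, output_b = random_layer(5, 3, rng)  # 5 hidden nodes -> 3 classes


def forward(features):
    """Inputs -> hidden layer -> class probabilities."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))


probabilities = forward([0.5, -1.2, 3.0, 0.1])
```

Every choice made here by hand (layer sizes, activation, weights) is one of the open questions the slide raises.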
  23. 23. BigML Deepnets
      • The success of a Deepnet depends on getting the right network structure for the dataset
      • But there are too many parameters:
          • Nodes, layers, activation function, learning rate, etc.
          • And setting them takes significant expert knowledge
      • Solution: Metalearning (a good initial guess)
      • Solution: Network search (try a bunch)
  24. 24. Model Choices
  25. 25. Choosing the Algorithm
      Chart axes: Increasing Data Size / Complexity versus Decreasing Interpretability / Better Representation / Longer Training
      • Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance)
      • Algorithms plotted: Single Tree Model, Logistic Regression, Decision Forest, Random Decision Forest, Boosted Trees, Deepnets
      STILL TOO HARD?
  26. 26. OptiML
      • Each resource has several parameters that impact quality
          • Number of trees, missing splits, nodes, weight
      • Rather than trial and error, we can use ML to find ideal parameters
      • Why not make the model type (Decision Tree, Boosted Tree, etc.) a parameter as well?
      • Similar to Deepnet network search, but finds the optimum machine learning algorithm and parameters for your data automatically
      • Outputs the top-performing algorithms and parameters for your data…
      Why use just one “best” result?
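The idea of treating the model type itself as a parameter can be sketched as a small search over candidate model families, each scored by leave-one-out validation and then ranked. This is only an illustration of the concept, not OptiML's actual search procedure; the candidate list, the error measure, and all function names are assumptions, and the data is the eight past sales from the "Data vs Expert" slide.

```python
# Past sales from the "Data vs Expert" slide: (square feet, sold price in USD).
SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]


def fit_mean(train):
    """Baseline: always predict the average price."""
    mean_price = sum(y for _, y in train) / len(train)
    return lambda sqft: mean_price


def fit_line(train):
    """Least-squares line on square footage."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda sqft: slope * sqft + intercept


def fit_knn(k):
    """k-nearest neighbors on square footage; k is a tunable parameter."""
    def fit(train):
        def predict(sqft):
            nearest = sorted(train, key=lambda sale: abs(sale[0] - sqft))[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit


# Model type and its parameters are both part of the search space.
CANDIDATES = {
    "mean baseline": fit_mean,
    "linear regression": fit_line,
    "1-nearest neighbor": fit_knn(1),
    "3-nearest neighbors": fit_knn(3),
}


def leave_one_out_mae(fit, data):
    """Mean absolute error when each sale is predicted from the others."""
    errors = []
    for i, (sqft, sold) in enumerate(data):
        model = fit(data[:i] + data[i + 1:])
        errors.append(abs(model(sqft) - sold))
    return sum(errors) / len(errors)


def rank_candidates(data):
    """Return (error, name) pairs, best first."""
    return sorted((leave_one_out_mae(fit, data), name)
                  for name, fit in CANDIDATES.items())


ranking = rank_candidates(SALES)
```

Returning the whole ranking rather than a single winner matches the slide's closing question about why one would use just one "best" result.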
  27. 27. Fusions
      • Similar to an Ensemble, but we can mix different model types
          • Logistic Regression plus a Deepnet, for example
      • You can also create a Fusion with different training sets!
          • Last week’s data plus last month’s data, etc.
      • Or a Fusion of OptiML models
          • Combines the “best of the best”
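As a rough sketch of mixing different model types, the code below combines two unrelated predictors by a weighted average of their outputs. Averaging is an assumed combination rule chosen for simplicity, and `line_model`, `neighbor_model`, and `fuse` are hypothetical names; the slides do not describe how a BigML Fusion combines its components.

```python
# Past sales from the "Data vs Expert" slide: (square feet, sold price in USD).
SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]


def line_model(sqft):
    """Model type 1: the line quoted on the slide."""
    return 125.3 * sqft + 96535


def neighbor_model(sqft, k=3):
    """Model type 2: average price of the k most similar past sales."""
    nearest = sorted(SALES, key=lambda sale: abs(sale[0] - sqft))[:k]
    return sum(price for _, price in nearest) / k


def fuse(models, weights=None):
    """Combine predictors by a weighted average of their predictions."""
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)

    def predict(sqft):
        return sum(w * m(sqft) for m, w in zip(models, weights)) / total

    return predict


fusion = fuse([line_model, neighbor_model])
estimate = fusion(2000)
```

Passing `weights=[3, 1]` would pull the fused estimate toward the first model, which is one way to favor a component that validated better.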
  28. 28. OptiML & Fusions
  29. 29. ML Workflows
      Workflow diagram nodes: SOLD HOMES, FORSALE HOMES, FILTER, NEW FEATURES, DATASET, MODEL, BATCH PREDICTION, DEALS
      • Real-world ML applications are workflows!
      • They often require unsupervised learning!
  30. 30. Let’s build a recommender: Typical way to shop for a home…
  31. 31. Recommender Idea
      Diagram labels: All Homes Forsale, Sample (? ? ? ?), Preference Data, Preference Model
      … then use the Preference Model to filter all the homes on the market
  32. 32. What if there are really unusual homes in the data?
      • A mansion with 20 bathrooms
      • A home with no bedrooms
      • A lot size that is smaller than the home?
      We don’t want to show these as suggestions because they are unusual…
      How do we detect anomalies?
  33. 33. Anomaly Detection
  34. 34. What just happened?
      • We wanted to find and remove unusual houses.
      • We created an Anomaly Detector and examined the top anomalies.
      • We found some unusual houses to remove and discovered bad data (missing values) that we want to fix.
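Ranking homes by how unusual they are can be sketched with a much simpler stand-in than a full anomaly detector: score each home by how far its fields sit from the median, measured in median absolute deviations. This is not the algorithm behind BigML's Anomaly Detector, which the slides do not describe. The listings and the `robust_scores` helper are hypothetical.

```python
from statistics import median

# Hypothetical listings (beds, baths, sqft, lot size in sqft); not from the slides.
HOMES = {
    "typical A": (3, 2, 1800, 7000),
    "typical B": (4, 3, 2400, 8500),
    "typical C": (3, 1, 1200, 5000),
    "typical D": (4, 2, 2100, 9000),
    "typical E": (2, 1, 1000, 4800),
    "mansion": (9, 20, 12000, 60000),
    "no bedrooms": (0, 1, 900, 5000),
}


def robust_scores(rows):
    """Average per-field distance from the median, in units of MAD."""
    columns = list(zip(*rows))
    centers = [median(col) for col in columns]
    spreads = []
    for col, center in zip(columns, centers):
        mad = median(abs(v - center) for v in col)
        spreads.append(mad or 1.0)  # avoid dividing by zero on constant fields
    return [
        sum(abs(v - c) / s for v, c, s in zip(row, centers, spreads)) / len(row)
        for row in rows
    ]


names = list(HOMES)
scores = robust_scores([HOMES[name] for name in names])

# Most anomalous first, like examining the "top anomalies" in the demo.
ranked = [name for _, name in sorted(zip(scores, names), reverse=True)]
```

A home with a missing field would need separate handling before scoring, which is the bad-data problem the demo uncovered.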
  35. 35. A clever way to fix missing data
      Let’s use Machine Learning…

      SQFT     PRICE         BEDS    BATHS
      3.125    US$530.000    5       3
      2.100    US$460.000            2
      1.200    US$250.000    3
      3.950    US$610.000    6       4

      Predicted values for the blanks: BEDS = 4, BATHS = 1.5
  36. 36. WhizzML
  37. 37. What just happened?
      • We had a Dataset with missing values.
      • We wanted to apply an algorithm to fix the missing values with Machine Learning.
      • Rather than write the algorithm, we found what we needed in the WhizzML public gallery.
      • Now that we have cloned the Script we can use it again and again.
      • We can write new ones too!
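The general idea of filling missing values with a model can be sketched as follows: train a predictor on the rows where a field is present, then use it to fill the rows where it is blank. This is a stand-in for the WhizzML gallery script, whose contents are not shown in the slides. It assumes a simple least-squares line on SQFT as the predictor, and it assumes the blanks in the slide's example are BEDS in the second row and BATHS in the third.

```python
# Example rows from the missing-data slide; None marks an assumed blank cell.
ROWS = [
    {"sqft": 3125, "price": 530000, "beds": 5, "baths": 3},
    {"sqft": 2100, "price": 460000, "beds": None, "baths": 2},
    {"sqft": 1200, "price": 250000, "beds": 3, "baths": None},
    {"sqft": 3950, "price": 610000, "beds": 6, "baths": 4},
]


def fit_line(points):
    """Ordinary least squares for y = slope*x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute(rows, target, feature="sqft"):
    """Fill missing `target` values with predictions from complete rows."""
    known = [(r[feature], r[target]) for r in rows if r[target] is not None]
    slope, intercept = fit_line(known)
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay untouched
        if row[target] is None:
            row[target] = slope * row[feature] + intercept
        filled.append(row)
    return filled


completed = impute(impute(ROWS, "beds"), "baths")
```

With only three complete rows per field the estimates are crude, and they need not match the values shown on the slide; a real workflow would train on the full dataset and use more than one input field.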
  38. 38. Recommender Problem #2
      • How can we avoid showing essentially the same house over and over?
      Diagram labels: All Homes, Sample (? ? ?), Modern
  39. 39. Recommender Problem #2
      • How can we avoid showing essentially the same house over and over?
      Diagram labels: All Homes, Modern, Lots of Land, ? sample, ? sample
      • Great! What if we don’t know how to group them? Or how many groups?
  40. 40. Clustering
  41. 41. What just happened?
      • Since we don’t know how many groups of homes there should be, we used G-means Clustering to find the optimum number of groups of homes
      • Our recommender will use these groups to create a better sampling for user preference
      • We also tried to understand the home clusters using “model clusters” but the models were difficult to interpret
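The grouping step can be illustrated with plain k-means, which G-means builds on. The sketch below fixes the number of groups by hand, so it does not implement the part of G-means that chooses the number of clusters. The data points are hypothetical homes described by (square feet in thousands, lot size in acres), and the naive initialization is an assumption made to keep the example deterministic.

```python
import math

# Hypothetical homes: (square feet in thousands, lot size in acres).
HOMES = [
    (1.2, 0.10), (1.5, 0.15), (1.8, 0.12),  # smaller homes on small lots
    (3.5, 2.00), (4.0, 2.50), (3.8, 1.80),  # larger homes with lots of land
]


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, iterations=20):
    """Plain k-means with a fixed k and a naive first-k initialization."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels


centroids, labels = k_means(HOMES, k=2)
```

Sampling one or two homes per label, rather than from the whole list, is the "better sampling for user preference" the slide refers to.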
  42. 42. Understanding Clusters Better
      What if we could get rules like…
      If SQFT >= 3,125 THEN “Cluster 1”

      SQFT     PRICE         BEDS    BATHS    CLUSTER
      3.125    US$530.000    5       3        Cluster 1
      2.100    US$460.000    4       2        Cluster 3
      1.200    US$250.000    3       1,5      Cluster 5
      3.950    US$610.000    6       4        Cluster 1
  43. 43. Association Discovery
  44. 44. What just happened?
      • We used a Batch Centroid to add the Cluster assignment of each home as a feature to the Dataset
      • We used Association Discovery to find “interesting” relationships between the features, including the Cluster assignment
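One common way to judge whether a rule such as the one on the "Understanding Clusters Better" slide is "interesting" is to compute its support, confidence, and lift. The sketch below does so for the rule SQFT >= 3125 implies Cluster 1, using the four example rows from that slide. Which measures BigML's Association Discovery reports is not stated in the slides, so the choice of these three and the `rule_metrics` helper are assumptions.

```python
# The four example rows from the "Understanding Clusters Better" slide.
HOMES = [
    {"sqft": 3125, "cluster": "Cluster 1"},
    {"sqft": 2100, "cluster": "Cluster 3"},
    {"sqft": 1200, "cluster": "Cluster 5"},
    {"sqft": 3950, "cluster": "Cluster 1"},
]


def rule_metrics(rows, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(rows)
    both = sum(1 for r in rows if antecedent(r) and consequent(r))
    ante = sum(1 for r in rows if antecedent(r))
    cons = sum(1 for r in rows if consequent(r))
    support = both / n                            # how often the rule applies and holds
    confidence = both / ante if ante else 0.0     # how often it holds when it applies
    lift = confidence / (cons / n) if cons else 0.0  # gain over the base rate
    return support, confidence, lift


support, confidence, lift = rule_metrics(
    HOMES,
    antecedent=lambda home: home["sqft"] >= 3125,
    consequent=lambda home: home["cluster"] == "Cluster 1",
)
```

A lift above 1 means the antecedent makes the cluster more likely than its base rate, which is what makes a rule worth reading.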
  45. 45. Recommender Problem #3
      There is much more interesting information than just the number of BEDS, BATHS, etc.
      • Unfortunately, these "remarks" are not available in the Redfin download
      • Adding them to our dataset requires crawling the website
      • Like most ML projects, preparing the data is 80% of the difficulty (fortunately I already did it!)
  46. 46. Topic Modeling
  47. 47. What just happened?
      • We extended the home dataset with the syndicated remarks text field
      • We built a model to predict sale price and explored how keywords discovered in the remarks impacted price
      • We used topic modeling to create a deeper thematic understanding of the remarks
          • Homes that are "in-town" or "out-of-town"
      • We extended the dataset with fields that represent how related each home is to each of these topics
      • This will allow our clustering to group homes by a deeper meaning than just BEDS, BATHS, etc.
      • Is there a better way to capture “locality”?
  48. 48. Idea: Better Feature (diagram labels: Worth More, Worth Less)
  49. 49. A Better Feature for Home Prices

      LATITUDE     LONGITUDE      REFERENCE LATITUDE    REFERENCE LONGITUDE    Distance (m)
      44,583       -123,296775    44,5638               -123,2794              700
      44,604414    -123,296129    44,5638               -123,2794              30,4
      44,600108    -123,29707     44,5638               -123,2794              19,38
      44,603077    -123,295004    44,5638               -123,2794              37,8
      44,589587    -123,301154    44,5638               -123,2794              23,39
  50. 50. Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula
  51. 51. Feature Engineering
  52. 52. What just happened?
      • We wanted to create a new feature “distance from OSU”
      • This is possible with Flatline, a DSL for feature engineering
      • Rather than writing the code for the coordinate transformation, we found a ready-made script shared in the WhizzML gallery
      • We cloned the script and transformed the dataset
      • This can be easily repeated with new datasets: fresh data or different cities
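The distance feature itself comes from the haversine formula referenced on slide 50. The sketch below computes it in Python rather than Flatline, so it is not the gallery script used in the demo. The mean Earth radius of 6,371 km is an assumption, and the coordinates are the first row and the reference point from slide 49, written with decimal points.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Reference point and first home from the "A Better Feature" slide.
REFERENCE = (44.5638, -123.2794)
home = (44.583, -123.296775)

distance = haversine_m(*home, *REFERENCE)
```

Applying `haversine_m` to every row against the same reference point yields the new column, and changing `REFERENCE` is all that is needed to reuse it for a different city.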
  53. 53. Recommender Idea (diagram labels: Modern, Lots of Land, Small, Preference Data, Preference Model)
  54. 54. House Recommender
  55. 55. Co-organized by: Sponsor: Business Partners:
