BSSML16 L7. Feature Engineering

Brazilian Summer School in Machine Learning 2016
Day 2 - Lecture 2: Feature Engineering
Lecturer: Poul Petersen (BigML)

Published in: Data & Analytics

  1. December 8-9, 2016
  2. Feature Engineering: Creating Machine Learning Ready Data. Poul Petersen, CIO, BigML, Inc.
  3. Machine Learning Secret: “…the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms.” (Facebook FBLearner, 2016). Feature Engineering: applying domain knowledge of the data to create features that make machine learning algorithms work better, or work at all.
  4. Obstacles
      • Data Structure: scattered across systems; wrong "shape"; unlabelled data
      • Data Value: format (spelling, units); missing values; non-optimal correlation; non-existent correlation
      • Data Significance: unwanted (PII, non-preferred); expensive to collect; insidious (leakage, obviously correlated)
      Stages: Data Transformation, Feature Engineering, Feature Selection
  5. Automatic Date Transformation: a DATE-TIME field is expanded into separate fields.
      Input (DATE-TIME): 2013-09-25 10:02
      year (NUM)  month (CAT)  day (NUM)  hour (NUM)  minute (NUM)
      2013        Sep          25         10          2
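The date expansion on slide 5 can be sketched in Python. This is a minimal illustration, not BigML's implementation; the function name `expand_datetime` and the input format string are assumptions made for the example.

```python
from datetime import datetime

# Fixed month abbreviations, so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def expand_datetime(value: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM' string into separate features (hypothetical helper)."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return {
        "year": parsed.year,                 # numeric
        "month": MONTHS[parsed.month - 1],   # categorical
        "day": parsed.day,                   # numeric
        "hour": parsed.hour,                 # numeric
        "minute": parsed.minute,             # numeric
    }

features = expand_datetime("2013-09-25 10:02")
```

Each key becomes its own column, so a model can pick up patterns such as hour-of-day effects that are hidden inside a single timestamp string.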
  6. Automatic Categorical Transformation: a CAT field becomes one NUM indicator field per category.
      alchemy_category (CAT)  business (NUM)  health (NUM)  recreation (NUM)
      business                1               0             0
      recreation              0               0             1
      health                  0               1             0
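The categorical expansion on slide 6 is commonly called one-hot encoding. A minimal Python sketch, assuming the categories are simply the distinct values seen in the column (the helper name `one_hot` is hypothetical):

```python
def one_hot(values):
    """Return one dict per row with a 0/1 indicator for every category seen."""
    categories = sorted(set(values))
    return [{category: int(value == category) for category in categories}
            for value in values]

rows = one_hot(["business", "recreation", "health"])
```

Here `rows[0]` marks `business` with 1 and the other two categories with 0, mirroring the first row of the slide's table.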
  7. Automatic Text Transformation: a TEXT field becomes one NUM count field per term.
      Input (TEXT): "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ’em."
      great (NUM)  afraid (NUM)  born (NUM)  achieve (NUM)
      4            1             1           1
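The term counts on slide 7 imply some form of stemming, since "greatness" is counted under "great". The sketch below reproduces those counts with a deliberately crude, hypothetical rule that strips only the suffix "ness"; a real text analyzer would use a proper stemmer and stop-word handling.

```python
import re
from collections import Counter

SUFFIXES = ("ness",)  # toy stemming rule, an assumption for this example only

def crude_stem(token):
    """Strip a known suffix when enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_counts(text):
    """Lowercase, split into alphabetic tokens, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)

quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = term_counts(quote)
```

Selecting the terms `great`, `afraid`, `born`, and `achieve` from `counts` gives the four numeric columns shown on the slide.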
  8. Fixing "non-optimal correlations": a single TEXT field holding JSON is split into separate title and body TEXT fields.
      Input (TEXT): { "url": "cbsnews", "title": "Breaking News Headlines Business Entertainment World News ", "body": " news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more." }
      title (TEXT)    body (TEXT)
      Breaking News…  news covering…
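Splitting the JSON blob on slide 8 into its own fields might look like the following in Python. The helper name `split_json_field` and the choice to trim whitespace are assumptions for illustration.

```python
import json

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News ", '
       '"body": " news covering all the latest breaking national and world news '
       'headlines, including politics, sports, entertainment, business and more."}')

def split_json_field(raw_value, keys=("title", "body")):
    """Parse a JSON string and return the requested keys as separate text fields."""
    record = json.loads(raw_value)
    # Missing or null keys become empty strings so every row has the same columns.
    return {key: (record.get(key) or "").strip() for key in keys}

fields = split_json_field(raw)
```

With `title` and `body` as separate text fields, each can be analyzed on its own instead of being diluted by JSON punctuation and key names.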
  9. Discretization: a NUM field is binned into a CAT field.
      Total Spend (NUM)  Spend Category (CAT)
      7.342,99           Top 33%
      304,12             Bottom 33%
      4,56               Bottom 33%
      345,87             Middle 33%
      8.546,32           Top 33%
      Before: “Predict will spend $3,521 with error $1,232”. After: “Predict customer will be Top 33% in spending”.
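One way to sketch the discretization on slide 9 is to rank the values and cut the ranks into thirds. The rank-based rule and the name `tercile_labels` are assumptions; the slide does not say how the bin boundaries were chosen, and on a real dataset they would be computed over all rows rather than five.

```python
def tercile_labels(values):
    """Label each value Bottom/Middle/Top 33% by its rank among all values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, index in enumerate(order):
        # Relative rank in [0, 1]; a single value is treated as rank 0.
        fraction = rank / (len(values) - 1) if len(values) > 1 else 0.0
        if fraction < 1 / 3:
            labels[index] = "Bottom 33%"
        elif fraction > 2 / 3:
            labels[index] = "Top 33%"
        else:
            labels[index] = "Middle 33%"
    return labels

spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
categories = tercile_labels(spend)
```

The trade-off is the one the slide points to: the model gives up a precise dollar estimate in exchange for a coarser target that is easier to predict and to act on.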
  10. Combinations of Multiple Features: BMI = Kg / M2
      Kg (NUM)  M2 (NUM)  BMI (NUM)
      101,4     3,24      31,29
      85,2      2,8       30,42
      56,2      2,9       19,38
      136,1     3,6       37,81
      95,9      4,1       23,39
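The BMI column on slide 10 is the ratio of the two input columns. A minimal Python sketch, assuming the M2 column already holds height squared in square metres (which is what the slide's values imply):

```python
def bmi(weight_kg, height_m2):
    """Body mass index from weight (kg) and squared height (m^2)."""
    return weight_kg / height_m2

rows = [(101.4, 3.24), (85.2, 2.8), (56.2, 2.9), (136.1, 3.6), (95.9, 4.1)]
bmis = [round(bmi(kg, m2), 2) for kg, m2 in rows]
```

The computed values agree with the slide to within 0.01; the small differences in the last digit come from rounding versus truncation.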
  11. Flatline
      • BigML’s Domain-Specific Language (DSL) for transforming datasets
      • Limited programming language structures: let, cond, if, maps, list operators, */+-
      • Dataset fields are first-class citizens: (field "diabetes pedigree")
      • Built-in transformations: statistics, strings, timestamps, windows
  12. Basic Transformations: Data Labelling. Data may not have the labels needed for classification; create specific metrics for adding labels.
      Unlabelled data:
        Name        Month - 3  Month - 2  Month - 1
        Joe Schmo   123,23     0          0
        Jane Plain  0          0          0
        Mary Happy  0          55,22      243,33
        Tom Thumb   12,34      8,34       14,56
      Labelled data:
        Name        Month - 3  Month - 2  Month - 1  Default
        Joe Schmo   123,23     0          0          FALSE
        Jane Plain  0          0          0          TRUE
        Mary Happy  0          55,22      243,33     FALSE
        Tom Thumb   12,34      8,34       14,56      FALSE
      Flatline: (= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
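The labelling rule on slide 12 marks a customer as a default when the absolute payments over the three months sum to zero. A Python sketch of the same rule, with the hypothetical helper name `is_default`:

```python
def is_default(payments):
    """True when the absolute payments over all months sum to zero."""
    return sum(abs(payment) for payment in payments) == 0

customers = {
    "Joe Schmo": [123.23, 0, 0],
    "Jane Plain": [0, 0, 0],
    "Mary Happy": [0, 55.22, 243.33],
    "Tom Thumb": [12.34, 8.34, 14.56],
}
labels = {name: is_default(payments) for name, payments in customers.items()}
```

Only Jane Plain is labelled TRUE, matching the Default column in the labelled table.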
  13. Shock: Deviations from a Trend. Shock = (Current - (4-day avg)) / std dev
      Flatline: (/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
      date  volume  price
      1     34353   314
      2     44455   315
      3     22333   315
      4     52322   321
      5     28000   320
      6     31254   319
      7     56544   323
      8     44331   324
      9     81111   287
      10    65422   294
      11    59999   300
      12    45556   302
      13    19899   301
      14    21453   302
      Window values for dates 1 to 7:
      date  day-4  day-3  day-2  day-1  4davg
      1                                 -
      2                          314    -
      3                   314    315    -
      4            314    315    315    -
      5     314    315    315    321    316,25
      6     315    315    321    320    317,75
      7     315    321    320    319    318,75
  14. Shock: Deviations from a Trend. Shock = (Current - (4-day avg)) / std dev
      Flatline: (/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
      Current: (field "price")
      4-day avg: (avg-window "price" -4 -1)
      std dev: (standard-deviation "price")
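The shock feature can be sketched in Python using the price column from slide 13. Two assumptions are made here and are not confirmed by the slides: rows with fewer than four previous days get no value, and the standard deviation is the population standard deviation over the whole column.

```python
from statistics import mean, pstdev

def shock(prices, window=4):
    """(current price - mean of the previous `window` prices) / std dev of all prices."""
    spread = pstdev(prices)  # assumption: population std dev over the full column
    scores = []
    for i, price in enumerate(prices):
        if i < window or spread == 0:
            scores.append(None)  # assumption: undefined without a full window
        else:
            trailing_avg = mean(prices[i - window:i])
            scores.append((price - trailing_avg) / spread)
    return scores

prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]
scores = shock(prices)
```

For date 5 the trailing average is 316.25, as in the slide's table, and the largest negative shock falls on date 9, where the price drops from 324 to 287.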
  15. Moon Phase%
      Flatline: (/ (mod (- (/ (epoch (field {{date-field}})) 1000) 621300) 2551443) 2551442)
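The moon-phase expression translates almost directly into Python. The constants below are copied verbatim from the slide; 2551443 seconds is approximately one synodic month (about 29.53 days). The function name and the assumption that `epoch` yields milliseconds (hence the division by 1000) are inferred from the expression, not stated on the slide.

```python
from datetime import datetime, timezone

OFFSET_SECONDS = 621300    # reference offset, copied from the slide
PERIOD_SECONDS = 2551443   # roughly one synodic month in seconds
DIVISOR = 2551442          # divisor as written on the slide

def moon_phase_fraction(moment):
    """Fraction of the lunar cycle elapsed at `moment`, following the slide's formula."""
    epoch_ms = moment.timestamp() * 1000   # assumed to match Flatline's epoch
    seconds = epoch_ms / 1000
    return ((seconds - OFFSET_SECONDS) % PERIOD_SECONDS) / DIVISOR

reference = moon_phase_fraction(datetime.fromtimestamp(621300, tz=timezone.utc))
sample = moon_phase_fraction(datetime(2016, 12, 8, tzinfo=timezone.utc))
```

The result is a number between 0 and about 1 that a model can use as a cyclic feature; no claim is made here about its astronomical accuracy.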
  16. Fixing "non-existent correlations"
      Highway Number  Direction    Is Long
      2               East-West    FALSE
      4               East-West    FALSE
      5               North-South  TRUE
      8               East-West    FALSE
      10              East-West    TRUE
      …               …            …
      Goal: Predict principal direction from highway number
      Flatline: (= (mod (field "Highway Number") 2) 0)
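The highway example works because the parity of the number, not its magnitude, carries the signal. A Python sketch of the same derived feature, with the hypothetical name `is_even_numbered`:

```python
def is_even_numbered(highway_number):
    """True for even highway numbers, which are East-West in the slide's table."""
    return highway_number % 2 == 0

highways = [2, 4, 5, 8, 10]
even_flags = [is_even_numbered(number) for number in highways]
```

In the slide's sample, every even-numbered highway runs East-West and the odd one runs North-South, so this single boolean separates the classes where the raw number could not.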
  17. Fix Missing Values in a “Meaningful” Way
      Workflow: Original Dataset → Filter Zeros → Clean Dataset → Model insulin → Predict insulin → Amended Dataset → Select insulin → Fixed Dataset
      Flatline: (if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
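The imputation workflow on slide 17 can be sketched end to end in Python. Everything specific here is an assumption for illustration: the toy rows, the use of a single `glucose` predictor, and a one-variable least-squares line standing in for whatever model is actually trained.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (needs 2+ distinct xs)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fill_missing_insulin(rows):
    """Replace insulin == 0 with a prediction from rows where insulin is known."""
    clean = [row for row in rows if row["insulin"] != 0]          # filter zeros
    slope, intercept = fit_line([row["glucose"] for row in clean],
                                [row["insulin"] for row in clean])  # model insulin
    fixed = []
    for row in rows:
        predicted = slope * row["glucose"] + intercept             # predict insulin
        insulin = predicted if row["insulin"] == 0 else row["insulin"]  # select
        fixed.append({**row, "insulin": insulin})
    return fixed

# Hypothetical toy data, not taken from the slides.
rows = [
    {"glucose": 100, "insulin": 50},
    {"glucose": 150, "insulin": 100},
    {"glucose": 200, "insulin": 150},
    {"glucose": 120, "insulin": 0},   # zero stands for a missing measurement
]
fixed = fill_missing_insulin(rows)
```

The final selection step is the Python counterpart of the slide's `if` expression: keep the measured value when it exists and fall back to the prediction otherwise.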
  18. Feature Selection
  19. Feature Selection
      • Model Summary: Field Importance
      • Algorithmic: Best-First Feature Selection; Boruta
      • Leakage: Tight Correlations (AD, Plot, Correlations); Test Data; Perfect future knowledge
      Checking for rows shared between training and test data: cat diabetes.csv diabetes_testset.csv | sort | uniq -d | wc -l
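The shell pipeline on slide 19 counts lines that occur more than once across the two files. A rough Python counterpart is sketched below; the function name `count_shared_rows` is hypothetical. Unlike `uniq -d`, it counts only rows present in both inputs and ignores duplicates within a single file.

```python
def count_shared_rows(train_lines, test_lines):
    """Number of distinct rows that appear in both the training and the test data."""
    train_rows = {line.rstrip("\n") for line in train_lines}
    test_rows = {line.rstrip("\n") for line in test_lines}
    return len(train_rows & test_rows)

# Toy rows for illustration; real use would pass open file objects instead.
train = ["6,148,72,1\n", "1,85,66,0\n", "8,183,64,1\n"]
test = ["1,85,66,0\n", "0,137,40,1\n"]
shared = count_shared_rows(train, test)
```

A non-zero count suggests test rows leaked into training. If both files carry the same header line, it will be counted too, so skip it before calling the function.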
  20. Evaluate & Automate
  21. Evaluate & Automate
      • Evaluate: Did you meet the goal? If not, did you discover something else useful? If not, start over.
      • If you did: Automate. You don’t want to hand-code that every time, right?
      • Consider tools that are easy to automate: scripting interface, APIs
      • Maintainability is important
  22. The Process (flowchart): Define Goal; Data Transform; Feature Engineer & Selection; Model & Evaluate; Goal Met? (yes: Automate; no: Better Features, Better Data, Tune Algorithm, or Not Possible)
