Valencian Summer School in Machine Learning
3rd edition
September 14-15, 2017
VSSML17 L5. Basic Data Transformations and Feature Engineering
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 5: Basic Data Transformations and Feature Engineering. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017

Published in: Data & Analytics

  1. Valencian Summer School in Machine Learning, 3rd edition, September 14-15, 2017
  2. Basic Transformations: Making Data Machine Learning Ready. Poul Petersen, CIO, BigML, Inc
  3. In a Perfect World…
     Q: How does a physicist milk a cow?
     A: Well, first let us consider a spherical cow...
     Q: How does a data scientist build a model?
     A: Well, first let us consider perfectly formatted data…
  4. The Dream: CSV → Dataset → Model → Profit!
  5. The Reality: CRM, Web Accounts, Transactions → ML Ready?
  6. Obstacles
     • Data Structure
       • Scattered across systems
       • Wrong "shape"
       • Unlabelled data
     • Data Value
       • Format: spelling, units
       • Missing values
       • Non-optimal correlation
       • Non-existent correlation
     • Data Significance
       • Unwanted: PII, Non-Preferred
       • Expensive to collect
       • Insidious: Leakage, obviously correlated
     Steps that address these: Data Transformation, Feature Engineering, Feature Selection
  7. The Process
     • Define a clear idea of the goal.
       • Sometimes this comes later…
     • Understand what ML tasks will achieve the goal.
     • Transform the data
       • Where is it, how is it stored?
       • What are the features?
       • Can you access it programmatically?
     • Feature Engineering: transform the data you have into the data you actually need.
     • Evaluate: try it on a small scale
     • Accept that you might have to start over…
     • But when it works, automate it!
  8. Data Transformations
  9. BigML Tasks
     Goal                                                   ML Task
     Will this customer default on a loan?                  Classification
     How many customers will apply for a loan next month?   Regression
     Is the consumption of this product unusual?            Anomaly Detection
     Is the behavior of the customers similar?              Cluster Analysis
     Are these products purchased together?                 Association Discovery
  10. Classification: Categorical. Training, Testing, Predicting
  11. Regression: Numeric. Training, Testing, Predicting
  12. Anomaly Detection
  13. Cluster Analysis
  14. Association Discovery
  15. ML Ready Data
      Tabular data (rows and columns); rows are instances, columns are fields (features):
      • Each row
        • is one instance.
        • contains all the information about that one instance.
      • Each column
        • is a field that describes a property of the instance.
  16. Data Labeling
      • Unsupervised Learning: Anomaly Detection, Clustering, Association Discovery
      • Supervised Learning: Classification, Regression
      The only difference, in terms of ML-Ready structure, is the presence of a "label".
  17. Data Labelling
      Data is often not labeled. Create labels with a transformation.
      Un-labelled data:
      Name        Month - 3  Month - 2  Month - 1
      Joe Schmo   123.23     0          0
      Jane Plain  0          0          0
      Mary Happy  0          55.22      243.33
      Tom Thumb   12.34      8.34       14.56
      Labelled data:
      Name        Month - 3  Month - 2  Month - 1  Default
      Joe Schmo   123.23     0          0          FALSE
      Jane Plain  0          0          0          TRUE
      Mary Happy  0          55.22      243.33     FALSE
      Tom Thumb   12.34      8.34       14.56      FALSE
      Can be done at the Feature Engineering step as well.
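The labelling step in the table above can be sketched in a few lines of Python. This is only an illustration: it assumes the rule stated later in the deck (a customer is in "default" after three missed payments in a row), and the names `MONTHS` and `add_default_label` are made up for this sketch.

```python
# Payments received in each of the last three months (values from the slide).
rows = [
    {"Name": "Joe Schmo", "Month - 3": 123.23, "Month - 2": 0.0, "Month - 1": 0.0},
    {"Name": "Jane Plain", "Month - 3": 0.0, "Month - 2": 0.0, "Month - 1": 0.0},
    {"Name": "Mary Happy", "Month - 3": 0.0, "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb", "Month - 3": 12.34, "Month - 2": 8.34, "Month - 1": 14.56},
]

MONTHS = ["Month - 3", "Month - 2", "Month - 1"]  # hypothetical helper constant


def add_default_label(row):
    """Return a copy of the row with a boolean 'Default' label added."""
    labelled = dict(row)
    # Assumed rule: default means no payment in any of the three months.
    labelled["Default"] = all(row[month] == 0 for month in MONTHS)
    return labelled


labelled_rows = [add_default_label(row) for row in rows]
for row in labelled_rows:
    print(row["Name"], row["Default"])
```

Only Jane Plain has three zero payments, so she is the only row labelled as a default.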
  18. SF Restaurants Example
      https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb
      https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
      create database sf_restaurants;
      use sf_restaurants;
      create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));
      load data local infile './businesses.csv' into table businesses
        fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
      create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));
      load data local infile './inspections.csv' into table inspections
        fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
      create table violations (business_id int, vdate varchar(8), description varchar(1000));
      load data local infile './violations.csv' into table violations
        fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
      create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));
      load data local infile './legend.csv' into table legend
        fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
  19. Transformations Demo #1
  20. Data Cleaning
      Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.
      Original data:
      Name           Date        Duration (s)  Genre    Plays
      Highway star   1984-05-24  -             Rock     139
      Blues alive    1990/03/01  281           Blues    239
      Lonely planet  2002-11-19  5:32s         Techno   42
      Dance, dance   02/23/1983  312           Disco    N/A
      The wall       1943-01-20  218           Reagge   83
      Offside down   1965-02-19  4 minutes     Techno   895
      The alchemist  2001-11-21  418           Bluesss  178
      Bring me down  18-10-98    328           Classic  21
      The scarecrow  1994-10-12  269           Rock     734
      Cleaned data:
      Name           Date        Duration (s)  Genre    Plays
      Highway star   1984-05-24  (empty)       Rock     139
      Blues alive    1990-03-01  281           Blues    239
      Lonely planet  2002-11-19  332           Techno   42
      Dance, dance   1983-02-23  312           Disco    (empty)
      The wall       1943-01-20  218           Reagge   83
      Offside down   1965-02-19  240           Techno   895
      The alchemist  2001-11-21  418           Blues    178
      Bring me down  1998-10-18  328           Classic  21
      The scarecrow  1994-10-12  269           Rock     734
      update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1)
        where instr(description,' [ date violation corrected:') > 0;
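A rough Python sketch of the cleaning shown in the two tables above. The list of date formats, the duration patterns and the set of "missing" markers are assumptions chosen to cover only the values on the slide; real data would need a longer list. The function names are illustrative.

```python
import re
from datetime import datetime

# Assumed input formats, tried in order; the output is always ISO YYYY-MM-DD.
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y"]
MISSING = {"", "-", "N/A"}  # markers treated as missing values


def clean_date(text):
    """Normalize a date string to YYYY-MM-DD, or return None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(text):
    """Convert '281', '5:32s' or '4 minutes' to seconds; None if missing."""
    text = text.strip()
    if text in MISSING:
        return None
    if re.fullmatch(r"\d+", text):
        return int(text)
    match = re.fullmatch(r"(\d+):(\d+)s?", text)
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    match = re.fullmatch(r"(\d+) minutes?", text)
    if match:
        return int(match.group(1)) * 60
    return None


def clean_plays(text):
    """Convert a play count to int, mapping missing markers to None."""
    return None if text.strip() in MISSING else int(text)


print(clean_date("18-10-98"), clean_duration("5:32s"), clean_plays("N/A"))
```

Returning `None` for every missing marker gives the model a single, consistent representation of "no value".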
  21. Transformations Demo #2
  22. Define a Goal
      • Predict rating: Poor / Needs Improvement / Adequate / Good
      • This is a classification problem
      • Based on business profile:
        • Description: kitchen, cafe, etc.
        • Location: zip, latitude, longitude
  23. Denormalizing
      Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized into a single dataset.
      [Figure: the business, inspections and violations tables are joined into a single scores table of instances (millions) by features]
      Denormalize:
      create table scores select * from businesses left join inspections using (business_id);
      create table scores_last select a.* from scores as a
        JOIN (select business_id, max(idate) as idate from scores group by business_id) as b
        where a.business_id=b.business_id and a.idate=b.idate;
      ML-Ready: each row contains all the information about that one instance.
      Add Label:
      create table scores_last_label select scores_last.*, Description as score_label
        from scores_last join legend on score <= Maximum_Score and score >= Minimum_Score;
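The same denormalize-then-label idea can be tried without a MySQL server. The sketch below uses Python's built-in `sqlite3` module with a handful of invented rows; the table contents and the legend thresholds are hypothetical, and it uses an inner join for brevity where the slide's SQL keeps businesses without inspections via a left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table businesses (business_id int, name text);
create table inspections (business_id int, score int, idate text);
create table legend (Minimum_Score int, Maximum_Score int, Description text);
-- Invented sample rows, for illustration only.
insert into businesses values (1, 'Cafe A'), (2, 'Kitchen B');
insert into inspections values
  (1, 95, '20130110'), (1, 72, '20140305'), (2, 88, '20131201');
insert into legend values
  (0, 70, 'Poor'), (71, 85, 'Needs Improvement'),
  (86, 90, 'Adequate'), (91, 100, 'Good');
""")

# One row per business: its most recent inspection plus the matching label.
rows = conn.execute("""
select b.name, i.score, i.idate, l.Description as score_label
from businesses as b
join inspections as i on i.business_id = b.business_id
join (select business_id, max(idate) as idate
      from inspections group by business_id) as last
  on last.business_id = i.business_id and last.idate = i.idate
join legend as l
  on i.score >= l.Minimum_Score and i.score <= l.Maximum_Score
order by b.business_id
""").fetchall()

for row in rows:
    print(row)
```

Because `idate` is stored as a `YYYYMMDD` string, `max(idate)` picks the latest inspection by plain string comparison.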
  24. Transformations Demo #3
  25. Structuring Output
      After all the data transformations, a CSV ("Comma-Separated Values") file has to be generated, following the rules below:
      • A CSV file uses plain text to store tabular data.
      • In a CSV file, each row of the file is an instance.
      • Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
      • Each row must contain the same number of fields
        • but they can be null
      • Fields can be quoted using double quotes (").
      • Fields that contain commas or line separators must be quoted.
      • Quotes (") in fields must be doubled ("").
      • The character encoding must be UTF-8
      • Optionally, a CSV file can use the first line as a header to provide the names of each field.
      select * from scores_last_label into outfile "./scores_last_label.csv";
      select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'score_label'
        UNION select name, address, city, state, postal_code, latitude, longitude, score_label
        from scores_last_label into outfile "./scores_last_label_headers.csv";
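To see the quoting rules above in action, here is a small sketch using Python's standard `csv` module, which by default quotes only the fields that need it and doubles embedded quotes. The sample rows are invented. When writing to a real file, opening it with `encoding="utf-8"` and `newline=""` would cover the encoding rule.

```python
import csv
import io

# Invented rows: one field contains quotes, one contains a comma, one is empty.
rows = [
    {"name": 'Joe\'s "Best" Cafe', "address": "12 Main St, Suite 4", "score_label": "Good"},
    {"name": "Plain Kitchen", "address": "", "score_label": "Adequate"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["name", "address", "score_label"], lineterminator="\n"
)
writer.writeheader()    # optional header line with the field names
writer.writerows(rows)  # every row gets the same number of fields
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values unchanged.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

The second row keeps its empty `address` as two adjacent commas, so the field count stays constant.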
  26. Transformations Demo #4
  27. Define a Goal
      • Predict rating: Poor / Needs Improvement / Adequate / Good
      • This is a classification problem
      • Based on business profile:
        • Description: kitchen, restaurant, etc.
        • Location: zip code, latitude, longitude
        • Number of violations, text of violations
  28. Aggregating
      When the entity to model is different from the provided data, an aggregation to get the entity might be needed.
      Original data (list of playbacks):
      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone
      Aggregated data (list of users):
      User     Num.Playbacks  Total Time  Pref.Device
      User001  3              830         Tablet
      User002  1              218         Smartphone
      User003  3              1019        TV
      User005  2              521         Tablet
      Command line:
      tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
      tail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
      MySQL:
      SET @@group_concat_max_len = 15000;
      create table violations_aggregated select business_id, count(*) as violation_num, group_concat(description) as violation_txt
        from violations group by business_id;
      create table scores_last_label_violations select * from scores_last_label
        left join violations_aggregated USING (business_id);
      select 'name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude', 'violation_num', 'violation_txt', 'score_label'
        UNION select name, address, city, state, postal_code, latitude, longitude, violation_num, violation_txt, score_label
        from scores_last_label_violations into outfile "./scores_last_label_violations_headers.csv";
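For readers who prefer a script to `awk`, the playback-to-user aggregation above can be sketched in Python as follows. The tuple layout and the function name `aggregate_by_user` are choices made for this sketch; the data is the playback list from the slide.

```python
from collections import Counter, defaultdict

# (content, duration in seconds, user, device), as listed on the slide.
playbacks = [
    ("Highway star", 190, "User001", "TV"),
    ("Blues alive", 281, "User005", "Tablet"),
    ("Lonely planet", 332, "User003", "TV"),
    ("Dance, dance", 312, "User001", "Tablet"),
    ("The wall", 218, "User002", "Smartphone"),
    ("Offside down", 240, "User005", "Tablet"),
    ("The alchemist", 418, "User003", "TV"),
    ("Bring me down", 328, "User001", "Tablet"),
    ("The scarecrow", 269, "User003", "Smartphone"),
]


def aggregate_by_user(rows):
    """Collapse playbacks into one (num_playbacks, total_time, pref_device) per user."""
    plays = Counter()
    total_time = Counter()
    devices = defaultdict(Counter)
    for _content, duration, user, device in rows:
        plays[user] += 1
        total_time[user] += duration
        devices[user][device] += 1
    return {
        user: (plays[user], total_time[user], devices[user].most_common(1)[0][0])
        for user in sorted(plays)
    }


users = aggregate_by_user(playbacks)
for user, summary in users.items():
    print(user, *summary)
```

The preferred device is simply the most frequent one per user; how ties should be broken is left open here.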
  29. Transformations Demo #5
  30. Pivoting
      Different values of a feature are pivoted to new columns in the result dataset.
      Original data:
      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone
      Aggregated data with pivoted columns:
      User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
      User001  3              830         Tablet       1      2          0              190    640        0
      User002  1              218         Smartphone   0      0          1              0      0          218
      User003  3              1019        TV           2      0          1              750    0          269
      User005  2              521         Tablet       0      2          0              0      521        0
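The pivoted columns in the table above (NP = number of playbacks, TT = total time, one pair per device) could be computed along these lines. This is a sketch over the slide's data; the fixed `DEVICES` list and the function name are assumptions.

```python
DEVICES = ["TV", "Tablet", "Smartphone"]  # assumed to be known in advance

# (duration in seconds, user, device) for each playback on the slide.
playbacks = [
    (190, "User001", "TV"), (281, "User005", "Tablet"), (332, "User003", "TV"),
    (312, "User001", "Tablet"), (218, "User002", "Smartphone"),
    (240, "User005", "Tablet"), (418, "User003", "TV"),
    (328, "User001", "Tablet"), (269, "User003", "Smartphone"),
]


def pivot_by_device(rows):
    """Build one row per user with NP_<device> and TT_<device> columns."""
    table = {}
    for duration, user, device in rows:
        # Start every user with zeros so all rows share the same columns.
        row = table.setdefault(
            user, {f"{prefix}_{d}": 0 for prefix in ("NP", "TT") for d in DEVICES}
        )
        row[f"NP_{device}"] += 1
        row[f"TT_{device}"] += duration
    return table


pivoted = pivot_by_device(playbacks)
for user in sorted(pivoted):
    print(user, pivoted[user])
```

Initializing every device column to zero is what keeps the result rectangular, which ML-ready tabular data requires.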
  31. Time Windows
      Create new features using values over different periods of time.
      [Figure: an instances (millions) by features (thousands) table repeated along a time axis, t=1, t=2, t=3]
      create table scores_2013 select a.business_id, a.score as score_2013, a.idate as idate_2013
        from inspections as a
        JOIN (select business_id, max(idate) as idate from inspections
              where substr(idate,1,4) = "2013" group by business_id) as b
        where a.business_id = b.business_id and a.idate = b.idate;
      create table scores_over_time select * from businesses
        left join scores_2013 USING (business_id)
        left join scores_2014 USING (business_id);
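As a rough Python counterpart to the per-year SQL above, the sketch below turns a list of inspections into one feature per year, keeping the latest score within each year. The sample inspections are invented and the `score_<year>` naming simply mirrors the slide's `score_2013` column.

```python
# (business_id, score, inspection date as YYYYMMDD); invented sample rows.
inspections = [
    (1, 90, "20130115"),
    (1, 94, "20130920"),
    (1, 85, "20140402"),
    (2, 78, "20140611"),
]


def latest_score_by_year(rows):
    """Return {business_id: {'score_<year>': latest score in that year}}."""
    latest = {}
    for business_id, score, idate in rows:
        key = (business_id, idate[:4])  # the time window is the calendar year
        if key not in latest or idate > latest[key][0]:
            latest[key] = (idate, score)
    features = {}
    for (business_id, year), (_idate, score) in latest.items():
        features.setdefault(business_id, {})[f"score_{year}"] = score
    return features


features = latest_score_by_year(inspections)
print(features)
```

A business with no inspection in a given year simply lacks that key, which corresponds to the NULL produced by the slide's left join.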
  32. Transformations Demo #6
  33. Updates
      Need a current view of the data, but new data only comes in batches of changes.
      [Figure: an instances by features table updated by batches on day 1, day 2, day 3]
  34. Streaming
      Data only comes in single changes.
      [Figure: a data stream feeding an instances by features table; Stream → Batch (kafka, etc)]
  35. Prosper Loan Life Cycle
      [Figure: Listings states: Submit, Bids, Funded, Cancelled, Withdraw, Expired. Loans states: Current, Late, Paid, Defaulted]
      Goal                                          ML Task
      Q: Which new listings make it to funded?      Classification
      Q: Which funded loans make it to paid?        Classification
      Q: If funded, what will be the rate?          Regression
  36. Prosper Example
      [Figure: data provided in XML, with updates. fetch.sh ("curl", daily) → XML → import.py → export.sh → bigml.sh → Model → Predict → Share in gallery. Fields: Status, LoanStatus, BorrowerRate. Denormalization with join]
  37. Prosper Example: Tidbits and Lessons Learned…
      • XML… yuck!
      • MongoDB has CSV export and is record based so it is easy to handle changing data structure.
      • Feature Engineering
        • There are 5 different classes of "bad" loans
        • Date cleanup
        • Type casting: floats and ints
      • Would be better to track over time
        • number of late payments
        • compare predictions and actuals
      • XML… yuck!
  38. Tools
  39. Tools
      • Command Line?
        • join, cut, awk, sed, sort, uniq
      • Automation
        • Shell, Python, crontab, etc
        • Talend
        • BigML: bindings, bigmler, API, whizzml
      • Relational DB
        • MySQL
      • Non-Relational DB
        • MongoDB
  40. Talend: Denormalization Example
      https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
  41. Summary
      • Data is awful
        • Requires clean-up
        • Transformations
        • Consumes an enormous part of the effort in applying ML
      • Techniques:
        • Denormalizing
        • Aggregating / Pivoting
        • Time windows / Streaming
      • What a real Workflow looks like and the tools required
  42. Feature Engineering: Creating Features that Make Machine Learning Work. Poul Petersen, CIO, BigML, Inc
  43. What is Feature Engineering
      Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
      • This is really, really important - more than algorithm selection!
        • In fact, so important that BigML often does it automatically
      • ML Algorithms have no deeper understanding of data
        • Numerical: have a natural order, can be scaled, etc
        • Categorical: have discrete values, etc.
        • The "magic" is the ability to find patterns quickly and efficiently
      • ML Algorithms only know what you tell/show them with data
        • Medical: Kg and M, but BMI = Kg/M² is better
        • Lending: Debt and Income, but DTI is better
      • Intuition can be risky: remember to prove it with an evaluation!
  44. Built-in Transformations: Date-Time Fields
      Input (DATE-TIME): 2013-09-25 10:02
      …  year  month  day  hour  minute  …
      …  2013  Sep    25   10    2       …
         NUM   CAT    NUM  NUM   NUM
      • Date-Time fields have a lot of information "packed" into them
      • Splitting out the time components allows ML algorithms to discover time-based patterns.
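BigML performs this split automatically; purely to illustrate what the expansion produces, here is a Python sketch. The input format string and the fixed English month abbreviations are assumptions made for this example.

```python
from datetime import datetime

# Fixed month names avoid depending on the machine's locale settings.
MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def expand_datetime(text):
    """Split 'YYYY-MM-DD HH:MM' into separate numeric and categorical fields."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return {
        "year": ts.year,                       # numeric
        "month": MONTH_NAMES[ts.month - 1],    # categorical
        "day": ts.day,                         # numeric
        "hour": ts.hour,                       # numeric
        "minute": ts.minute,                   # numeric
    }


parts = expand_datetime("2013-09-25 10:02")
print(parts)
```

Once separated, a field such as `hour` lets a model find patterns like "activity peaks mid-morning" that are invisible in the packed timestamp.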
  45. Built-in Transformations: Categorical Fields for Clustering/LR
      Input field (CAT):
      …  alchemy_category  …
      …  business          …
      …  recreation        …
      …  health            …
      Encoded fields (NUM, NUM, NUM):
      …  business  health  recreation  …
      …  1         0       0           …
      …  0         0       1           …
      …  0         1       0           …
      • Clustering and Logistic Regression require numeric fields for inputs
      • Categorical values are transformed to numeric vectors automatically*
      • *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
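The encoding in the table above is commonly called one-hot encoding. A minimal Python sketch follows, assuming the categories are ordered alphabetically as on the slide; it illustrates the idea and is not BigML's own implementation.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 vector per input value)."""
    categories = sorted(set(values))
    vectors = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, vectors


categories, vectors = one_hot(["business", "recreation", "health"])
print(categories)
for vector in vectors:
    print(vector)
```

Each row has exactly one `1`, in the column of its own category.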
  46. Built-in Transformations: Text Fields
      Input (TEXT): "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
      …  great  afraid  born  achieve  …
      …  4      1       1     1        …
         NUM    NUM     NUM   NUM
      • Unstructured text contains a lot of potentially interesting patterns
      • Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text
      • Another option is Topic Modeling to extract thematic meaning
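A toy bag-of-words count reproduces the numbers in the table above. The stemming here is a deliberately crude assumption (it only strips a trailing "ness"), used so that "greatness" counts toward "great"; real text analysis uses proper stemmers and also filters stop words, which this sketch does not do.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased word stems in a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Crude stand-in for stemming: 'greatness' -> 'great'.
    stems = [t[:-4] if t.endswith("ness") and len(t) > 6 else t for t in tokens]
    return Counter(stems)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)
print(counts["great"], counts["afraid"], counts["born"], counts["achieve"])
```

Each distinct token becomes a numeric field, which is how free text ends up as ordinary columns in the dataset.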
  47. Help ML to Work Better: when text is not actually unstructured
      Input (TEXT):
      {
        "url": "cbsnews",
        "title": "Breaking News Headlines Business Entertainment World News",
        "body": "news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
      }
      Extracted fields (TEXT, TEXT):
      title             body
      Breaking News…    news covering…
      • In this case, the text field has structure (key/value pairs)
      • Extracting the structure as new features may allow the ML algorithm to work better
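Assuming the text field holds valid JSON, as in the example above, the key/value structure can be pulled out with Python's standard `json` module. The function name and the choice of keys are illustrative.

```python
import json

raw = (
    '{"url": "cbsnews", '
    '"title": "Breaking News Headlines Business Entertainment World News", '
    '"body": "news covering all the latest breaking national and world news '
    'headlines, including politics, sports, entertainment, business and more."}'
)


def extract_fields(text, keys=("title", "body")):
    """Parse a JSON text field and return the selected keys as separate features."""
    record = json.loads(text)
    # Missing keys become empty strings so every row has the same columns.
    return {key: record.get(key, "").strip() for key in keys}


fields = extract_fields(raw)
print(fields["title"])
print(fields["body"][:13])
```

After the split, `title` and `body` can each be analyzed as a separate text field instead of one mixed blob.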
  48. FE Demo #1
  49. Help ML to Work at all: when the pattern does not exist
      Goal: Predict principal direction from highway number
      Highway Number  Direction    Is Long
      2               East-West    FALSE
      4               East-West    FALSE
      5               North-South  TRUE
      8               East-West    FALSE
      10              East-West    TRUE
      …               …            …
      (= (mod (field "Highway Number") 2) 0)
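The Flatline expression above computes whether the highway number is even. The Python sketch below mirrors it on the slide's rows, where even numbers are East-West and the odd one is North-South; `is_even` and `predict_direction` are illustrative names.

```python
# (highway number, direction) pairs from the slide.
highways = [
    (2, "East-West"),
    (4, "East-West"),
    (5, "North-South"),
    (8, "East-West"),
    (10, "East-West"),
]


def is_even(number):
    """Python equivalent of (= (mod n 2) 0)."""
    return number % 2 == 0


def predict_direction(number):
    """Rule made visible by the engineered feature."""
    return "East-West" if is_even(number) else "North-South"


for number, direction in highways:
    print(number, is_even(number), predict_direction(number) == direction)
```

The raw number gives a model little to work with, while the engineered parity flag separates the two directions directly.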
  50. FE Demo #2
  51. Feature Engineering: Discretization
      Total Spend (NUM)   Spend Category (CAT)
      7,342.99            Top 33%
      304.12              Bottom 33%
      4.56                Bottom 33%
      345.87              Middle 33%
      8,546.32            Top 33%
      With the numeric field: "Predict will spend $3,521 with error $1,232"
      With the categorical field: "Predict customer will be Top 33% in spending"
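One possible way to derive the spend categories above is to rank each value and cut the ranks into thirds. The percentile-rank formula used here, rank / (n - 1), is an assumption chosen because it reproduces the slide's labels; other quantile conventions may assign borderline values differently.

```python
def discretize_terciles(values):
    """Label each value as Bottom, Middle or Top 33% by its rank in the list."""
    ordered = sorted(values)
    n = len(ordered)
    labels = []
    for value in values:
        # Percentile rank in [0, 1]; assumed convention, see the note above.
        pct = ordered.index(value) / (n - 1) if n > 1 else 0.0
        if pct < 1 / 3:
            labels.append("Bottom 33%")
        elif pct < 2 / 3:
            labels.append("Middle 33%")
        else:
            labels.append("Top 33%")
    return labels


total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
spend_category = discretize_terciles(total_spend)
print(spend_category)
```

Turning the numeric target into a category changes the task from regression to classification, which is the point the slide makes.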
  52. FE Demo #3
  53. Built-ins for FE
      • Discretize: Converts a numeric value to categorical
      • Replace missing values: fixed/max/mean/median/etc
      • Normalize: Adjust a numeric value to a specific range of values while preserving the distribution
      • Math: Exponentiation, Logarithms, Squares, Roots, etc
      • Types: Force a field value to categorical, integer, or real
      • Random: Create random values for introducing noise
      • Statistics: Mean, Population
      • Refresh Fields:
        • Types: recomputes field types. Ex: #classes > 1000
        • Preferred: recomputes preferred status
  54. Flatline Add Fields: Computing with Existing Features
      (/ (field "Debt") (field "Income"))
      Debt (NUM)  Income (NUM)  Debt to Income Ratio (NUM)
      10,134      100,000       0.10
      85,234      134,000       0.64
      8,112       21,500        0.38
      0           45,900        0
      17,534      52,000        0.34
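The ratio column above can be checked with a few lines of Python. Rounding to two decimals matches the slide's presentation; returning `None` for a zero income is an assumption added here to avoid a division error, since the slide does not say how that case is handled.

```python
def debt_to_income(debt, income):
    """Debt-to-income ratio rounded to two decimals; None when income is zero."""
    if income == 0:
        return None  # assumed handling, not specified on the slide
    return round(debt / income, 2)


loans = [
    (10134, 100000),
    (85234, 134000),
    (8112, 21500),
    (0, 45900),
    (17534, 52000),
]
ratios = [debt_to_income(debt, income) for debt, income in loans]
print(ratios)
```

A single ratio carries the relationship between the two raw fields, which a model would otherwise have to approximate on its own.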
  55. 55. BigML, Inc 15Feature Engineering FE Demo #4
  56. What is Flatline?
      Flatline: a domain-specific language for feature engineering and programmatic filtering
      • DSL:
          • Invented by BigML - programmatic / optimized for speed
          • Transforms datasets into new datasets
              • Adding new fields / filtering
          • Transformations are written in Lisp-style syntax
      • Feature Engineering
          • Computing new fields: (/ (field "Debt") (field "Income"))
      • Programmatic Filtering:
          • Filtering datasets according to functions that evaluate to true/false, using the row of data as an input
  57. Flatline
      • Lisp-style syntax: operators come first
          • Correct: (+ 1 2) => NOT correct: (1 + 2)
      • Dataset fields are first-class citizens
          • (field "diabetes pedigree")
      • Limited programming language structures
          • let, cond, if, map, list operators, * / + -, etc.
      • Built-in transformations
          • statistics, strings, timestamps, windows
  58. Flatline s-expressions: Adding Simple Labels to Data
      Define "default" as missing three payments in a row
      (= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
      Unlabelled data:
      Name         Month - 3   Month - 2   Month - 1
      Joe Schmo    123.23      0           0
      Jane Plain   0           0           0
      Mary Happy   0           55.22       243.33
      Tom Thumb    12.34       8.34        14.56
      Labelled data:
      Name         Month - 3   Month - 2   Month - 1   Default
      Joe Schmo    123.23      0           0           FALSE
      Jane Plain   0           0           0           TRUE
      Mary Happy   0           55.22       243.33      FALSE
      Tom Thumb    12.34       8.34        14.56       FALSE
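The labelling rule can be checked against the slide's table with a short Python sketch. The function name `is_default` and the dictionary layout are illustrative assumptions; the logic mirrors the s-expression, which sums the absolute payments and tests for zero.

```python
MONTHS = ("Month - 3", "Month - 2", "Month - 1")

def is_default(row):
    """Python analogue of the slide's expression: TRUE when all three payments are zero."""
    return sum(abs(row[month]) for month in MONTHS) == 0

# Rows from the slide's unlabelled table.
customers = [
    {"Name": "Joe Schmo", "Month - 3": 123.23, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Jane Plain", "Month - 3": 0, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Mary Happy", "Month - 3": 0, "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb", "Month - 3": 12.34, "Month - 2": 8.34, "Month - 1": 14.56},
]

for customer in customers:
    customer["Default"] = is_default(customer)
    print(customer["Name"], customer["Default"])
```

Taking the absolute value before summing keeps a refund (a negative payment) from cancelling a real payment and producing a false label.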
  59. FE Demo #5
  60. Flatline s-expressions: Shock: Deviations from a Trend
      Shock = (Current - (4-day avg)) / std dev
      date   volume   price   day-4   day-3   day-2   day-1   4davg
      1      34353    314                                     -
      2      44455    315                             314     -
      3      22333    315                     314     315     -
      4      52322    321             314     315     315     -
      5      28000    320     314     315     315     321     316.25
      6      31254    319     315     315     321     320     317.75
      7      56544    323     315     321     320     319     318.75
      8      44331    324
      9      81111    287
      10     65422    294
      11     59999    300
      12     45556    302
      13     19899    301
  61. Flatline s-expressions: Shock: Deviations from a Trend
      Shock = (Current - (4-day avg)) / std dev
      Current: (field "price")
      4-day avg: (avg-window "price" -4 -1)
      std dev: (standard-deviation "price")
      (/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
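A plain-Python sketch of the shock feature, using the price column from the previous slide. Two assumptions are made here and are not confirmed by the deck: the standard deviation is taken over the whole column as a population standard deviation, and rows with fewer than four prior days get no value, matching the "-" entries in the slide's table. The function names are illustrative.

```python
from statistics import pstdev

# Price column from the slide, days 1 to 13.
prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301]

def trailing_average(values, index, window=4):
    """Mean of the `window` rows before `index`, or None if not enough history."""
    if index < window:
        return None
    return sum(values[index - window:index]) / window

def shock(values, index, window=4):
    """(current - trailing average) / standard deviation of the whole column."""
    average = trailing_average(values, index, window)
    if average is None:
        return None
    return (values[index] - average) / pstdev(values)

for day, price in enumerate(prices, start=1):
    print(day, price, trailing_average(prices, day - 1), shock(prices, day - 1))
```

The trailing averages for days 5, 6, and 7 agree with the 4davg column on the slide, and the drop to 287 on day 9 produces the most negative shock in the series.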
  62. FE Demo #6
  63. Advanced s-expressions
      Moon Phase %:
      (/ (mod (- (/ (epoch (field "date-field")) 1000) 621300) 2551443) 2551442)
      Highway isEven?:
      (= (mod (field "Highway Number") 2) 0)
      Distance Lat/Long <=> Ref (Haversine):
      (let (R 6371000 latA (to-radians {lat-ref}) latB (to-radians (field "LATITUDE")) latD (- latB latA) longD (to-radians (- (field "LONGITUDE") {long-ref})) a (+ (square (sin (/ latD 2))) (* (cos latA) (cos latB) (square (sin (/ longD 2))))) c (* 2 (asin (min (list 1 (sqrt a)))))) (* R c))
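The Haversine expression is easier to follow when unrolled into named steps. The Python sketch below mirrors the `let` bindings one by one; the function name and parameter names are assumptions, with `lat_ref` and `long_ref` standing in for the `{lat-ref}` and `{long-ref}` placeholders on the slide.

```python
import math

EARTH_RADIUS_M = 6371000  # R in the slide's expression, in meters

def haversine_m(lat_ref, long_ref, lat, long):
    """Great-circle distance in meters between a reference point and (lat, long)."""
    lat_a = math.radians(lat_ref)
    lat_b = math.radians(lat)
    lat_d = lat_b - lat_a
    long_d = math.radians(long - long_ref)
    a = (math.sin(lat_d / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(long_d / 2) ** 2)
    # min(1, ...) guards against rounding pushing the asin argument above 1
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c

# One degree of longitude along the equator.
print(round(haversine_m(0.0, 0.0, 0.0, 1.0)))
```

Applied row by row against a fixed reference point, this turns two hard-to-use coordinate fields into a single distance feature.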
  64. WhizzML + Flatline
      [Diagram: a WhizzML script from the script gallery wraps the Haversine Flatline expression. Inputs: INPUT DATASET, LAT Ref, LONG Ref. Output: OUTPUT DATASET.]
  65. Feature Engineering: Fix Missing Values in a "Meaningful" Way
      [Diagram: Original Dataset → Filter Zeros → Clean Dataset → Model insulin → Predict insulin → Amended Dataset → Select insulin → Fixed Dataset]
      (if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
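The final selection step can be sketched in Python as follows. The sample rows and the function name `fix_insulin` are hypothetical; the sketch assumes, as the slide does, that a recorded insulin value of 0 stands for a missing measurement and that a model trained on the non-zero rows has already produced the "predicted insulin" field.

```python
def fix_insulin(row):
    """Python analogue of
    (if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))."""
    if row["insulin"] == 0:
        return row["predicted insulin"]
    return row["insulin"]

# Hypothetical rows after the prediction step has added "predicted insulin".
amended = [
    {"insulin": 94, "predicted insulin": 101.5},
    {"insulin": 0, "predicted insulin": 130.2},
    {"insulin": 168, "predicted insulin": 150.0},
]

fixed = [fix_insulin(row) for row in amended]
print(fixed)
```

Measured values are kept untouched; the prediction is used only where the original value was a placeholder zero.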
  66. FE Demo #7
  67. Feature Selection
  68. Feature Selection
      • Model Summary
          • Field Importance
      • Algorithmic
          • Best-First Feature Selection
          • Boruta
      • Leakage
          • Tight Correlations (AD, Plot, Correlations)
          • Test Data
          • Perfect future knowledge
      cat diabetes.csv diabetes_testset.csv | sort | uniq -d | wc -l
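The shell one-liner above counts lines that occur more than once across the two files, which also counts a shared header line and any duplicates inside a single file. A slightly stricter Python sketch, under the assumption that both files have one header row, reports only the data rows present in both sets. The function name and the in-memory sample lines are hypothetical stand-ins for the two CSV files.

```python
def shared_rows(train_lines, test_lines, has_header=True):
    """Return the set of data rows that appear in both the training and the test lines."""
    start = 1 if has_header else 0
    train = {line.strip() for line in train_lines[start:]}
    test = {line.strip() for line in test_lines[start:]}
    return train & test

# Hypothetical contents standing in for diabetes.csv and diabetes_testset.csv.
train_lines = ["pregnancies,glucose,diabetes", "6,148,true", "1,85,false", "8,183,true"]
test_lines = ["pregnancies,glucose,diabetes", "1,89,false", "8,183,true"]

overlap = shared_rows(train_lines, test_lines)
print(len(overlap), sorted(overlap))
```

With real files, the two lists could be filled with `open(path).read().splitlines()`. Any non-empty overlap means the test set is not independent of the training data.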
  69. Feature Selection: Leakage
      • Sales pipeline where step n-1 has no other outcome than step n
      • Stock close predicts stock open
      • Churn retention: the worst rep is actually the best (correlation != causation)
      • Cancer prediction where one input is a doctor-ordered test for the condition
      • Account ID predicts fraud (because only new accounts are fraudsters)
  70. Evaluate & Automate
  71. Evaluate & Automate
      • Evaluate
          • Did you meet the goal?
          • If not, did you discover something else useful?
          • If not, start over
          • If you did…
      • Automate: you don't want to hand-code that every time, right?
          • Consider tools that are easy to automate
              • Scripting interface
              • APIs
          • Ability to maintain is important
  72. The Process
      [Flowchart: Define Goal → Data Transform → Feature Engineer & Selection → Model & Evaluate → Goal Met? If yes: Automate. If no: Better Features, Better Data, Tune Algorithm, or Not Possible.]
  73. Summary
      • Feature Engineering: what it is / why it is important
      • Automatic transformations: date-time, text, etc
      • Built-in functions: filtering and feature engineering
          • Discretization / Normalization / etc.
      • Flatline: programmatic feature engineering / filtering
          • Structure
          • Examples: adding fields / filtering
      • When building features it is important to watch for leakage
      • The critical importance of automating
