Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

BSSML16 L6. Basic Data Transformations

438 views

Published on

Brazilian Summer School in Machine Learning 2016
Day 2 - Lecture 1: Basic Data Transformations
Lecturer: Poul Petersen (BigML)

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

BSSML16 L6. Basic Data Transformations

  1. 1. D E C E M B E R 8 - 9 , 2 0 1 6
  2. 2. BigML, Inc 2 Poul Petersen CIO, BigML, Inc. Basic TransformationsCreating Machine Learning Ready Data
  3. 3. BigML, Inc 3Basic Transformations In a Perfect World… Q: How does a physicist milk a cow? A: Well, first let us consider a spherical cow... Q: How does a data scientist build a model? A: Well, first let us consider perfectly formatted data…
  4. 4. BigML, Inc 4Basic Transformations The Dream CSV Dataset Model Profit!
  5. 5. BigML, Inc 5Basic Transformations The Reality CRM Web Accounts Transactions ML Ready?
  6. 6. BigML, Inc 6Basic Transformations Obstacles • Data Structure • Scattered across systems • Wrong "shape" • Unlabelled data • Data Value • Format: spelling, units • Missing values • Non-optimal correlation • Non-existant correlation • Data Significance • Unwanted: PII, Non-Preferred • Expensive to collect • Insidious: Leakage, obviously correlated Data Transformation Feature Engineering Feature Selection
  7. 7. BigML, Inc 7Basic Transformations The Process • Define a clear idea of the goal. • sometimes this comes later… • Understand what ML tasks will achieve the goal. • Transform the data • where is it, how is it stored? • what are the features? • can you access it programmatically? • Feature Engineering: transform the data you have into the data you actually need. • Evaluate: Try it on a small scale • Accept that you might have to start over…. • But when it works, automate it!!!!
  8. 8. BigML, Inc 8 Data Transformations
  9. 9. BigML, Inc 9Basic Transformations BigML Tasks Goal • Will this customer default on a loan? • How many customers will apply for a loan next month? • Is the consumption of this product unusual? • Is the behavior of the customers similar? • Are these products purchased together? ML Task Classification Regression Anomaly Detection Cluster Analysis Association Discovery
  10. 10. BigML, Inc 10Basic Transformations Classification CategoricalTrainingTesting Predicting
  11. 11. BigML, Inc 11Basic Transformations Regression NumericTrainingTesting Predicting
  12. 12. BigML, Inc 12Basic Transformations Anomaly Detection
  13. 13. BigML, Inc 13Basic Transformations Cluster Analysis
  14. 14. BigML, Inc 14Basic Transformations Association Discovery
  15. 15. BigML, Inc 15Basic Transformations Data Labelling Data may not have labels needed for doing classification Create specific metrics for adding labels Name Month - 3 Month - 2 Month - 1 Joe Schmo 123,23 0 0 Jane Plain 0 0 0 Mary Happy 0 55,22 243,33 Tom Thumb 12,34 8,34 14,56 Un-Labelled Data Labelled data Name Month - 3 Month - 2 Month - 1 Default Joe Schmo 123,23 0 0 FALSE Jane Plain 0 0 0 TRUE Mary Happy 0 55,22 243,33 FALSE Tom Thumb 12,34 8,34 14,56 FALSE Can be done at Feature Engineering step as well
  16. 16. BigML, Inc 16Basic Transformations ML Ready DataInstances Fields (Features) Tabular Data (rows and columns): • Each row • is one instance. • contains all the information about that one instance. • Each column • is a field that describes a property of the instance.
  17. 17. BigML, Inc 17Basic Transformations SF Restaurants Example https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26eb https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/ create database sf_restaurants; use sf_restaurants; create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100)); load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by 'rn' ignore 1 lines; create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100)); load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by 'rn' ignore 1 lines; create table violations (business_id int, vdate varchar(8), description varchar(1000)); load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by 'rn' ignore 1 lines; create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100)); load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by 'rn' ignore 1 lines;
  18. 18. BigML, Inc 18Basic Transformations Data Cleaning Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc. Name Date Duration (s) Genre Plays Highway star 1984-05-24 - Rock 139 Blues alive 1990/03/01 281 Blues 239 Lonely planet 2002-11-19 5:32s Techno 42 Dance, dance 02/23/1983 312 Disco N/A The wall 1943-01-20 218 Reagge 83 Offside down 1965-02-19 4 minutes Techno 895 The alchemist 2001-11-21 418 Bluesss 178 Bring me down 18-10-98 328 Classic 21 The scarecrow 1994-10-12 269 Rock 734 Original data Name Date Duration (s) Genre Plays Highway star 1984-05-24 Rock 139 Blues alive 1990-03-01 281 Blues 239 Lonely planet 2002-11-19 332 Techno 42 Dance, dance 1983-02-23 312 Disco The wall 1943-01-20 218 Reagge 83 Offside down 1965-02-19 240 Techno 895 The alchemist 2001-11-21 418 Blues 178 Bring me down 1998-10-18 328 Classic 21 The scarecrow 1994-10-12 269 Rock 734 Cleaned data update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;
  19. 19. BigML, Inc 19Basic Transformations Denormalizing business inspections violations scores Instances Features (millions) join Data is usually normalized in relational databases, ML-Ready datasets need the information de-normalized in a single file/dataset. create table scores select * from inspections left join businesses using (business_id);
  20. 20. BigML, Inc 20Basic Transformations Aggregating User Num.Playbacks Total Time Pref.Device User001 3 830 Tablet User002 1 218 Smartphone User003 3 1019 TV User005 2 521 Tablet Aggregated data (list of users) When the entity to model is different from the provided data, an aggregation to get the entity might be needed. Content Genr e Duration Play Time User Device Highway star Rock 190 2015-05-12 16:29:33 User001 TV Blues alive Blues 281 2015-05-13 12:31:21 User005 Tablet Lonely planet Tech no 332 2015-05-13 14:26:04 User003 TV Dance, dance Disco 312 2015-05-13 18:12:45 User001 Tablet The wall Reag ge 218 2015-05-14 09:02:55 User002 Smartphone Offside down Tech no 240 2015-05-14 11:26:32 User005 Tablet The alchemist Blues 418 2015-05-14 21:44:15 User003 TV Bring me down Class ic 328 2015-05-15 06:59:56 User001 Tablet The scarecrow Rock 269 2015-05-15 12:37:05 User003 Smartphone Original data (list of playbacks) select business_id,vdate,count(*) as violation_num,group_concat(description) as violation_txt from violations group by business_id, vdate; select * from scores left join violations_aggregated ON (scores.business_id=violations_aggregated.business_id) AND ( vdate=idate); tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c tail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
  21. 21. BigML, Inc 21Basic Transformations Pivoting Different values of a feature are pivoted to new columns in the result dataset. Content Genre Duration Play Time User Device Highway star Rock 190 2015-05-12 16:29:33 User001 TV Blues alive Blues 281 2015-05-13 12:31:21 User005 Tablet Lonely planet Techno 332 2015-05-13 14:26:04 User003 TV Dance, dance Disco 312 2015-05-13 18:12:45 User001 Tablet The wall Reagge 218 2015-05-14 09:02:55 User002 Smartphone Offside down Techno 240 2015-05-14 11:26:32 User005 Tablet The alchemist Blues 418 2015-05-14 21:44:15 User003 TV Bring me down Classic 328 2015-05-15 06:59:56 User001 Tablet The scarecrow Rock 269 2015-05-15 12:37:05 User003 Smartphone Original data User Num.Playback s Total Time Pref.Device NP_TV NP_Tablet NP_Smartphone TT_TV TT_Tablet TT_Smartphone User001 3 830 Tablet 1 2 0 190 640 0 User002 1 218 Smartphone 0 0 1 0 0 218 User003 3 1019 TV 2 0 1 750 0 269 User005 2 521 Tablet 0 2 0 0 521 0 Aggregated data with pivoted columns
  22. 22. BigML, Inc 22Basic Transformations Time Windows Create new features using values over different periods of time Instances Features Time Instances Features (millions) (thousands) t=1 t=2 t=3 select business_id, score as score_2013, idate as idate_2013 from inspections where substr(idate,1,4) = "2013"; create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);
  23. 23. BigML, Inc 23Basic Transformations Updates Need a current view of the data, but new data only comes in batches of changes day 1day 2day 3 Instances Features
  24. 24. BigML, Inc 24Basic Transformations Streaming Data only comes in single changes data stream Instances Features Stream Batch (kafka, etc)
  25. 25. BigML, Inc 25Basic Transformations Structuring Output • A CSV file uses plain text to store tabular data. • In a CSV file, each row of the file is an instance. • Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used. Each row must contain the same number of fields • but they can be null • Fields can be quoted using double quotes ("). • Fields that contain commas or line separators must be quoted. • Quotes (") in fields must be doubled (""). • The character encoding must be UTF-8 • Optionally, a CSV file can use the first line as a header to provide the names of each field. After all the data transformations, a CSV (“Comma-Separated Values) file has to be generated, following the rules below:
  26. 26. BigML, Inc 26Basic Transformations Prosper Loan Life Cycle Submit Bids Cancelled Withdraw Funded Expired Defaulted Paid Current Late Q: Which new loans make it to funded? Q: Which funded loans make it to paid? Q: If funded, what will be the rate? Classification Regression Classification Goal ML Task
  27. 27. BigML, Inc 27Basic Transformations Prosper Example Data Provided in XML updates!! fetch.sh “curl” daily export.sh import.py XML bigml.sh Model Predict Share in gallery Status LoanStatus BorrowerRate
  28. 28. BigML, Inc 28Basic Transformations Prosper Example • XML… yuck! • MongoDB has CSV export and is record based so it is easy to handle changing data structure. • Feature Engineering • There are 5 different classes of “bad” loans • Date cleanup • Type casting: floats and ints • Would be better to track over time • number of late payments • compare predictions and actuals • XML… yuck! Tidbits and Lessons Learned….
  29. 29. BigML, Inc 29Basic Transformations Tools • Command Line? • join, cut, awk, sed, sort, uniq • Automation • Shell, Python, etc • Talend • BigML: bindings, bigmler, API, whizzml • Relational DB • MySQL • Non-Relational DB • MongoDB
  30. 30. BigML, Inc 30Basic Transformations Talend https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/ Denormalization Example

×