
VSSML16 L5. Basic Data Transformations

Valencian Summer School in Machine Learning 2016
Day 2 VSSML16
Lecture 5
Basic Data Transformations
Poul Petersen (BigML)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2016

Published in: Data & Analytics

  1. September 8-9, 2016
  2. Basic Transformations: Creating Machine Learning Ready Data. Poul Petersen, CIO, BigML, Inc
  3. Basic Transformations. Q: How does a physicist milk a cow? A: Well, first let us consider a spherical cow... Q: How does a data scientist build a model? A: Well, first let us consider perfectly formatted data…
  4. The Dream: CSV -> Dataset -> Model -> Profit!
  5. The Reality: CRM, Web, Accounts, Transactions. ML Ready? Is all hope lost? How do you even start?
  6. Holistic Approach
     • Define a clear idea of the goal.
     • Understand what ML tasks will achieve the goal.
     • Understand the data structure to perform those ML tasks.
     • Find out what kind of data you have and make it ML-Ready:
       • where is it, how is it stored?
       • what are the features?
       • can you access it programmatically?
     • Feature Engineering: transform the data you have into the data you actually need.
     • Evaluate: try it on a small scale.
     • Accept that you might have to start over…
     • But when it works, automate it!!!!
  7. Holistic Approach: Define Goal & ML Task
  8. Understand ML Tasks
     Goal                                                    ML Task
     Will this customer default on a loan?                   Classification
     How many customers will apply for a loan next month?    Regression
     Is the consumption of this product unusual?             Anomaly Detection
     Is the behavior of the customers similar?               Cluster Analysis
     Are these products purchased together?                  Association Discovery
  9. Holistic Approach: Required Data Structure
  10. Classification: Categorical. Training, Testing, Predicting
  11. Regression: Numeric. Training, Testing, Predicting
  12. Anomaly Detection
  13. Cluster Analysis
  14. Association Discovery
  15. Holistic Approach: Make Your Data ML-Ready
  16. ML-Ready Data
      Instances, Fields (Features)
      Tabular Data:
      • Each row is one of the instances.
      • Each column is a field that describes a property of the instance that is relevant to the question being modeled.
      • Fields can be: already present in your data, derived from your data, or generated using other fields.
      Machine Learning Algorithms consume instances of the question that you want to model.
      !! Danger Ahead !!
  17. Cleansing
      Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.

      Original data
      Name           Date        Duration (s)  Genre    Plays
      Highway star   1984-05-24  -             Rock     139
      Blues alive    1990/03/01  281           Blues    239
      Lonely planet  2002-11-19  5:32s         Techno   42
      Dance, dance   02/23/1983  312           Disco    N/A
      The wall       1943-01-20  218           Reagge   83
      Offside down   1965-02-19  4 minutes     Techno   895
      The alchemist  2001-11-21  418           Bluesss  178
      Bring me down  18-10-98    328           Classic  21
      The scarecrow  1994-10-12  269           Rock     734

      Cleaned data (blank cells are missing values)
      Name           Date        Duration (s)  Genre    Plays
      Highway star   1984-05-24                Rock     139
      Blues alive    1990-03-01  281           Blues    239
      Lonely planet  2002-11-19  332           Techno   42
      Dance, dance   1983-02-23  312           Disco
      The wall       1943-01-20  218           Reagge   83
      Offside down   1965-02-19  240           Techno   895
      The alchemist  2001-11-21  418           Blues    178
      Bring me down  1998-10-18  328           Classic  21
      The scarecrow  1994-10-12  269           Rock     734
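The date and duration fixes in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture: the list of date layouts and the set of missing-value markers are assumptions read off the example rows, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Date layouts seen in the example table (an assumption; extend as needed).
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d-%m-%y")
# Markers treated as missing values (also an assumption).
MISSING = {"", "-", "N/A"}


def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_duration(raw):
    """Return a duration in whole seconds, or None for a missing value."""
    text = raw.strip()
    if text in MISSING:
        return None
    if text.isdigit():
        return int(text)
    mmss = re.fullmatch(r"(\d+):(\d+)s?", text)      # e.g. "5:32s"
    if mmss:
        return int(mmss.group(1)) * 60 + int(mmss.group(2))
    minutes = re.fullmatch(r"(\d+)\s*minutes?", text)  # e.g. "4 minutes"
    if minutes:
        return int(minutes.group(1)) * 60
    return None
```

Unrecognized values come back as `None` rather than raising, so they end up as missing values in the cleaned dataset instead of stopping the run.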
  18. Denormalizing
      Data is usually normalized in relational databases; ML-Ready datasets need the information de-normalized in a single file/dataset.
      Diagram: users, artists, tracks, albums -> join -> Instances x Features (millions)
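As a rough sketch of what the join produces, the snippet below flattens three small lookup tables into one row per track. The table contents, key names, and the `denormalize` helper are all hypothetical; in practice this step is usually a SQL join or an ETL job.

```python
# Hypothetical normalized tables, keyed by id (all values are illustrative).
artists = {1: {"artist": "Artist A"}, 2: {"artist": "Artist B"}}
albums = {
    10: {"album": "Album X", "artist_id": 1},
    11: {"album": "Album Y", "artist_id": 2},
}
tracks = [
    {"track": "Track 1", "album_id": 10},
    {"track": "Track 2", "album_id": 11},
    {"track": "Track 3", "album_id": 10},
]


def denormalize(tracks, albums, artists):
    """Follow the foreign keys so every track row carries its album and artist."""
    rows = []
    for track in tracks:
        album = albums[track["album_id"]]
        artist = artists[album["artist_id"]]
        rows.append({
            "track": track["track"],
            "album": album["album"],
            "artist": artist["artist"],
        })
    return rows


flat = denormalize(tracks, albums, artists)
```

Each element of `flat` is a self-contained instance, which is the shape a single ML-Ready file needs.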
  19. Aggregating
      When the entity to model is different from the provided data, an aggregation to get the entity might be needed.

      Original data (list of playbacks)
      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

      Aggregated data (list of users)
      User     Num.Playbacks  Total Time  Pref.Device
      User001  3              830         Tablet
      User002  1              218         Smartphone
      User003  3              1019        TV
      User005  2              521         Tablet

      Command line: tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -c
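The shell pipeline above only counts playbacks per user. A minimal Python sketch of the full aggregation (count, total time, preferred device) might look like this; the rows are copied from the slide, while the function name and the output field names are assumptions made for the example.

```python
from collections import Counter, defaultdict

# (content, genre, duration_s, play_time, user, device), copied from the slide.
PLAYBACKS = [
    ("Highway star", "Rock", 190, "2015-05-12 16:29:33", "User001", "TV"),
    ("Blues alive", "Blues", 281, "2015-05-13 12:31:21", "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "2015-05-13 14:26:04", "User003", "TV"),
    ("Dance, dance", "Disco", 312, "2015-05-13 18:12:45", "User001", "Tablet"),
    ("The wall", "Reagge", 218, "2015-05-14 09:02:55", "User002", "Smartphone"),
    ("Offside down", "Techno", 240, "2015-05-14 11:26:32", "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "2015-05-14 21:44:15", "User003", "TV"),
    ("Bring me down", "Classic", 328, "2015-05-15 06:59:56", "User001", "Tablet"),
    ("The scarecrow", "Rock", 269, "2015-05-15 12:37:05", "User003", "Smartphone"),
]


def aggregate_by_user(playbacks):
    """Collapse one row per playback into one row per user."""
    durations = defaultdict(list)
    devices = defaultdict(Counter)
    for _content, _genre, duration, _when, user, device in playbacks:
        durations[user].append(duration)
        devices[user][device] += 1
    return {
        user: {
            "num_playbacks": len(times),
            "total_time": sum(times),
            # Preferred device = the device used most often by this user.
            "pref_device": devices[user].most_common(1)[0][0],
        }
        for user, times in durations.items()
    }


users = aggregate_by_user(PLAYBACKS)
```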
  20. Pivoting
      Different values of a feature are pivoted to new columns in the result dataset.

      Original data
      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reagge   218       2015-05-14 09:02:55  User002  Smartphone
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
      The scarecrow  Rock     269       2015-05-15 12:37:05  User003  Smartphone

      Aggregated data with pivoted columns
      User     Num.Playbacks  Total Time  Pref.Device  NP_TV  NP_Tablet  NP_Smartphone  TT_TV  TT_Tablet  TT_Smartphone
      User001  3              830         Tablet       1      2          0              190    640        0
      User002  1              218         Smartphone   0      0          1              0      0          218
      User003  3              1019        TV           2      0          1              750    0          269
      User005  2              521         Tablet       0      2          0              0      521        0
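One way to compute the pivoted `NP_*` (number of playbacks) and `TT_*` (total time) columns is sketched below. The input keeps only the user, device, and duration of each playback from the slide; the helper name is hypothetical.

```python
from collections import defaultdict

# (user, device, duration_s) for each playback in the slide's table.
PLAYBACKS = [
    ("User001", "TV", 190), ("User005", "Tablet", 281), ("User003", "TV", 332),
    ("User001", "Tablet", 312), ("User002", "Smartphone", 218),
    ("User005", "Tablet", 240), ("User003", "TV", 418),
    ("User001", "Tablet", 328), ("User003", "Smartphone", 269),
]
DEVICES = ("TV", "Tablet", "Smartphone")


def pivot_by_device(playbacks):
    """Turn each device value into its own count (NP_) and total-time (TT_) column."""
    rows = defaultdict(
        lambda: {f"{prefix}_{device}": 0 for prefix in ("NP", "TT") for device in DEVICES}
    )
    for user, device, duration in playbacks:
        rows[user][f"NP_{device}"] += 1
        rows[user][f"TT_{device}"] += duration
    return dict(rows)


pivoted = pivot_by_device(PLAYBACKS)
```

Pre-filling every column with 0 keeps the row width fixed, so users who never used a device still get a value for it.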
  21. Time Windows
      Create new features using values over different periods of time.
      Diagram: Instances x Features at t=1, t=2, t=3 -> Instances (millions) x Features (thousands)
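A minimal sketch of the idea, assuming the raw data is a stream of `(instance_id, period, value)` events and that summing per period is the feature wanted. The event values, the `total_t<N>` column names, and the helper are illustrative only.

```python
from collections import defaultdict


def window_features(events, periods):
    """Build one summed feature per time period for every instance."""
    rows = defaultdict(lambda: {f"total_t{p}": 0 for p in periods})
    for instance_id, period, value in events:
        if period in periods:
            rows[instance_id][f"total_t{period}"] += value
    return dict(rows)


# Hypothetical events: (instance_id, period, value).
events = [("a", 1, 5), ("a", 2, 7), ("b", 1, 2), ("a", 2, 1), ("b", 3, 4)]
features = window_features(events, periods=(1, 2, 3))
```

The same instance now has one column per period, which is how a few raw features grow into many.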
  22. Updates
      Need a current view of the data, but new data only comes in batches of changes.
      Diagram: batches for day 1, day 2, day 3 merged into Instances x Features
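A small sketch of keeping a current view from daily change batches, assuming each batch maps a record id to the fields that changed. The record ids, field names, and values are hypothetical.

```python
def apply_batch(current, batch):
    """Return a new view where rows in `batch` overwrite or extend `current`."""
    updated = {key: dict(row) for key, row in current.items()}
    for key, changes in batch.items():
        updated.setdefault(key, {}).update(changes)
    return updated


# Hypothetical daily batches: only new or changed fields are included.
day1 = {"loan-1": {"status": "Current", "rate": 0.12}}
day2 = {"loan-1": {"status": "Late"}, "loan-2": {"status": "Current", "rate": 0.09}}

view = apply_batch(apply_batch({}, day1), day2)
```

Returning a new dictionary instead of mutating the old one keeps each day's view available for comparison.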
  23. Structuring Output
      After all the data transformations, a CSV (“Comma-Separated Values”) file has to be generated, following the rules below:
      • A CSV file uses plain text to store tabular data.
      • In a CSV file, each row of the file is an instance.
      • Each column in a row is usually separated by a comma (,) but other "separators" like semi-colon (;), colon (:), pipe (|), can also be used.
      • Each row must contain the same number of fields
        • but they can be null
      • Fields can be quoted using double quotes (").
      • Fields that contain commas or line separators must be quoted.
      • Quotes (") in fields must be doubled ("").
      • The character encoding must be UTF-8.
      • Optionally, a CSV file can use the first line as a header to provide the names of each field.
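Python's standard `csv` module already applies the quoting rules listed above, so a sketch of the export step can stay short. The first data row comes from the cleaned table earlier in the deck; the second row is hypothetical and is there only to show quote doubling.

```python
import csv
import io

HEADER = ["Name", "Date", "Duration (s)", "Genre", "Plays"]
ROWS = [
    ["Dance, dance", "1983-02-23", 312, "Disco", None],  # comma in a field, null Plays
    ['Say "hi"', "1990-03-01", 281, "Blues", 239],       # hypothetical row with quotes
]


def to_csv(header, rows):
    """Serialize rows as CSV text, enforcing a constant number of fields."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError("every row must have the same number of fields")
        # Null values are written as empty fields.
        writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()


text = to_csv(HEADER, ROWS)
# To write a file instead, open it with encoding="utf-8" and newline="".
```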
  24. Holistic Approach: Feature Engineering
  25. Feature Engineering
      • Flatline
        • Domain Specific Language for data generation and filtering
        • Works with datasets -> datasets
        • Lots of built-in functions
        • Sliding windows
        • Date/Time parsing
        • Flatline Editor (in UI)
        • https://github.com/bigmlcom/flatline
  26. Feature Engineering
      • Feature Engineering of Numeric features:
        • Discretization (percentiles, within percentiles, groups)
        • Replacement
        • Normalization
        • Exponentiation, Logarithms, Squares, etc.
        • Shock
      • Feature Engineering of Text features:
        • Misspellings
        • Length
        • Number of subordinate sentences
        • Language
        • Levenshtein distance
      • Stacking:
        • Compute a field using non-linear combinations of other fields
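Of the text features listed above, Levenshtein distance is the easiest to show in code. The sketch below is the textbook dynamic-programming version, not BigML's implementation; it could, for example, flag "Bluesss" as a near-duplicate of "Blues".

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]
```

Only two rows of the distance table are kept at a time, so memory stays proportional to the length of `b`.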
  27. Holistic Approach: Test & Automate
  28. Test & Automate
      • Test - Evaluate
        • Did you meet the goal?
        • If not, did you discover something else useful?
        • If not, start over
        • If you did…
      • Automate - You don’t want to hand code that every time, right?
        • Consider tools that are easy to automate
          • scripting interface
          • APIs
        • Maintainability is important
  29. Tools
      • Command Line?
        • join, cut, awk, sed, sort, uniq
      • Automation
        • Shell, Python, etc
        • Talend
        • BigML: bindings, bigmler, API, whizzml
      • Relational DB
        • MySQL
      • Non-Relational DB
        • MongoDB
  30. Prosper
      Loan states: Submit, Bids, Cancelled, Withdraw, Funded, Expired, Defaulted, Paid, Current, Late
      • Q: Which new loans make it to funded? (Classification)
      • Q: Which funded loans make it to paid? (Classification)
      • Q: If funded, what will be the rate? (Regression)
  31. Prosper Data
      Provided in XML updates!!
      Pipeline components: fetch.sh (“curl” daily), export.sh, import.py, XML, bigml.sh, Model, Predict, Share in gallery
      Fields: Status, LoanStatus, BorrowerRate
  32. Prosper: Tidbits and Lessons Learned…
      • XML… yuck!
      • MongoDB has CSV export and is record based so it is easy to handle changing data structure.
      • Feature Engineering
        • There are 5 different classes of “bad” loans
        • Date cleanup
        • Type casting: floats and ints
      • Would be better to track over time
        • number of late payments
        • compare predictions and actuals
      • XML… yuck!
  33. Diabetes
      Fix Missing Values in a “Meaningful” Way
      Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin
      Datasets: Fixed Dataset, Amended Dataset, Original Dataset, Clean Dataset
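One possible reading of this workflow is: drop the rows where insulin is zero, fit a model for insulin on the remaining rows, then use its predictions to fill in the zeros. The sketch below follows that reading with a one-variable least-squares line standing in for the model, which is a swapped-in simplification and not what the slide uses. The rows, the `glucose` predictor, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fix_zeros(rows, target="insulin", predictor="glucose"):
    """Replace zero targets with predictions from a model fit on non-zero rows."""
    clean = [row for row in rows if row[target] != 0]           # filter zeros
    slope, intercept = fit_line(                                # model the target
        [row[predictor] for row in clean],
        [row[target] for row in clean],
    )
    fixed = []
    for row in rows:
        value = row[target]
        if value == 0:                                          # predict and select
            value = slope * row[predictor] + intercept
        fixed.append({**row, target: value})
    return fixed


# Hypothetical measurements; a zero insulin value marks a missing reading.
rows = [
    {"glucose": 100, "insulin": 50},
    {"glucose": 150, "insulin": 100},
    {"glucose": 200, "insulin": 150},
    {"glucose": 120, "insulin": 0},
]
fixed = fix_zeros(rows)
```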
  34. Stock Prices
      Shock: Deviations from Trend
      (/ (- (f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))
      That is: (Current - (4-day moving avg)) / std dev

      date  volume  price
      1     34353   314
      2     44455   315
      3     22333   315
      4     52322   321
      5     28000   320
      6     31254   319
      7     56544   323
      8     44331   324
      9     81111   287
      10    65422   294
      11    59999   300
      12    45556   302
      13    19899   301
      14    21453   302

      Sliding windows over price: [314], [314 315], [314 315 315], [314 315 315 321], [315 315 321 320], [315 321 320 319]
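The same shock score can be approximated outside Flatline. The Python sketch below uses the prices from the table and makes two assumptions that the slide does not settle: the standard deviation is the population standard deviation of the whole price column, and rows with fewer than four preceding days get no score.

```python
from statistics import mean, pstdev

# Daily prices from the table above (days 1 to 14).
PRICES = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]


def shock(prices, window=4):
    """(current - mean of the previous `window` prices) / std dev of all prices."""
    spread = pstdev(prices)  # assumption: population standard deviation
    scores = []
    for i, price in enumerate(prices):
        if i < window:
            scores.append(None)  # assumption: not enough history yet
        else:
            scores.append((price - mean(prices[i - window:i])) / spread)
    return scores


scores = shock(PRICES)
```

With these prices, the most negative score falls on day 9, where the price drops from 324 to 287.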
  35. Talend: Denormalization Example
      https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
  36. Talend: Denormalization Example
      https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/