
BigML Release: Data Transformations


BigML brings Data Transformations to the BigML platform, a key part of any Machine Learning workflow. Data rarely arrive ready for a Machine Learning project: they can be noisy and come from many different sources in many different formats, so a preparation phase is necessary before applying Machine Learning. With this release, BigML adds new Data Transformation capabilities that greatly enhance the existing ones. Discover the ability to perform SQL-style queries, Flatline editor improvements, and more ways to do feature engineering.

Published in: Data & Analytics

  1. 1. Data Transformations Release Introducing Data Transformations
  2. 2. BigML, Inc. BigML Data Transformations Release Webinar. Speaker: Poul Petersen, M.Sc., Chief Infrastructure Officer. Moderator: Atakan Cetinsoy, VP of Predictive Applications. Questions: please enter questions into the chat box; we will answer some via chat and others at the end of the session. Resources: https://bigml.com/releases/summer-2018. Contact: support@bigml.com. Twitter: @bigmlcom.
  3. 3. BigML, Inc BigML Data Transformations Release Webinar Reality of a ML Application Data Transformations Feature Engineering Data Collection Evaluation & Retraining Seen Unseen Self-Driving Cars? !3
  4. 4. BigML, Inc BigML Data Transformations Release Webinar Effort of a ML Application State the problem as an ML task Data wrangling Feature engineering Modeling and Evaluations Predictions Measure Results Data transformations ~80% effort ~5% effort ~5% effort This is only such low effort because of platforms like Today’s release is the first step towards making this easy as well! Task ~10% effort Effort !4
  5. 5. Problem Statement
     • BigML's SaaS (https://bigml.com) builds, on average, >40,000 trees/day
       • That's trees only! Not counting LR, deepnets, clusters, etc.
       • And that is bigml.com only, not counting bigml.com.au
     • We need to ensure that all models are started and finished ASAP
       • Started is "easy": queue monitoring + auto-scaling + heuristics
       • Finished is harder: how do we know if a model is taking too long?
     • What if we could predict how long a model should take to build?
       • Generate an alarm if it takes longer than, e.g., 120% of the predicted time
     This sounds like a Machine Learning problem!
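
The alarm rule on slide 5 can be sketched in a few lines of Python. This is only an illustration: the function name `is_overdue`, its parameters, and the use of milliseconds are assumptions, not part of BigML's monitoring code. The 1.2 default mirrors the "120% of the predicted time" example from the slide.

```python
def is_overdue(elapsed_ms: float, predicted_ms: float, tolerance: float = 1.2) -> bool:
    """Return True when a build has run longer than tolerance * predicted time.

    Hypothetical helper: `tolerance=1.2` corresponds to the slide's
    "120% of the predicted time" example.
    """
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return elapsed_ms > tolerance * predicted_ms


# A model predicted to take 1000 ms raises an alarm at 1300 ms but not at 1100 ms.
alarm_late = is_overdue(1300, 1000)
alarm_ok = is_overdue(1100, 1000)
```

The predicted time itself would come from the regression model the rest of the deck builds; the check above only covers the comparison step.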
  6. 6. The Data…
     • Metadata from dataset:
       • Size in bytes, number of rows, number of columns, etc.
       • Number of numeric, categorical, datetime, text, and items fields
     • Metadata from model:
       • Objective type: classification or regression
       • Tree options: node_depth, missing_splits, randomization, sample
       • Subcluster: relates to server size
     • Objective:
       • Time elapsed to build the tree
  7. 7. BigML, Inc BigML Data Transformations Release Webinar The Data !7
  8. 8. Problem #1: There may be identical feature rows. How?
     • A user testing a script, re-building with the same parameters
     • A Machine Learning class building the same model for an assignment
     • Users following an online tutorial
     • A BigML employee demoing the same dataset / model process

     #numeric  #text  #datetime  size     rows    …  time_ms
     34        0      0          46354    1001    …  300
     0         1      0          1515654  373     …  1431
     4         0      1          56789    445423  …  8891
     …         …      …          …        …       …  …
     0         1      0          1515654  373     …  1673

     The second and last rows have the same feature values but a different time_ms. However it happens, this is not properly formatted for ML.
  9. 9. How to get there… Transform & Aggregate (fear not, SQL experts)

     Original rows:
     #numeric  #text  #datetime  size     rows    …  time_ms
     34        0      0          46354    1001    …  300
     0         1      0          1515654  373     …  1431
     4         0      1          56789    445423  …  8891
     …         …      …          …        …       …  …
     0         1      0          1515654  373     …  1673

     Collapse (transform the feature columns into a single key):
     feature_key  time_ms
     c3f6c8be4f   300
     a243ca3c38   1431
     14f9d917bc   8891
     …            …
     a243ca3c38   1673

     Aggregate (all feature rows unique):
     feature_key  avg_ms
     c3f6c8be4f   300
     a243ca3c38   1552
     14f9d917bc   8891
     …            …
  10. 10. Flatline: (sha1 (str (all-but "status.elapsed")))
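
The collapse-then-aggregate idea from slides 9 and 10 can be sketched in plain Python. This is a rough analogue, not Flatline: the helper names `feature_key` and `average_by_key`, the dictionary field names, and the way the fields are serialized before hashing are all assumptions, so the resulting SHA-1 digests will not match the keys shown on the slides.

```python
import hashlib
from collections import defaultdict


def feature_key(row, exclude=("time_ms",)):
    """Hash every field except the excluded ones into a single SHA-1 key.

    The "name=value|..." serialization is an arbitrary choice for this sketch.
    """
    parts = [f"{name}={row[name]}" for name in sorted(row) if name not in exclude]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def average_by_key(rows, target="time_ms"):
    """Group rows by feature key and average the target field per group."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        acc = totals[feature_key(row, exclude=(target,))]
        acc[0] += row[target]
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Sample rows modeled on the slide; the second and fourth share all features.
rows = [
    {"numeric": 34, "text": 0, "datetime": 0, "size": 46354, "rows": 1001, "time_ms": 300},
    {"numeric": 0, "text": 1, "datetime": 0, "size": 1515654, "rows": 373, "time_ms": 1431},
    {"numeric": 4, "text": 0, "datetime": 1, "size": 56789, "rows": 445423, "time_ms": 8891},
    {"numeric": 0, "text": 1, "datetime": 0, "size": 1515654, "rows": 373, "time_ms": 1673},
]
averages = average_by_key(rows)
```

With these four rows the two identical feature rows collapse into one key whose average is (1431 + 1673) / 2 = 1552, matching the avg_ms value on the slide.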
  11. 11. Aggregation: Count

      Content        Genre    Duration  Play Time            User     Device
      Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
      Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
      Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
      Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
      The wall       Reggae   218       2015-05-14 09:02:55  User002
      Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
      The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
      Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet

      Count on User: number of playbacks per user
      User     Count
      User001  3
      User005  2
      User003  2
      User002  1
  12. 12. Aggregation: Count Distinct (same playback table as slide 11)

      Count distinct Genre on User: number of distinct Genre played per user
      User     Distinct Genre
      User001  3
      User005  2
      User003  2
      User002  1
  13. 13. Aggregation: Count Missing (same playback table as slide 11)

      Count missing Device on User: number of missing Device per user
      User     Missing Device
      User001  0
      User005  0
      User003  0
      User002  1
  14. 14. Aggregation: Sum (same playback table as slide 11)

      Sum Duration on User: total Duration per user
      User     Sum Duration
      User001  830
      User005  521
      User003  750
      User002  218
  15. 15. Aggregation: Average (same playback table as slide 11)

      Average Duration on User: average Duration per user
      User     Average Duration
      User001  276.67
      User005  260.50
      User003  375.00
      User002  218.00
  16. 16. Aggregation: Maximum (same playback table as slide 11)

      Maximum Duration on User: maximum Duration per user
      User     Max Duration
      User001  328
      User005  281
      User003  418
      User002  218
  17. 17. Aggregation: Minimum (same playback table as slide 11)

      Minimum Duration on User: minimum Duration per user
      User     Min Duration
      User001  190
      User005  240
      User003  332
      User002  218

      • Similar for standard deviation and variance
      • Possible to combine multiple aggregations on the same field
  18. 18. BigML, Inc BigML Data Transformations Release Webinar Aggregations !18
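
The per-user aggregations from slides 11 to 17 can be reproduced with a short plain-Python sketch. It is not BigML code: the tuple layout, the `aggregate_by_user` name, and the use of `None` for the missing Device are assumptions made for the example, and the Play Time column is left out because none of the aggregations use it.

```python
from collections import defaultdict
from statistics import mean

# (content, genre, duration, user, device); None marks the missing Device value.
PLAYS = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reggae", 218, "User002", None),
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]


def aggregate_by_user(plays):
    """Group playbacks by user and compute the aggregations shown on the slides."""
    groups = defaultdict(list)
    for play in plays:
        groups[play[3]].append(play)

    summary = {}
    for user, rows in groups.items():
        durations = [row[2] for row in rows]
        summary[user] = {
            "count": len(rows),
            "distinct_genre": len({row[1] for row in rows}),
            "missing_device": sum(1 for row in rows if row[4] is None),
            "sum_duration": sum(durations),
            "avg_duration": round(mean(durations), 2),
            "max_duration": max(durations),
            "min_duration": min(durations),
        }
    return summary


summary = aggregate_by_user(PLAYS)
```

Computing every aggregation in one pass over the groups mirrors the slide's note that multiple aggregations can be combined on the same field.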
  19. 19. Problem #2

      We have this…

      Dataset 1:
      #numeric  #text  #datetime  size     rows    …  time_ms
      34        0      0          46354    1001    …  300
      0         1      0          1515654  373     …  1431
      4         0      1          56789    445423  …  8891
      …         …      …          …        …       …  …
      0         1      0          1515654  373     …  1673

      Dataset 2:
      feature_key  avg_ms
      c3f6c8be4f   300
      a243ca3c38   1552
      14f9d917bc   8891
      …            …

      We want this…

      Dataset:
      #numeric  #text  #datetime  size     rows    …  avg_ms
      34        0      0          46354    1001    …  300
      0         1      0          1515654  373     …  1552
      4         0      1          56789    445423  …  8891
      …         …      …          …        …       …  …
  20. 20. Joins
      • Datasets to join need to have a field in common
        • joining sales and demographics on customer_id
        • joining employee and budget details on department_id
      • Datasets to join do not need to have the same dimensions
      • Joins can be performed in several ways
        • Left, Right, Inner, Outer…
  21. 21. Left Join
      • In a Left join of dataset A to B:
        • Returns all records from the left A, and the matched records from B
        • The result is NULL from B if there is no match

      A:            B:
      _id  field1   _id  field2
      1    34       1    red
      2    56       2    green
      3    123      4    blue
      4    56       6    black
      5    79

      A left join B (B has no "3" or "5"):
      _id  field1  field2
      1    34      red
      2    56      green
      3    123     null
      4    56      blue
      5    79      null
  22. 22. Right Join
      • In a Right join of dataset A to B:
        • Returns all records from the right B, and the matched records from A
        • The result is NULL from A if there is no match

      A:            B:
      _id  field1   _id  field2
      1    34       1    red
      2    56       2    green
      3    123      4    blue
      4    56       6    black
      5    79

      A right join B (A has no "6"; "3" and "5" unused):
      _id  field2  field1
      1    red     34
      2    green   56
      4    blue    56
      6    black   null
  23. 23. Inner Join
      • In an Inner join of dataset A to B:
        • Returns only records from the left A that match records from B
        • If there is no match between A and B, the record is ignored

      A:            B:
      _id  field1   _id  field2
      1    34       1    red
      2    56       2    green
      3    123      4    blue
      4    56       6    black
      5    79

      A inner join B ("3" and "5" from A unused; "6" from B unused):
      _id  field1  field2
      1    34      red
      2    56      green
      4    56      blue
  24. 24. Full Outer Join
      • In a Full join of dataset A to B:
        • Returns all records from the left A and all records from the right B
        • If there is no match in either A or B, the field is null

      A:            B:
      _id  field1   _id  field2
      1    34       1    red
      2    56       2    green
      3    123      4    blue
      4    56       6    black
      5    79

      A full join B (A has no "6"; B has no "3" or "5"):
      _id  field1  field2
      1    34      red
      2    56      green
      3    123     null
      4    56      blue
      5    79      null
      6    null    black
  25. 25. BigML, Inc BigML Data Transformations Release Webinar Joins !25
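
The four join types from slides 21 to 24 can be illustrated with the same A and B tables in plain Python. This is a minimal sketch, not how BigML implements joins: the `join` function and its `how` argument are hypothetical, it assumes `_id` is unique within each table, and it always returns columns in the order `_id`, `field1`, `field2` (the right-join slide lists field2 first). `None` stands in for null.

```python
A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}           # _id -> field1
B = {1: "red", 2: "green", 4: "blue", 6: "black"}  # _id -> field2


def join(left, right, how="left"):
    """Join two {_id: value} tables and return rows of (_id, left value, right value)."""
    if how == "left":
        ids = list(left)                        # every _id in A
    elif how == "right":
        ids = list(right)                       # every _id in B
    elif how == "inner":
        ids = [i for i in left if i in right]   # only _ids present in both
    elif how == "full":
        ids = sorted(set(left) | set(right))    # every _id from either table
    else:
        raise ValueError(f"unknown join type: {how}")
    # dict.get returns None where a table has no matching _id.
    return [(i, left.get(i), right.get(i)) for i in ids]


left_rows = join(A, B, "left")
right_rows = join(A, B, "right")
inner_rows = join(A, B, "inner")
full_rows = join(A, B, "full")
```

Each result list has the same rows as the corresponding slide: five for the left join, four for the right join, three for the inner join, and six for the full join.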
  26. 26. Problem #3: Left join keeps all records from the left dataset

      #numeric  #text  #datetime  size     rows    …  avg_ms
      34        0      0          46354    1001    …  300
      0         1      0          1515654  373     …  1552
      4         0      1          56789    445423  …  8891
      …         …      …          …        …       …  …
      0         1      0          1515654  373     …  1552

      The second and last rows are now the same rows.
  27. 27. BigML, Inc BigML Data Transformations Release Webinar Remove Duplicates !27
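
Removing the duplicate rows left behind by the left join can be sketched as an order-preserving de-duplication. The `remove_duplicates` helper and the tuple layout are assumptions for illustration, and the sample rows assume that the two rows marked as the same on slide 26 are identical in every column.

```python
def remove_duplicates(rows):
    """Return rows with exact duplicates dropped, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# (numeric, text, datetime, size, rows, avg_ms), modeled on the joined dataset.
joined = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1552),
    (4, 0, 1, 56789, 445423, 8891),
    (0, 1, 0, 1515654, 373, 1552),
]
deduplicated = remove_duplicates(joined)
```

Because the averaged target is identical for rows that share a feature key, dropping exact duplicates leaves one row per unique feature combination.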
  28. 28. Using the API
      • The UI has a limited set of data transformations
        • Aggregation (limited), Joins (limited), Remove Duplicates
        • More functions will be added: concat, ordering, multiple group by
      • The API supports nearly full SQL syntax for transforming datasets
        • Nested queries not supported (yet), e.g. subselects
      • Better way to perform the workflow:
        • SELECT 10001a, avg(000019) AS avg_status_elapsed FROM DS GROUP BY 10001a
        • Can perform the entire workflow in one SQL query using multiple "group by"
  29. 29. BigML, Inc BigML Data Transformations Release Webinar API Transformations !29
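
The shape of the GROUP BY query from slide 28 can be tried locally with Python's built-in sqlite3 module. This is only an analogy: SQLite is not the BigML API, and the table name `ds` and the column names `feature_key` and `status_elapsed` are readable stand-ins for the field IDs (10001a, 000019) used on the slide.

```python
import sqlite3

# In-memory database standing in for a dataset; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds (feature_key TEXT, status_elapsed REAL)")
conn.executemany(
    "INSERT INTO ds VALUES (?, ?)",
    [
        ("c3f6c8be4f", 300),
        ("a243ca3c38", 1431),
        ("14f9d917bc", 8891),
        ("a243ca3c38", 1673),
    ],
)

# Same structure as the slide's query: one averaged value per feature key.
result = conn.execute(
    "SELECT feature_key, avg(status_elapsed) AS avg_status_elapsed "
    "FROM ds GROUP BY feature_key ORDER BY feature_key"
).fetchall()
conn.close()
```

A single grouped query like this replaces the separate collapse, aggregate, join, and remove-duplicates steps, which is the point the slide makes about performing the workflow in one SQL statement.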
  30. 30. Problem #4: How do we add new data and retrain?
      • When adding a new batch of data
        • Avoid re-uploading by using a merge
        • Repeat the entire workflow on the merged dataset using Scriptify

      #numeric  #text  #datetime  size     rows    …  time_ms
      12        2      0          74001    200     …  1975
      1         0      1          22673    373     …  1552
      1056      0      1          9231411  4352    …  7675

      #numeric  #text  #datetime  size     rows    …  time_ms
      34        0      0          46354    1001    …  300
      0         1      0          1515654  373     …  1552
      4         0      1          56789    445423  …  8891
  31. 31. BigML, Inc BigML Data Transformations Release Webinar Merging & Scriptify !31
  32. 32. More Info: https://bigml.com/releases/summer-2018
  33. 33. Questions? @bigmlcom support@bigml.com
