Data Transformations Release
Introducing Data Transformations
BigML, Inc BigML Data Transformations Release Webinar
Speaker: POUL PETERSEN, M.Sc., Chief Infrastructure Officer
Moderator: ATAKAN CETINSOY, VP of Predictive Applications
Questions: Please enter questions into the chat box. We will answer some via chat and others at the end of the session.
Resources: https://bigml.com/releases/summer-2018
Contact: support@bigml.com
Twitter: @bigmlcom
Reality of an ML Application
[Figure: "Seen" vs. "Unseen" parts of an ML application. Labels: Self-Driving Cars?, Data Collection, Data Transformations, Feature Engineering, Evaluation & Retraining]
Effort of an ML Application
Task:
• State the problem as an ML task
• Data wrangling
• Feature engineering
• Modeling and Evaluations
• Predictions
• Measure Results
Effort:
• Data transformations: ~80% effort
• ~5% effort
• ~5% effort
• ~10% effort
This is only such low effort because of platforms like [logo]
Today’s release is the first step towards making this easy as well!
Problem Statement
• BigML’s SaaS https://bigml.com builds, on average, >40,000 trees/day
  • That’s trees only! Not counting LR, deepnets, clusters, etc.
  • And that is bigml.com only, not counting bigml.com.au
• We need to ensure that all models are started and finished ASAP
  • Started is “easy”: queue monitoring + auto-scaling + heuristics
  • Finished is harder: How do we know if a model is taking too long?
• What if we could predict how long a model should take to build?
  • Generate an alarm if it takes longer than, e.g., 120% of the predicted time
This sounds like a Machine Learning problem!!!
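The alarm rule from the last bullet can be sketched in a few lines. This is only an illustration: the function name `is_taking_too_long` and the `tolerance` parameter are hypothetical, and the 120% threshold is the example value from the slide, not a confirmed production setting.

```python
def is_taking_too_long(elapsed_ms: float, predicted_ms: float, tolerance: float = 1.2) -> bool:
    """Return True when a build has run longer than tolerance * predicted time."""
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return elapsed_ms > tolerance * predicted_ms


# Hypothetical check: a model predicted to take 1552 ms.
still_fine = is_taking_too_long(1673, 1552)   # 1673 ms is below 120% of 1552 ms
alarm = is_taking_too_long(2000, 1552)        # 2000 ms is above 120% of 1552 ms
print(still_fine, alarm)
```

The predicted time itself would come from the model trained on the build metadata; the threshold only decides how much slack to allow before alerting.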
The Data…
• Metadata from dataset:
  • Size in bytes, number of rows, number of columns, etc.
  • Number of numeric, categorical, datetime, text, and items fields
• Metadata from model:
  • Objective type: classification or regression
  • Tree options: node_depth, missing_splits, randomization, sample
  • Subcluster: relates to server size
• Objective:
  • Time elapsed to build the tree
The Data
Problem #1
There may be identical feature rows
• A user testing a script, re-building with the same parameters
• A Machine Learning class building the same model for an assignment
• Users following an online tutorial
• A BigML employee demoing the same dataset / model process
#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673
Same features, different time_ms
However it happens, this is not properly formatted for ML
How?
Transform & Aggregate
#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673
Collapse:
feature_key  time_ms
c3f6c8be4f   300
a243ca3c38   1431
14f9d917bc   8891
…            …
a243ca3c38   1673
Aggregate:
feature_key  avg_ms
c3f6c8be4f   300
a243ca3c38   1552
14f9d917bc   8891
…            …
All feature rows unique (fear not SQL experts)
How to get there…
Flatline
(sha1 (str (all-but "status.elapsed")))
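The idea behind the Flatline expression, hashing every field except the elapsed time into a single key, can be sketched in Python. The helper `feature_key`, the "|" separator, and the sample field names are assumptions for illustration; the string that Flatline's `str` produces is not shown in the slides, so the digests here will not match the feature_key values in the tables.

```python
import hashlib


def feature_key(row: dict, exclude: str = "status.elapsed") -> str:
    """Hash every field except `exclude`, so rows with identical features share a key."""
    # Sort by field name so the key does not depend on dict ordering.
    values = [str(row[name]) for name in sorted(row) if name != exclude]
    return hashlib.sha1("|".join(values).encode("utf-8")).hexdigest()


# Two builds with identical features but different elapsed times.
row_a = {"numeric": 0, "text": 1, "datetime": 0, "size": 1515654, "rows": 373, "status.elapsed": 1431}
row_b = {"numeric": 0, "text": 1, "datetime": 0, "size": 1515654, "rows": 373, "status.elapsed": 1673}
# A build with different features.
row_c = {"numeric": 34, "text": 0, "datetime": 0, "size": 46354, "rows": 1001, "status.elapsed": 300}

print(feature_key(row_a) == feature_key(row_b))  # same features, same key
print(feature_key(row_a) == feature_key(row_c))  # different features, different key
```

Once every row carries such a key, rows with the same features can be grouped on it and their elapsed times averaged.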
Aggregation: Count
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Count on User:
User     Count
User001  3
User005  2
User003  2
User002  1
Number of playbacks per user
Aggregation: Count Distinct
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Count distinct Genre on User:
User     Distinct Genre
User001  3
User005  2
User003  2
User002  1
Number of distinct Genre played per user
Aggregation: Count Missing
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Count missing Device on User:
User     Missing Device
User001  0
User005  0
User003  0
User002  1
Number of missing Device per user
Aggregation: Sum
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Sum Duration on User:
User     Sum Duration
User001  830
User005  521
User003  750
User002  218
Total Duration per User
Aggregation: Average
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Average Duration on User:
User     Average Duration
User001  276.67
User005  260.50
User003  375.00
User002  218.00
Average Duration per User
Aggregation: Maximum
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Maximum Duration on User:
User     Max Duration
User001  328
User005  281
User003  418
User002  218
Maximum Duration per User
Aggregation: Minimum
Content        Genre    Duration  Play Time            User     Device
Highway star   Rock     190       2015-05-12 16:29:33  User001  TV
Blues alive    Blues    281       2015-05-13 12:31:21  User005  Tablet
Lonely planet  Techno   332       2015-05-13 14:26:04  User003  TV
Dance, dance   Disco    312       2015-05-13 18:12:45  User001  Tablet
The wall       Reggae   218       2015-05-14 09:02:55  User002
Offside down   Techno   240       2015-05-14 11:26:32  User005  Tablet
The alchemist  Blues    418       2015-05-14 21:44:15  User003  TV
Bring me down  Classic  328       2015-05-15 06:59:56  User001  Tablet
Minimum Duration on User:
User     Min Duration
User001  190
User005  240
User003  332
User002  218
Minimum Duration per User
• Similar for standard deviation and variance
• Possible to combine multiple aggregations on the same field
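The aggregations on the playback table can be reproduced with a short Python sketch. It is plain Python rather than the BigML UI or API; the function `aggregate_by_user` and the tuple layout are illustrative assumptions, and the Play Time column is left out because none of the aggregations use it.

```python
from collections import defaultdict
from statistics import mean

# (content, genre, duration, user, device); None marks the missing Device.
PLAYS = [
    ("Highway star", "Rock", 190, "User001", "TV"),
    ("Blues alive", "Blues", 281, "User005", "Tablet"),
    ("Lonely planet", "Techno", 332, "User003", "TV"),
    ("Dance, dance", "Disco", 312, "User001", "Tablet"),
    ("The wall", "Reggae", 218, "User002", None),
    ("Offside down", "Techno", 240, "User005", "Tablet"),
    ("The alchemist", "Blues", 418, "User003", "TV"),
    ("Bring me down", "Classic", 328, "User001", "Tablet"),
]


def aggregate_by_user(plays):
    """Group playbacks by user and compute several aggregations per group."""
    groups = defaultdict(list)
    for _content, genre, duration, user, device in plays:
        groups[user].append((genre, duration, device))

    summary = {}
    for user, rows in groups.items():
        durations = [duration for _, duration, _ in rows]
        summary[user] = {
            "count": len(rows),
            "distinct_genre": len({genre for genre, _, _ in rows}),
            "missing_device": sum(1 for _, _, device in rows if device is None),
            "sum_duration": sum(durations),
            "avg_duration": round(mean(durations), 2),
            "max_duration": max(durations),
            "min_duration": min(durations),
        }
    return summary


summary = aggregate_by_user(PLAYS)
for user, stats in summary.items():
    print(user, stats)
```

Computing all aggregations in one pass over each group mirrors the last bullet: several aggregations can be combined on the same field, here Duration.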
Aggregations
Problem #2
We have this…
Dataset 1:
#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1431
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1673
Dataset 2:
feature_key  avg_ms
c3f6c8be4f   300
a243ca3c38   1552
14f9d917bc   8891
…            …
We want this…
Dataset:
#numeric  #text  #datetime  size     rows    …  avg_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
Joins
• Datasets to join need to have a field in common
  • joining sales and demographics on customer_id
  • joining employee and budget details on department_id
• Datasets to join do not need to have the same dimensions
• Joins can be performed in several ways
  • Left, Right, Inner, Outer…
Left Join
• In a Left join of dataset A to B:
  • Returns all records from the left dataset A, and the matched records from B
  • The result is NULL from B if there is no match
A:
_id  field1
1    34
2    56
3    123
4    56
5    79
B:
_id  field2
1    red
2    green
4    blue
6    black
A left join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null
B has no “3” or “5”
Right Join
• In a Right join of dataset A to B:
  • Returns all records from the right dataset B, and the matched records from A
  • The result is NULL from A if there is no match
A:
_id  field1
1    34
2    56
3    123
4    56
5    79
B:
_id  field2
1    red
2    green
4    blue
6    black
A right join B =
_id  field2  field1
1    red     34
2    green   56
4    blue    56
6    black   null
A has no “6”; “3” and “5” unused
Inner Join
• In an Inner join of dataset A to B:
  • Returns only the records from A that match records from B
  • If there is no match between A and B, the record is ignored
A:
_id  field1
1    34
2    56
3    123
4    56
5    79
B:
_id  field2
1    red
2    green
4    blue
6    black
A inner join B =
_id  field1  field2
1    34      red
2    56      green
4    56      blue
“3” and “5” from A unused; “6” from B unused
Full Outer Join
• In a Full join of dataset A to B:
  • Returns all records from A and all records from B
  • If a record has no match in the other dataset, the missing field is null
A:
_id  field1
1    34
2    56
3    123
4    56
5    79
B:
_id  field2
1    red
2    green
4    blue
6    black
A full join B =
_id  field1  field2
1    34      red
2    56      green
3    123     null
4    56      blue
5    79      null
6    null    black
A has no “6”; B has no “3” or “5”
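The four join types can be compared side by side with a small Python sketch over the A and B tables from the slides. The `join` helper is hypothetical and assumes each `_id` appears at most once per dataset; `None` plays the role of null. It illustrates the semantics only and is not how BigML implements joins.

```python
def join(a, b, how="left"):
    """Join two {_id: value} mappings; returns {_id: (field1, field2)}, None where unmatched."""
    if how == "left":
        keys = list(a)                                   # every record of A
    elif how == "right":
        keys = list(b)                                   # every record of B
    elif how == "inner":
        keys = [k for k in a if k in b]                  # only records present in both
    elif how == "full":
        keys = list(a) + [k for k in b if k not in a]    # records from either side
    else:
        raise ValueError(f"unknown join type: {how}")
    return {k: (a.get(k), b.get(k)) for k in keys}


A = {1: 34, 2: 56, 3: 123, 4: 56, 5: 79}
B = {1: "red", 2: "green", 4: "blue", 6: "black"}

for how in ("left", "right", "inner", "full"):
    print(how, join(A, B, how))
```

The only difference between the four variants is which set of keys survives; the lookup of field1 and field2 is identical in every case.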
Joins
Problem #3
Left join keeps all records from the left dataset
#numeric  #text  #datetime  size     rows    …  avg_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891
…         …      …          …        …       …  …
0         1      0          1515654  373     …  1552
Same rows: the second and last rows are now identical
Remove Duplicates
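Removing duplicates after the left join amounts to keeping one copy of each identical row. A minimal Python sketch, assuming rows are tuples of the visible columns (the elided "…" columns are omitted) and using a hypothetical helper `remove_duplicates`:

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# (#numeric, #text, #datetime, size, rows, avg_ms) after the left join.
joined = [
    (34, 0, 0, 46354, 1001, 300),
    (0, 1, 0, 1515654, 373, 1552),
    (4, 0, 1, 56789, 445423, 8891),
    (0, 1, 0, 1515654, 373, 1552),
]

deduplicated = remove_duplicates(joined)
print(len(joined), len(deduplicated))
```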
Using the API
• The UI has a limited set of data transformations
  • Aggregation (limited), Joins (limited), Remove Duplicates
  • More functions will be added: concat, ordering, multiple group by
• The API supports nearly full SQL syntax for transforming datasets
  • Nested queries not supported (yet), e.g. subselects
• Better way to perform the workflow:
  • SELECT 10001a, avg(000019) AS avg_status_elapsed FROM DS GROUP BY 10001a
  • Can perform the entire workflow in one SQL statement using multiple “group by”
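The shape of the GROUP BY query can be tried locally with Python's built-in sqlite3 module. This is a stand-in, not the BigML API: the table `ds` and the readable column names `feature_key` and `status_elapsed` are assumptions replacing the field IDs 10001a and 000019 used on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds (feature_key TEXT, status_elapsed INTEGER)")
conn.executemany(
    "INSERT INTO ds VALUES (?, ?)",
    [
        ("c3f6c8be4f", 300),
        ("a243ca3c38", 1431),
        ("14f9d917bc", 8891),
        ("a243ca3c38", 1673),
    ],
)

# Same structure as the slide's query: average elapsed time per feature key.
rows = conn.execute(
    "SELECT feature_key, avg(status_elapsed) AS avg_status_elapsed "
    "FROM ds GROUP BY feature_key ORDER BY feature_key"
).fetchall()
conn.close()

avg_by_key = dict(rows)
print(avg_by_key)
```

The two rows sharing the key a243ca3c38 collapse into one row whose average is 1552, matching the aggregated table from the earlier slides.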
API Transformations
Problem #4
How do we add new data and retrain?
• When adding a new batch of data
  • Avoid re-uploading by using a merge
  • Repeat the entire workflow on the merged dataset using Scriptify
Current dataset:
#numeric  #text  #datetime  size     rows    …  time_ms
34        0      0          46354    1001    …  300
0         1      0          1515654  373     …  1552
4         0      1          56789    445423  …  8891
New batch:
#numeric  #text  #datetime  size     rows    …  time_ms
12        2      0          74001    200     …  1975
1         0      1          22673    373     …  1552
1056      0      1          9231411  4352    …  7675
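Conceptually, a merge appends the rows of the new batch to the existing dataset, provided both share the same columns. The sketch below is a plain-Python illustration under that assumption; `merge_batches` and the dict layout are hypothetical and do not represent the BigML merge API or a Scriptify-generated script.

```python
def merge_batches(current, new_batch):
    """Append new_batch rows to current, requiring identical column names."""
    if current["columns"] != new_batch["columns"]:
        raise ValueError("datasets must share the same columns to be merged")
    return {"columns": current["columns"], "rows": current["rows"] + new_batch["rows"]}


COLUMNS = ["numeric", "text", "datetime", "size", "rows", "time_ms"]

current = {
    "columns": COLUMNS,
    "rows": [(34, 0, 0, 46354, 1001, 300), (0, 1, 0, 1515654, 373, 1552), (4, 0, 1, 56789, 445423, 8891)],
}
new_batch = {
    "columns": COLUMNS,
    "rows": [(12, 2, 0, 74001, 200, 1975), (1, 0, 1, 22673, 373, 1552), (1056, 0, 1, 9231411, 4352, 7675)],
}

merged = merge_batches(current, new_batch)
print(len(merged["rows"]))
```

The earlier steps (feature key, aggregate, join, remove duplicates, retrain) would then be rerun on the merged dataset rather than on a fresh upload.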
Merging & Scriptify
More Info: https://bigml.com/releases/summer-2018
Questions?
@bigmlcom support@bigml.com
