Class summary
BigML, Inc 2
Day 2 – Morning sessions
Basic transformations
Expectations: Any data is always ML-ready
Reality: ML-ready data needs work!!!
What does ML-ready mean?
● Machine Learning algorithms consume instances of the question that you want to model. Each row must describe one of the instances and each column a property of the instance.
● Fields can be:
– already present in your data
– derived from your data
– generated using other fields
ML-ready steps
● Select the right model for the problem you want to solve: classification, regression, cluster analysis, anomaly detection, association discovery, topic modeling, etc.
● Perform cleansing, denormalizing, aggregating, pivoting, and other data wrangling tasks to generate a collection of instances relevant to the problem at hand. Finally, write the output in a very common format: CSV.
● Choose the right format to store each type of feature in a field.
● Feature engineering: using domain knowledge and Machine Learning expertise, generate explicit features that better represent the instances (Flatline).
Preprocessing data
Cleansing: Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, etc.
Denormalizing: Data is usually normalized in relational databases; ML-ready datasets need the information denormalized into a single file/dataset.
Aggregation: When data is stored as individual transactions, as in log files, an aggregation may be needed to obtain one row per entity.
Pivoting: Different values of a feature are pivoted to new columns in the resulting dataset.
Regular time windows: Create new features using values over different periods of time.
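As a minimal sketch of aggregation and pivoting, the following uses only the Python standard library. The transaction log, the field names (customer, category, amount) and the helper aggregate_and_pivot are hypothetical and chosen only for illustration; they are not part of any BigML tool.

```python
from collections import defaultdict

# Hypothetical transaction log: one row per purchase, not per customer.
transactions = [
    {"customer": "c1", "category": "books", "amount": 12.0},
    {"customer": "c1", "category": "music", "amount": 8.0},
    {"customer": "c2", "category": "books", "amount": 30.0},
    {"customer": "c1", "category": "books", "amount": 5.0},
]

def aggregate_and_pivot(rows):
    """Return one instance per customer: a total plus one column per category."""
    categories = sorted({row["category"] for row in rows})
    instances = defaultdict(
        lambda: {"total": 0.0, **{f"amount_{c}": 0.0 for c in categories}}
    )
    for row in rows:
        instance = instances[row["customer"]]
        instance["total"] += row["amount"]                      # aggregation
        instance[f"amount_{row['category']}"] += row["amount"]  # pivoting
    return dict(instances)

instances = aggregate_and_pivot(transactions)
```

Each key of instances is now one entity (a customer), and each category value has become its own column, which is the row-per-instance shape described above.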
Feature engineering
For numeric features:
– Discretization: percentiles, within percentiles, groups
– Replacement
– Normalization
– Exponentiation
– Shocks (speed of change compared to stdev)
For text features:
– Misspellings
– Length
– Number of subordinate sentences
– Language
– Levenshtein distance
Stacking
Compute a field using non-linear combinations of other fields
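One of the text features listed above, the Levenshtein distance, can be sketched with the standard dynamic-programming recurrence. This is an illustrative implementation, not the one used by any particular tool; a typical use would be the distance between a free-text value and a reference spelling.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical use as a feature: distance of a typed value to a reference value.
distance_to_reference = levenshtein("Barcelonna", "Barcelona")
```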
Holistic approach
● Define a clear idea of the goal.
● Understand what ML tasks will achieve the goal.
● Understand the data structure needed to perform those ML tasks.
● Find out what kind of data you have and make it ML-ready:
– where is it, how is it stored?
– what are the features?
– can you access it programmatically?
● Feature engineering: transform the data you have into the data you actually need.
● Evaluate: try it on a small scale.
● Accept that you might have to start over...
● But when it works, automate it!!!
Tools that help
Command line tools:
join, jq, awk, sed, sort, uniq
Automation:
Shell, Python, etc.
Talend
BigML: Flatline, bindings, BigMLer, API, WhizzML
Relational DB:
MySQL
Non-relational DB:
MongoDB
Feature Engineering
Data + ML Algorithm, is that enough?
The ML algorithm only knows about the features in the dataset. Features can be useless to the algorithm if:
● They are not correlated with the objective to be predicted
● Their values change meaning when combined with other features
For ML algorithms to work, there must be some kind of statistical relation between some of the features and the objective. Sometimes you must transform the available features to find such relations.
Feature engineering: the process of transforming raw data into machine-learning-ready data
When do you need Feature Engineering?
● When the relationship between the feature and the objective is mathematically unsatisfying
● When the relationship of a function of two or more features with the objective is far more relevant than that of the original features
● When there is missing data
● When the data is time-series, especially when the previous time period’s objective is known
● When the data can’t be used for machine learning in the obvious way (e.g., timestamps, text data)
Mathematical transformations
● Statistical aggregations (group by, all and all-but)
● Better categories
– too many detailed categories should be avoided
– ordered categories can be translated to numeric values. The model will be able to extract more information by partitioning the ordered number range
● Binning or discretization: consider whether your number is more informative in ranges (quartiles, deciles, percentiles), even for the objective field
● Linearization: not important for decision trees but can be for logistic regression (watch out for exponential distributions)
Missing data
● Missing value imputation (replace missing values with common values: mean, median, mode, or even the prediction of a Machine Learning model)
● The presence of missing values can be informative, so it can be added as a new feature
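The two missing-data ideas above can be combined in a few lines. This is a minimal sketch assuming missing values are represented as None and that the median is an acceptable replacement; the field name ages and the helper impute_with_indicator are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Replace None with the median of the known values and flag what was missing."""
    known = [v for v in values if v is not None]
    fill = median(known)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]  # new, possibly informative feature
    return imputed, was_missing

ages = [30, None, 50, 40, None]  # hypothetical numeric field
imputed_ages, age_missing = impute_with_indicator(ages)
```

Keeping the indicator as a separate field lets the model use the fact that a value was absent, instead of losing that information when the gap is filled.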
Time-series transformations
● Better objective (percent change instead of absolute values)
● Deltas from previous reference time points
● Deltas from moving average (time windows)
● Recent volatility...
Problem: Exponential explosion of possible transformations
Caveats:
● The regularity in time of the points has to match your training data
● You have to keep track of past points to compute your windows
● It is really easy to get information leakage by including your objective in a window computation (and it can be very hard to detect)!
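A minimal sketch of these transformations, assuming a regularly spaced numeric series held in a Python list. The function window_features and the sample prices are hypothetical. The window is built only from points before time t, so the value being predicted never enters its own features.

```python
def window_features(series, window=3):
    """For each point with enough history, build features from PAST values only."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a previous delta")
    rows = []
    for t in range(window, len(series)):
        past = series[t - window:t]  # excludes series[t] to avoid leakage
        moving_avg = sum(past) / window
        rows.append({
            "t": t,
            "delta_prev": series[t - 1] - series[t - 2],
            "delta_from_avg": series[t - 1] - moving_avg,
            # Objective expressed as percent change instead of absolute value.
            "objective_pct_change": (series[t] - series[t - 1]) / series[t - 1],
        })
    return rows

prices = [100.0, 102.0, 101.0, 105.0, 110.0]  # hypothetical, regularly spaced
features = window_features(prices)
```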
Date-time features
● Cannot be used “as is” in a model: a date-time is a collection of features. BigML is able to decompose them automatically when they are provided in the most usual formats. With Flatline, you can decompose them all.
● Date-time predicates that the computer does not know (some of them domain-dependent): Working hours? Daylight? Is rush hour?...
Text features
● Bag of words: a new feature is associated with each word in the document
● Tokenization: how do we select tokens? Do we want n-grams? What about numbers?
● Stemming: grouping forms of the same word under a single term
● Length
● Text predicates: Dollar amounts? Dates? Salutations? Please and Thank you?
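The date-time and text ideas above can be sketched in plain Python. Everything here is illustrative: the definition of working hours (Monday to Friday, 09:00 to 16:59), the letters-only tokenizer and the dollar-amount pattern are assumptions, and none of this reproduces what BigML or Flatline do internally.

```python
import re
from collections import Counter
from datetime import datetime

def datetime_features(timestamp: str) -> dict:
    """Decompose an ISO 8601 timestamp into separate fields plus one predicate."""
    moment = datetime.fromisoformat(timestamp)
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "weekday": moment.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "hour": moment.hour,
        # Assumed definition of working hours: Monday to Friday, 09:00 to 16:59.
        "working_hours": moment.isoweekday() <= 5 and 9 <= moment.hour < 17,
    }

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, keep runs of letters as tokens, and count each token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def has_dollar_amount(text: str) -> bool:
    """Text predicate: does the text mention an amount such as $12 or $3.50?"""
    return re.search(r"\$\d+(\.\d+)?", text) is not None
```

The tokenizer drops numbers and punctuation on purpose, which is one possible answer to the tokenization questions above; a different choice (n-grams, keeping numbers) would only change the regular expression.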
Machine Learning for Feature engineering
Latent Dirichlet Allocation
• Learn word distributions for topics
• Infer topic scores for each document
• Use the topic scores as features to a model (dimensionality reduction)
Distance to cluster centroids
Stacked generalization: classifiers provide new features
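As a small illustration of the distance-to-centroids idea, the sketch below assumes the centroids were already produced by some clustering model and are available as numeric tuples. The function name and the sample centroids are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math

def centroid_distance_features(point, centroids):
    """Return the Euclidean distance from point to each centroid as new features."""
    return {
        f"dist_to_centroid_{i}": math.dist(point, centroid)
        for i, centroid in enumerate(centroids)
    }

centroids = [(0.0, 0.0), (3.0, 4.0)]  # hypothetical output of a clustering step
features = centroid_distance_features((0.0, 0.0), centroids)
```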
Day 2 – Evening sessions
REST API, bindings and basic workflows
What do Machine Learning workflows look like?
[Figure: Academics vs. Real world]
We need high-level tools to face real-world workflows by growing in:
● Automation
● Abstraction
The foundations
● REST API-first applications: Standards in software development. First level of abstraction.
Client-side tools
● Web UI: Sitting on top of the REST API. Human-friendly access and visualizations for all the Machine Learning resources. Workflows must be defined and executed step by step. Second level of abstraction.
● Bindings: Sitting on top of the REST API. Fine-grained accessors for the REST API calls. Workflows must be defined and executed step by step. Second level of abstraction.
● BigMLer: Relying on the bindings. High-level syntax. Entire workflows can be created in only one command line. Third level of abstraction.
BigMLer automation
● Basic 1-click workflows in one command line
● Rich parameterized workflows: feature selection, cross-validation, etc.
● Models are downloaded to your laptop, tablet, cell phone, etc. once and can be used offline to create predictions
Great for local predictions
Still...
Problems of client-side solutions
● Complexity: Lots of details outside the problem domain
● Reuse: No inter-language compatibility
● Scalability: Client-side workflows hard to optimize
● Extensibility: BigMLer hides complexity at the cost of flexibility
● Not enough abstraction
WhizzML
Solution: bringing automation and abstraction to the server side
● DSL for ML workflow automation
● Framework for scalable, remote execution of ML workflows
Sophisticated server-side optimization
Out-of-the-box scalability
Client-server brittleness removed
Infrastructure for creating and sharing ML scripts and libraries
WhizzML's new REST API resources:
Scripts: Executable code that describes an actual workflow, taking a list of typed inputs and producing a list of outputs.
Executions: Given a script and a complete set of inputs, the workflow can be executed and its outputs generated.
Libraries: A collection of WhizzML definitions that can be imported by other libraries or scripts.
Scripts
Creating scripts
● Usable by any binding (from any language)
● Built-in parallelization
● BigML resource management as primitives of the language
● Complete programming language for workflow definition
Using scripts
Web UI
Bindings
BigMLer
WhizzML
Advanced WhizzML workflows
WhizzML offers:
● Primitives for all ML resources (datasets, models, clusters, etc.)
● A complete programming language to compose these ML resources at will.
● Built-in parallelization and scalability.
This empowers the user to benefit from:
● Automated feature engineering: best-first feature selection.
● Automated configuration choice: randomized parameter optimization, SMACdown.
● Complex algorithms as 1-click: stacked generalization, boosting.
All of them can be shared, reproduced and reused as one more BigML resource in a language-agnostic way.
Best-first feature selection
[Figure: candidate field sets built from f1 ... fn are scored at each step]
Step 1: Selected fields: (). The best score is obtained for the model with (f5).
Step 2: Selected fields: (f5). The best score is obtained for the model with (f5 f7).
Following iterations don't improve the score for the model with (f5 f7), so the process stops.
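The stepwise procedure described on this slide can be sketched as a greedy loop. This is a simplification for illustration, not the WhizzML script itself: it adds one field per step and stops at the first step without improvement. The score callback and the toy scoring table are hypothetical stand-ins for training and evaluating a model on the selected fields.

```python
def best_first_selection(fields, score):
    """Greedily add the field that most improves score(selected); stop when none does."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in fields if f not in selected]
        if not candidates:
            return selected, best
        # Score every candidate extension of the current selection.
        scored = [(score(selected + [f]), f) for f in candidates]
        top_score, top_field = max(scored)
        if top_score <= best:  # no candidate improves the score
            return selected, best
        selected, best = selected + [top_field], top_score

# Toy scoring: f5 and f7 are useful, every extra field costs one point.
usefulness = {"f5": 60, "f7": 20}

def toy_score(selection):
    return sum(usefulness.get(f, 0) for f in selection) - len(selection)

selected_fields, best_score = best_first_selection(["f1", "f5", "f7", "f9"], toy_score)
```

With this toy scoring the loop mirrors the slide: f5 is picked first, then f7, and the third iteration brings no improvement.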
Stacked generalization
[Figure: 50% of the data is held out]
A new dataset is generated with the predictions for the hold-out data
A new metamodel is created from this dataset
Randomized parameter optimization
[Figure: a random configuration generator produces configurations; the best score is kept]
The process stops when you reach the expected performance or the user-given iteration limit
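A minimal sketch of this loop with the two stopping conditions named on the slide. The function random_search, the search space and the evaluate stand-in are all hypothetical; in practice evaluate would train and score a model for the given configuration.

```python
import random

def random_search(evaluate, space, target, max_iterations, seed=0):
    """Sample random configurations until target is reached or iterations run out."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_iterations):
        # Draw one value per parameter from its list of allowed values.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
        if best_score >= target:  # expected performance reached
            break
    return best_config, best_score

# Hypothetical search space and stand-in scoring function.
space = {"max_depth": [2, 4, 8], "n_trees": [10, 50, 100]}

def evaluate(config):
    return 1.0 - abs(config["max_depth"] - 4) / 10 - abs(config["n_trees"] - 50) / 1000

best_config, best_score = random_search(evaluate, space, target=0.99, max_iterations=50)
```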
SMACdown
[Figure: a random configuration generator produces candidate configurations]
New configurations are filtered according to the predictions of a model of their performance
Only promising configurations are analyzed
Boosting
[Figure: sequence of models labelled T0/F0, T1/F1, T2/F2, ... T8/F8]
The final model is an ensemble of models
Script it once, for everybody anywhere
Publish scripts in the gallery
Add scripts to your menus

BSSML16 L10. Summary Day 2 Sessions
