Class summary
BigML, Inc 2
Day 2 – Morning sessions
Basic transformations
Poul Petersen
Expectations: Any data is always ML-ready
Reality: ML-ready data needs work!!!
What does ML-ready mean?
● Machine Learning algorithms consume instances of the question that
you want to model. Each row must describe one of the instances and
each column a property of the instance
● Fields can be:
– already present in your data
– derived from your data
– generated using other fields
Basic transformations
ML-ready steps
● Define your goal and select the right model for the problem you want
to solve: classification, regression, cluster analysis, anomaly
detection, association discovery
● Perform cleansing, denormalizing, aggregating, pivoting, and
other data wrangling tasks to generate a collection of instances
relevant to the problem at hand. Finally, use a very common format as
the output format: CSV
● Choose the right format to store each type of feature in a field
● Feature engineering: Using domain knowledge and Machine
Learning expertise, generate explicit features that help to better
represent the instances (Flatline)
Basic transformations
Preprocessing data
Cleansing: Homogenize missing values and different types in the
same feature, fix input errors, correct semantic issues, etc.
Denormalizing: Data is usually normalized in relational databases;
ML-ready datasets need the information denormalized into a single
file/dataset.
Aggregation: When data is stored as individual transactions, as in log
files, the transactions might need to be aggregated per entity.
Pivoting: Different values of a feature are pivoted to new columns in
the resulting dataset.
Regular time windows: Create new features using values over
different periods of time.
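The aggregation and pivoting steps can be sketched in a few lines of plain Python. This is only an illustration: the transaction log, the field names and the `aggregate_and_pivot` helper are hypothetical and not part of the original slides.

```python
from collections import defaultdict

# Hypothetical transaction log: one row per purchase, not per customer.
transactions = [
    {"customer": "a", "category": "food", "amount": 10.0},
    {"customer": "a", "category": "books", "amount": 5.0},
    {"customer": "b", "category": "food", "amount": 7.5},
    {"customer": "a", "category": "food", "amount": 2.5},
]


def aggregate_and_pivot(rows):
    """Return one instance per customer, with one column per category."""
    categories = sorted({row["category"] for row in rows})
    # Aggregation: sum the amounts of all transactions of each customer.
    totals = defaultdict(lambda: dict.fromkeys(categories, 0.0))
    for row in rows:
        totals[row["customer"]][row["category"]] += row["amount"]
    # Pivoting: each category value becomes its own column.
    return [
        {"customer": customer, **{f"total_{c}": sums[c] for c in categories}}
        for customer, sums in sorted(totals.items())
    ]


instances = aggregate_and_pivot(transactions)
```

Each resulting row describes one customer (the instance), which is the shape an ML-ready dataset needs.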
Basic transformations
Preparation tasks
● Define a clear idea of the goal.
● Understand what ML tasks will achieve the goal.
● Understand the data structure needed to perform those ML tasks.
● Find out what kind of data you have and make it ML-ready
– where is it, how is it stored?
– what are the features?
– can you access it programmatically?
● Feature Engineering: transform the data you have into the
data you actually need.
● Evaluate: Try it on a small scale
● Accept that you might have to start over...
● But when it works, automate it!!!
Feature Engineering
Adding some domain knowledge to your data by creating new
predicates from the existing features to help ML algorithms
What do ML algorithms know about your fields?
● Numeric: contain sequences of numbers (no idea about odd/even, prime, etc.)
● Date-time: contain a timestamp (no idea about weekends, special holidays or
seasons)
● Categorical: contain an enumeration of values (no relations between them)
● Text/Items: contain terms (no relations between them)
Features can be useless to the algorithm if:
● They are not correlated to the objective to be predicted
● Their values change their meaning when combined with other features
For ML Algorithms to work there must be some kind of statistical relation between
some of the features and the objective. Sometimes, you must transform the
available features to find such relations
Feature Engineering
When do you need Feature Engineering?
● When the relationship between the feature and the
objective is mathematically unsatisfying
● When the relationship of a function of two or more features
with the objective is far more relevant than the one of the
original features
● When there is missing data
● When the data is time-series, especially when the previous
time period’s objective is known
● When the data can’t be used for machine learning in the
obvious way (e.g., timestamps, text data)
Feature Engineering
For numeric features:
– Discretization: percentiles, within percentiles, groups
– Replacement of missing values
– Normalization
– Exponentiation, logarithms, etc.
– Casting to categorical, integer or real
– Statistics
– Shocks (speed of change compared to stdev)
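A few of the numeric transformations listed above, sketched with the Python standard library. The function names are hypothetical, and the "shock" definition (each change divided by the standard deviation of all changes) is just one possible reading of the slide.

```python
import statistics


def replace_missing(values, default=None):
    """Replace None with a default (the mean of the present values if not given)."""
    present = [v for v in values if v is not None]
    fill = statistics.mean(present) if default is None else default
    return [fill if v is None else v for v in values]


def normalize(values):
    """Z-score: center on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]


def discretize(values, bins=4):
    """Label each value with the index of its quantile bin (0 .. bins - 1)."""
    cuts = statistics.quantiles(values, n=bins)
    return [sum(v > cut for cut in cuts) for v in values]


def shocks(values):
    """Step-to-step change measured in standard deviations of all changes."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    stdev = statistics.pstdev(deltas)
    return [d / stdev for d in deltas]
```

For instance, `discretize(values, bins=4)` assigns each value to a quartile, which turns a numeric field into a small ordered set of groups.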
For text features:
– Misspellings
– Length
– Number of subordinate sentences
– Language
– Levenshtein distance
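As an illustration of the last item, the Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another) can be computed with the classic dynamic-programming recurrence. The function name is ours; the slides do not prescribe an implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only one row of the table."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

A feature such as `levenshtein(text, reference_word)` can flag likely misspellings of a known term.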
Feature Engineering
Date-time features
● Cannot be used “as is” in a model. They are a collection of features. BigML is able to
decompose them automatically when they are provided in the most usual
formats. With Flatline, you can decompose them all.
● Date-time predicates that the computer does not know (some of them, domain
dependent): Working hours? Daylight? Is rush hour?...
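A minimal sketch of decomposing a timestamp and adding domain-dependent predicates. The timestamp format, the working-hours range (9 to 17 on weekdays) and the rush-hour definition are assumptions made for the example, not values from the slides.

```python
from datetime import datetime


def datetime_features(timestamp: str) -> dict:
    """Split a timestamp into components plus a few hand-made predicates."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    weekday = moment.isoweekday()  # 1 = Monday ... 7 = Sunday
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "day_of_week": weekday,
        "is_weekend": weekday >= 6,
        # Assumed definitions; adapt them to the problem domain.
        "is_working_hours": weekday <= 5 and 9 <= moment.hour < 17,
        "is_rush_hour": weekday <= 5 and moment.hour in (8, 9, 17, 18),
    }
```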
Text features
● Bag of words: a new feature is associated to each word in the document
(built-in in BigML)
● Tokenization: how do we select tokens? Do we want n-grams? What about
numbers?
● Stemming: grouping forms of the same word in a unique term
● Length
● Text predicates: Dollar amounts? Dates? Salutations? Please and Thank you?
Feature Engineering
Time-series transformations
● Better objective (percent change instead of absolute
values)
● Deltas from previous reference time points
● Deltas from moving average (time windows)
● Recent volatility...
Problem: Exponential explosion of possible transformations
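Two of these transformations, sketched for a plain list of values ordered in time. The function names and the window convention (the mean of the `window` points before the current one) are assumptions for the example.

```python
def percent_change(series):
    """Relative change of each point with respect to the previous one."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


def delta_from_moving_average(series, window):
    """Difference between each point and the mean of the `window` points before it."""
    return [
        series[i] - sum(series[i - window:i]) / window
        for i in range(window, len(series))
    ]
```

Every choice of window size and reference point yields a new candidate feature, which is where the explosion of possible transformations comes from.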
Ensembles and Logistic Regressions
Logistic Regression is a classification ML Algorithm
How come we use a regression to classify?
● Regressions are typically used to
relate two numeric variables
● But using the proper function we
can relate discrete variables too
Ensembles and Logistic Regressions
Assumption: The output is linearly related to the
predictors.
● We should use feature engineering to transform raw
features into linearly related predictors, if needed.
● The ML algorithm searches for the coefficients that
solve the problem
by transforming it into a linear regression problem
In general, the algorithm will find a coefficient per
feature plus a bias coefficient and a missing
coefficient
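A minimal sketch of how a fitted logistic regression turns coefficients into a class probability, and why the problem is linear underneath: the log-odds of the probability equal the linear combination of the predictors. The function names and coefficient values are hypothetical, and the missing coefficient mentioned above is left out for brevity.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, coefficients, bias):
    """Probability of the positive class for one instance."""
    z = bias + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)


def log_odds(p):
    """Inverse of the logistic function: recovers the linear term."""
    return math.log(p / (1.0 - p))
```

For example, `log_odds(predict_probability([1.0, 2.0], [0.5, -0.25], 0.1))` gives back the linear term `0.1 + 0.5 * 1.0 - 0.25 * 2.0`.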
Ensembles and Logistic Regressions
Configuration parameters
Default numeric: Replaces missing numeric values.
Missing numeric: Adds a field for missing numerics.
Bias: Allows an intercept term. Important if P(x=0) != 0
Regularization:
– L1: prefers zeroing individual coefficients
– L2 (default): prefers pushing all coefficients towards zero
– Strength “C”: Higher values reduce regularization.
EPS: The minimum error between steps to stop.
Auto-scaling: Ensures that all features contribute equally.
Recommended, unless there is a specific need not to auto-scale.
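To see why higher values of C reduce regularization, consider one common convention (used by liblinear-style solvers and assumed here, not confirmed for BigML): the solver minimizes the penalty on the coefficients plus C times the data loss. The function names below are hypothetical.

```python
def l1_penalty(coefficients):
    """Sum of absolute values: tends to drive individual coefficients to zero."""
    return sum(abs(c) for c in coefficients)


def l2_penalty(coefficients):
    """Half the sum of squares: shrinks all coefficients towards zero."""
    return 0.5 * sum(c * c for c in coefficients)


def objective(loss, coefficients, c=1.0, penalty=l2_penalty):
    """Assumed form: penalty + C * loss. A larger C gives the data loss more weight."""
    return penalty(coefficients) + c * loss
```

With a large C the penalty term becomes negligible next to the loss, so the fit follows the training data more closely.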
Ensembles and Logistic Regressions
Extending the domain for the algorithm
● Multi-class LR: Each class has its own LR computed as
a binary problem (one-vs-the-rest). A set of coefficients is
computed for each class.
● Non-numeric predictors: As LR works for numeric
predictors, the algorithm needs to encode
the non-numeric features to be able to use them. These
are the field encodings.
– Categorical: one-hot, dummy encoding, contrast
encoding
– Text and Items: frequencies of terms
● Curvilinear LR: adding quadratic features as new features
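The three extensions can each be sketched in a line or two. The helper names are hypothetical and only one-hot encoding is shown among the categorical encodings.

```python
def one_hot(value, categories):
    """One numeric column per category: 1.0 for the matching one, 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


def one_vs_rest_labels(labels, positive_class):
    """Binary objective used to fit the LR of a single class."""
    return [1 if label == positive_class else 0 for label in labels]


def add_quadratic(features):
    """Curvilinear LR: append the square of each feature as a new feature."""
    return list(features) + [x * x for x in features]
```

A three-class problem would call `one_vs_rest_labels` once per class and fit one set of coefficients for each resulting binary objective.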
Ensembles and Logistic Regressions
Logistic Regressions versus Decision Trees
Logistic Regressions:
● Expects a "smooth" linear
relationship with predictors
● LR is concerned with probability
of a discrete outcome.
● Lots of parameters to get
wrong: regularization, scaling,
codings
● Slightly less prone to over-fitting
● Because it fits a shape, might
work better when less data
available if it fulfills the expected
linear relationship.
Decision Trees:
● Adapts well to ragged non-linear
relationships
● No concern:
classification, regression,
multi-class all fine.
● Virtually parameter free
● Slightly more prone to over-fitting
● Prefers surfaces parallel to
parameter axes, but given
enough data will discover
any shape.
Time series and Deepnets
Deepnets are also a classifier (supervised learning)
The goal is again predicting a classification
Compared to the other classifiers
● Shares the massive predictive power of decision
trees and ensembles
● Some smooth, multivariate functions are not a
problem (like in LR)
● Can improve some of their cons
But...
● Need massive data to learn every coefficient in a
massive parameter space
Time series and Deepnets
Deepnets cons
● Low efficiency: The right structure for given data is
not easily found, and most structures are bad
● Difficult interpretability: Nothing like the
interpretability of trees.
When are they not so useful?
● Small data
● Problems that need quick iteration
● Problems that are easy or not so performance-demanding
Time series and Deepnets
Time series are supervised learning models able to
extrapolate to the future the patterns learnt from data in the
past
● The time series model solves a forecast problem
● The training data must be a sequence of data points evenly
distributed in time (so the order of the rows is
important!)
● The goal is predicting numeric properties in the
future based on past behaviour.
Time series and Deepnets
The resulting family of models uses exponential smoothing
to fit the past training data and generate the different
components of the solution:
● Trend: the slope between two consecutive points in time
● Seasonality: a periodically recurring pattern of variation
● Error: variations that cannot be described by trend or
seasonality
Each of those can contribute in an additive or multiplicative
way to the particular model.
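As a small illustration of exponential smoothing with an additive trend and no seasonality, here is a sketch of Holt's linear method. It is a single member of the family, not the full set of ETS models; the initialization (first value as level, first difference as trend) and the smoothing weights `alpha` and `beta` are assumptions of the example.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with additive level and trend."""
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # New level: blend the observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (level + trend)
        # New trend: blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]
```

On a perfectly linear series the method simply continues the line, for example `holt_forecast([1.0, 2.0, 3.0, 4.0, 5.0], 0.5, 0.5, 2)` extends it by two more points.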
Time series and Deepnets
Each additive or multiplicative combination of these
components generates a different model. Which is the
best? There are some error metrics:
● AIC: Akaike Information Criterion
● AICc: Corrected Akaike Information Criterion
● BIC: Schwarz Bayesian Information Criterion
● R-squared
And finally, they can be evaluated. Watch out! You need
linear train/test splits to maintain the sequence order
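A sketch of a linear (non-shuffled) split together with the standard AIC and BIC formulas. The function names and the 80% default are assumptions; the log-likelihood is expected to come from whatever model was fitted.

```python
import math


def linear_split(series, train_fraction=0.8):
    """Keep the first rows for training and the last ones for testing, in order."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


def bic(log_likelihood, n_parameters, n_points):
    """Schwarz Bayesian Information Criterion: penalizes parameters by log(n)."""
    return n_parameters * math.log(n_points) - 2 * log_likelihood
```

A random split would mix future rows into the training data, which is why the cut is made at a single point in time.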
Time series and Deepnets
Time series outputs
● Forecast of one or many numeric features for a
user-given horizon using all possible ETS models
● The error intervals associated with these forecasts
Day 2 – Evening sessions
REST API, bindings and basic workflows
jao (José Antonio Ortega)
What do Machine Learning workflows look like?
[Figure: Academics vs. Real world]
We need high-level tools to face real-world workflows by growing in:
● Automation
● Abstraction
BigML, Inc 25
REST API, bindings and basic workflows
The foundations
● REST API first applications: Standards in software development.
First level of abstraction
Client side tools
● Web UI: Sitting on top of the REST API. Human-friendly access and
visualizations for all the Machine Learning resources. Workflows must
be defined and executed step by step. Second level of abstraction.
● Bindings: Sitting on top of the REST API. Fine-grained accessors for
the REST API calls. Workflows must be defined and executed step by
step. Second level of abstraction.
● BigMLer: Relying on the bindings. High-level syntax. Entire workflows
can be created in only one command line. Third level of abstraction.
REST API, bindings and basic workflows
BigMLer automation
● Basic 1-click workflows in one command line
● Rich parameterized workflows: feature selection, cross-validation, etc.
● Models are downloaded to your laptop, tablet, cell phone, etc. once
and can be used offline to create predictions
Great for local predictions
Still...
REST API, bindings and basic workflows
Problems of client-side solutions
● Complexity: Lots of details outside the problem domain
● Reuse: No inter-language compatibility
● Scalability: Client-side workflows hard to optimize
● Extensibility: BigMLer hides complexity at the cost of flexibility
● Not enough abstraction
REST API, bindings and basic workflows
Solution: bringing automation and abstraction to the server-side
WhizzML
● DSL for ML workflow automation
● Framework for scalable, remote execution of ML workflows
Sophisticated server-side optimization
Out-of-the-box scalability
Client-server brittleness removed
Infrastructure for creating and sharing ML scripts and libraries
REST API, bindings and basic workflows
WhizzML's new REST API resources:
Scripts: Executable code that describes an actual
workflow, taking a list of typed inputs and producing
a list of outputs.
Executions: Given a script and a complete set of
inputs, the workflow can be executed and its outputs
generated.
Libraries: A collection of WhizzML definitions that
can be imported by other libraries or scripts.
REST API, bindings and basic workflows
Scripts
Creating scripts
● Usable by any binding (from any language)
● Built-in parallelization
● BigML resources management as primitives of the language
● Complete programming language for workflow definition
Using scripts: Web UI, Bindings, BigMLer, WhizzML
Advanced WhizzML workflows
Charles Parker
WhizzML offers:
● Primitives for all ML resources (datasets, models, clusters, etc.)
● A complete programming language to compose these ML
resources at will.
● Parallelization and Scalability built-in.
This empowers the user to benefit from:
● Automated feature engineering: Best-first feature selection.
● Automated configuration choice: Randomized parameter
optimization, SMACdown.
● Complex algorithms as 1-click: Stacked generalization, Boosting.
All of them can be shared, reproduced and reused as
one more BigML resource in a language-agnostic way.
Advanced WhizzML workflows
[Figure: fields selected at each iteration]
Following iterations don't improve the score for the model
with (f5 f7), so the process stops
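The stopping rule described here can be sketched as a greedy forward selection, a simplified stand-in for the best-first feature selection mentioned earlier (not the actual WhizzML script). The `score` callable stands for "train and evaluate a model on these fields"; `toy_score` is a made-up scoring function used only to make the example runnable.

```python
def best_first_selection(fields, score):
    """Add the best-scoring field at each iteration until nothing improves."""
    selected, best_score = [], float("-inf")
    remaining = list(fields)
    while remaining:
        top_score, top_field = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best_score:
            break  # no candidate improves the score, so the process stops
        selected.append(top_field)
        remaining.remove(top_field)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical evaluation: only f5 and f7 help, every field has a small cost."""
    useful = {"f5", "f7"}
    return len(useful & set(subset)) - 0.1 * len(subset)


fields = [f"f{i}" for i in range(1, 9)]
selected, best = best_first_selection(fields, toy_score)
```

With this toy score the search keeps only `f5` and `f7`, because adding any other field lowers the score.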
Advanced WhizzML workflows
Stacked generalization
Advanced WhizzML workflows
Randomized parameter optimization
Process stops when you reach the expected performance
or the user-given iteration limit
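A minimal sketch of that loop with both stopping conditions. The `evaluate` callable stands for "build and evaluate a model with these parameters"; the parameter space, the `target` threshold and the function name are hypothetical.

```python
import random


def random_search(evaluate, space, max_iterations=50, target=None, seed=0):
    """Sample random configurations until the target score or the limit is reached."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(max_iterations):
        # Draw one value per parameter, uniformly from its list of candidates.
        params = {name: rng.choice(values) for name, values in space.items()}
        result = evaluate(params)
        if result > best_score:
            best_params, best_score = params, result
        if target is not None and best_score >= target:
            break  # expected performance reached
    return best_params, best_score
```

A call could look like `random_search(evaluate, {"depth": [2, 4, 8], "trees": [10, 50, 100]}, target=0.95)`, where the parameter names are placeholders for whatever the model accepts.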
Advanced WhizzML workflows
Boosting
[Figure: boosting diagram with elements T0/F0, T1/F1, T2/F2, …, T8/F8]
The final model is an ensemble of models
Advanced WhizzML workflows
Script it once, for everybody anywhere
Publish scripts in the gallery
Add scripts to your menus

Machine Learning in Retail: ML in the Retail Sector
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
 

Recently uploaded

2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Calllward7
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 
Data analytics courses in Nepal Presentation
Data analytics courses in Nepal PresentationData analytics courses in Nepal Presentation
Data analytics courses in Nepal Presentationanshikakulshreshtha11
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfscitechtalktv
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理cyebo
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfEmmanuel Dauda
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理pyhepag
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 

Recently uploaded (20)

2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Data analytics courses in Nepal Presentation
Data analytics courses in Nepal PresentationData analytics courses in Nepal Presentation
Data analytics courses in Nepal Presentation
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 

VSSML17 Review. Summary Day 2 Sessions

  • 2. BigML, Inc 2 Day 2 – Morning sessions
  • 3. BigML, Inc 3 Basic transformations (Poul Petersen) Expectations: Any data is always ML-ready. Reality: ML-ready data needs work!!! What does ML-ready mean? ● Machine Learning algorithms consume instances of the question that you want to model. Each row must describe one of the instances and each column a property of the instance ● Fields can be: – already present in your data – derived from your data – generated using other fields
  • 4. BigML, Inc 4 Basic transformations ● Define your goal and select the right model for the problem you want to solve: Classification, regression, cluster analysis, anomaly detection, association discovery ● Perform cleansing, denormalizing, aggregating, pivoting, and other data wrangling tasks to generate a collection of instances relevant to the problem at hand. Finally use a very common format as output format: CSV ● Choose the right format to store each type of feature into a field ● Feature engineering: Using domain knowledge and Machine Learning expertise, generate explicit features that help to better represent the instances (Flatline) ML-ready steps
  • 5. BigML, Inc 5 Basic transformations Cleansing: Homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, etc. Denormalizing: Data is usually normalized in relational databases, ML-Ready datasets need the information de-normalized in a single file/dataset. Aggregation: When data is stored as individual transactions, as in log files, an aggregation to get the entity might be needed Pivoting: Different values of a feature are pivoted to new columns in the result dataset Regular time windows: Create new features using values over different periods of time Preprocessing data
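A minimal sketch of the aggregation and pivoting steps described above, using only the Python standard library. The transaction log, the field names (`customer`, `category`, `amount`) and the helper `aggregate_and_pivot` are hypothetical and are not taken from the slides; the point is only to show individual transactions turned into one instance per entity, with one column per category value.

```python
from collections import defaultdict

# Hypothetical transaction log: one row per purchase, not per customer.
transactions = [
    {"customer": "c1", "category": "books", "amount": 12.0},
    {"customer": "c1", "category": "music", "amount": 8.0},
    {"customer": "c2", "category": "books", "amount": 20.0},
    {"customer": "c1", "category": "books", "amount": 5.0},
]


def aggregate_and_pivot(rows):
    """Return one instance per customer with a spend column per category."""
    categories = sorted({row["category"] for row in rows})
    # Aggregation: sum the amounts per (customer, category) pair.
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row["customer"]][row["category"]] += row["amount"]
    # Pivoting: each category value becomes its own column.
    instances = []
    for customer in sorted(totals):
        instance = {"customer": customer}
        for category in categories:
            instance["spend_" + category] = totals[customer][category]
        instances.append(instance)
    return instances


instances = aggregate_and_pivot(transactions)
for instance in instances:
    print(instance)
```

Each resulting dictionary is one row of an ML-ready dataset and could be written out as CSV.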
  • 6. BigML, Inc 6 Basic transformations ● Define a clear idea of the goal. ● Understand what ML tasks will achieve the goal. ● Understand the data structure to perform those ML tasks. ● Find out what kind of data you have and make it ML-Ready – where is it, how is it stored? – what are the features? – can you access it programmatically? ● Feature Engineering: transform the data you have into the data you actually need. ● Evaluate: Try it on a small scale ● Accept that you might have to start over…. ● But when it works, automate it!!! Preparation tasks
  • 7. BigML, Inc 7 Feature Engineering Adding some domain knowledge to your data by creating new predicates from the existing features to help ML algorithms What do ML algorithms know about your fields? ● Numeric: contain sequences of numbers (no idea about odd/even, prime, etc.) ● Date-time: contain a timestamp (no idea about weekends, special holidays or seasons) ● Categorical: contain an enumeration of values (no relations between them) ● Text/Items: contain terms (no relations between them) Features can be useless to the algorithm if: ● They are not correlated to the objective to be predicted ● Their values change their meaning when combined with other features For ML Algorithms to work there must be some kind of statistical relation between some of the features and the objective. Sometimes, you must transform the available features to find such relations
  • 8. BigML, Inc 8 Feature Engineering When do you need Feature Engineering? ● When the relationship between the feature and the objective is mathematically unsatisfying ● When the relationship of a function of two or more features with the objective is far more relevant than the one of the original features ● When there is missing data ● When the data is time-series, especially when the previous time period’s objective is known ● When the data can’t be used for machine learning in the obvious way (e.g., timestamps, text data)
  • 9. BigML, Inc 9 Feature Engineering For numeric features: – Discretization: percentiles, within percentiles, groups – Replacement of missings – Normalization – Exponentiation, logarithms, etc. – Casting to categorical, integer or real – Statistics – Shocks (speed of change compared to stdev) For text features: – Misspellings – Length – Number of subordinate sentences – Language – Levenshtein distance
  • 10. BigML, Inc 10 Feature Engineering Date-time features ● Cannot be used “as is” in a model. Each one is a collection of features. BigML is able to decompose them automatically when they are provided in the most usual formats. With Flatline, you can decompose them all. ● Date-time predicates that the computer does not know (some of them, domain dependent): Working hours? Daylight? Is rush hour?... Text features ● Bag of words: a new feature is associated with each word in the document (built-in in BigML) ● Tokenization: how do we select tokens? Do we want n-grams? What about numbers? ● Stemming: grouping forms of the same word in a unique term ● Length ● Text predicates: Dollar amounts? Dates? Salutations? Please and Thank you?
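As an illustration of decomposing a date-time field and adding domain predicates, here is a small standard-library sketch. It is not BigML's or Flatline's implementation; the function name `datetime_features`, the timestamp format and the 9-to-17 working-hours window are assumptions chosen for the example.

```python
from datetime import datetime


def datetime_features(timestamp, work_start=9, work_end=17):
    """Decompose a timestamp string and add two domain predicates.

    The format string and the working-hours window are assumptions.
    """
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    is_weekend = moment.weekday() >= 5  # Monday is 0, Sunday is 6
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "day_of_week": moment.weekday(),
        "is_weekend": is_weekend,
        "is_working_hours": (not is_weekend)
        and work_start <= moment.hour < work_end,
    }


features = datetime_features("2017-09-15 10:30:00")
print(features)
```

Predicates such as "is rush hour" would follow the same pattern, with thresholds that depend on the domain.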
  • 11. BigML, Inc 11 Feature Engineering Time-series transformations ● Better objective (percent change instead of absolute values) ● Deltas from previous reference time points ● Deltas from moving average (time windows) ● Recent Volatility... Problem: Exponential explosion of possible transformations
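Two of the time-series transformations listed above can be sketched in a few lines of plain Python. The function names, the sample `prices` list and the choice to emit `None` where a value is undefined are assumptions made for this example.

```python
def percent_change(values):
    """Percent change from the previous point; None where undefined."""
    changes = [None]
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            changes.append(None)
        else:
            changes.append(100.0 * (current - previous) / previous)
    return changes


def delta_from_moving_average(values, window):
    """Value minus the mean of the preceding `window` points."""
    deltas = []
    for index, value in enumerate(values):
        if index < window:
            deltas.append(None)  # not enough history yet
        else:
            past = values[index - window:index]
            deltas.append(value - sum(past) / window)
    return deltas


prices = [100.0, 110.0, 99.0, 120.0]
print(percent_change(prices))
print(delta_from_moving_average(prices, window=2))
```

Every choice of window size or reference point yields another candidate feature, which is where the exponential explosion mentioned on the slide comes from.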
  • 12. BigML, Inc. 12 Ensembles and Logistic Regressions Logistic Regression is a classification ML Algorithm How come we use a regression to classify? ● Regressions are typically used to relate two numeric variables ● But using the proper function we can relate discrete variables too
  • 13. BigML, Inc. 13 Ensembles and Logistic Regressions Assumption: The output is linearly related to the predictors. ● We should use feature engineering to transform raw features into linearly related predictors, if needed. ● The ML algorithm searches for the coefficients to solve the problem by transforming it into a linear regression problem. In general, the algorithm will find a coefficient per feature plus a bias coefficient and a missing coefficient
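To make the role of the coefficients concrete, the following sketch computes the probability that a fitted logistic regression would assign to the positive class. It assumes the coefficients and bias have already been learned, and it leaves out the missing-value coefficient mentioned on the slide; the function name and the numbers are illustrative only.

```python
import math


def logistic_probability(features, coefficients, bias):
    """Probability of the positive class for one instance.

    The linear score (bias plus one coefficient per feature) is passed
    through the logistic function to map it into the (0, 1) range.
    """
    score = bias + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical, already fitted coefficients for two numeric features.
probability = logistic_probability([1.0, 0.5], coefficients=[1.0, 2.0], bias=0.0)
print(round(probability, 4))
```

Training consists of searching for the coefficients that best fit the labelled data; only the prediction step is shown here.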
  • 14. BigML, Inc. 14 Ensembles and Logistic Regressions Configuration parameters Default numeric: Replaces missing numeric values. Missing numeric: Adds a field for missing numerics. Bias: Allows an intercept term. Important if P(x=0) != 0 Regularization: L1: prefers zeroing individual coefficients L2 (default): prefers pushing all coefficients towards zero Strength “C”: Higher values reduce regularization. EPS: The minimum error between steps to stop. Auto-scaling: Ensures that all features contribute equally. Recommended, unless there is a specific need to not auto-scale.
  • 15. BigML, Inc. 15 Ensembles and Logistic Regressions Extending the domain for the algorithm ● Multi-class LR: Each class has its own LR computed as a binary problem (one-vs-the-rest). A set of coefficients is computed for each class. ● Non-numeric predictors: As LR works for numeric predictors, the algorithm needs to do some encoding of the non-numeric features to be able to use them. These are the field-encodings. – Categorical: one-hot, dummy encoding, contrast encoding – Text and Items: frequencies of terms ● Curvilinear LR: adding quadratic features as new features
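The one-hot encoding and one-vs-the-rest ideas above can be sketched as follows. This is an illustration under assumptions, not BigML's code: the per-class coefficients in `models` are made up, and the helper names are hypothetical.

```python
import math


def one_hot_encode(value, categories):
    """Encode a categorical value as one 0/1 column per category."""
    return [1 if value == category else 0 for category in categories]


def one_vs_rest_predict(features, models):
    """Pick the class whose binary logistic model is most confident.

    `models` maps each class label to a (coefficients, bias) pair.
    """
    probabilities = {}
    for label, (coefficients, bias) in models.items():
        score = bias + sum(c * x for c, x in zip(coefficients, features))
        probabilities[label] = 1.0 / (1.0 + math.exp(-score))
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities


colors = ["red", "green", "blue"]
encoded = one_hot_encode("green", colors)

# Hypothetical coefficient sets, one binary model per class.
models = {
    "likes_red": ([2.0, 0.0, 0.0], 0.0),
    "likes_green": ([0.0, 2.0, 0.0], 0.0),
}
label, probabilities = one_vs_rest_predict(encoded, models)
print(encoded, label)
```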
  • 16. BigML, Inc. 16 Ensembles and Logistic Regressions Logistic Regressions versus Decision Trees Logistic Regressions: ● Expects a "smooth" linear relationship with predictors ● LR is concerned with probability of a discrete outcome. ● Lots of parameters to get wrong: regularization, scaling, codings ● Slightly less prone to over-fitting ● Because it fits a shape, might work better when less data is available if it fulfills the expected linear relationship. Decision Trees: ● Adapts well to ragged non-linear relationships ● No concern: classification, regression, multi-class all fine. ● Virtually parameter free ● Slightly more prone to over-fitting ● Prefers surfaces parallel to parameter axes, but given enough data will discover any shape.
  • 17. BigML, Inc. 17 Time series and Deepnets Deepnets are also a classifier (supervised learning) The goal is again predicting a classification Compared to the other classifiers ● Shares the massive predictive power of decision trees and ensembles ● Some smooth, multivariate functions are not a problem (like in LR) ● Can improve some of their cons But... ● Need massive data to learn every coefficient in a massive parameter space
  • 18. BigML, Inc. 18 Time series and Deepnets Deepnets cons ● Low efficiency: The right structure for given data is not easily found, and most structures are bad ● Difficult interpretability: Nothing like the interpretability of trees. When are they not so useful? ● Small data ● Problems that need quick iteration ● Problems that are easy or not so performance demanding
  • 19. BigML, Inc. 19 Time series and Deepnets Time series are supervised learning models able to extrapolate to the future the patterns learnt from data in the past ● The time series model solves a forecast problem ● The training data must be a temporal sequence of identically distributed data (so the order of the rows is important!) ● The goal is predicting numeric properties in the future based on past behaviour.
  • 20. BigML, Inc. 20 The resulting family of models use exponential smoothing to fit the past training data and generate the different components of the solution: ● Trend: the slope between two consecutive points in time ● Seasonality: periodically recurrent pattern of variation ● Error: variations that cannot be described by trend or seasonality Each of those can contribute in an additive or multiplicative way to the particular model. Time series and Deepnets
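As a rough illustration of the exponential smoothing idea, the sketch below implements only the simplest member of the family: a level-only model with no trend or seasonality component. The function name and the smoothing factor `alpha` are assumptions for the example; the full ETS models described on the slide add further components.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed level after each observation.

    Each new level is a weighted average of the latest observation and
    the previous level. The last level serves as the flat forecast for
    every future step in this level-only model.
    """
    level = values[0]
    levels = [level]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


levels = simple_exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5)
forecast = levels[-1]
print(levels, forecast)
```

A higher `alpha` makes the level react faster to recent observations, while a lower one smooths more heavily.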
  • 21. BigML, Inc. 21 Each additive or multiplicative combination of these components generates a different model. Which is the best? There are some error metrics: ● AIC: Akaike Information Criterion ● AICc: Corrected Akaike Information Criterion ● BIC: Schwarz Bayesian Information Criterion ● R-squared And finally, they can be evaluated: Watch out! You need linear train/test splits to maintain the sequence order Time series and Deepnets
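The linear split mentioned above differs from the usual random split in that it never shuffles the rows. A minimal sketch, with a hypothetical helper name and an assumed 80/20 proportion:

```python
def linear_split(rows, train_fraction=0.8):
    """Split ordered rows into an earlier training part and a later test part.

    No shuffling is done, so the temporal order is preserved and the
    model is always evaluated on data that comes after its training data.
    """
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


observations = list(range(10))  # stand-in for ten time-ordered rows
train, test = linear_split(observations)
print(train, test)
```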
  • 22. BigML, Inc. 22 ● Forecast of one or many numeric features for a user-given horizon using all possible ETS models ● The error intervals associated to these forecasts Time series and Deepnets Time series outputs
  • 23. BigML, Inc 23 Day 2 – Evening sessions
  • 24. BigML, Inc 24 REST API, bindings and basic workflows jao (José Antonio Ortega) What do Machine Learning workflows look like? (Academics versus real world) We need high-level tools to face the real world workflows by growing in: ● Automation ● Abstraction
  • 25. BigML, Inc 25 REST API, bindings and basic workflows The foundations ● REST API first applications: Standards in software development. First level of abstraction Client side tools ● Web UI: Sitting on top of the REST API. Human-friendly access and visualizations for all the Machine Learning resources. Workflows must be defined and executed step by step. Second level of abstraction. ● Bindings: Sitting on top of the REST API. Fine-grained accessors for the REST API calls. Workflows must be defined and executed step by step. Second level of abstraction. ● BigMLer: Relying on the bindings. High-level syntax. Entire workflows can be created in only one command line. Third level of abstraction.
  • 26. BigML, Inc 26 REST API, bindings and basic workflows BigMLer automation ● Basic 1-click workflows in one command line ● Rich parameterized workflows: feature selection, cross-validation, etc. ● Models are downloaded to your laptop, tablet, cell phone, etc. once and can be used offline to create predictions Still.. Great for local predictions
  • 27. BigML, Inc 27 REST API, bindings and basic workflows Problems of client-side solutions ● Complexity: Lots of details outside the problem domain ● Reuse: No inter-language compatibility ● Scalability: Client-side workflows hard to optimize ● Extensibility: BigMLer hides complexity at the cost of flexibility ● Not enough abstraction
  • 28. BigML, Inc 28 REST API, bindings and basic workflows Solution: bringing automation and abstraction to the server-side with WhizzML ● DSL for ML workflow automation ● Framework for scalable, remote execution of ML workflows: Sophisticated server-side optimization. Out-of-the-box scalability. Client-server brittleness removed. Infrastructure for creating and sharing ML scripts and libraries.
  • 29. BigML, Inc 29 REST API, bindings and basic workflows WhizzML's new REST API resources: Scripts: Executable code that describes an actual workflow, taking a list of typed inputs and producing a list of outputs. Executions: Given a script and a complete set of inputs, the workflow can be executed and its outputs generated. Libraries: A collection of WhizzML definitions that can be imported by other libraries or scripts.
  • 30. BigML, Inc 30 REST API, bindings and basic workflows Scripts Creating scripts ● Usable by any binding (from any language) ● Built-in parallelization ● BigML resources management as primitives of the language ● Complete programming language for workflow definition Using scripts Web UI Bindings BigMLer WhizzML
  • 31. BigML, Inc 31 Advanced WhizzML workflows Charles Parker WhizzML offers: ● Primitives for all ML resources: (datasets, models, clusters, etc.) ● A complete programming language to compose at will these ML resources. ● Parallelization and Scalability built-in. This empowers the user to benefit from: ● Automated feature engineering: Best-first feature selection. ● Automated configuration choice: Randomized parameter optimization, SMACdown. ● Complex algorithms as 1-click: Stacked generalization, Boosting. All of them can be shared, reproduced and reused as one more BigML resource in a language-agnostic way.
  • 32. BigML, Inc 32 Advanced WhizzML workflows Selected fields Following iterations don't improve the score for the model with (f5 f7), so the process stops
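The stopping rule described on this slide can be illustrated with a greedy forward search, which is a simplification of best-first feature selection and not the WhizzML script itself. The `score` callback stands in for "train a model on these fields and evaluate it"; the `toy_score` function and its weights are entirely made up so that the example ends on (f5 f7).

```python
def greedy_feature_selection(candidates, score):
    """Add one feature at a time, keeping the best, until no gain.

    `score` takes a list of feature names and returns a number where
    higher is better.
    """
    selected = []
    best_score = float("-inf")
    while True:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # further iterations do not improve the score
        selected.append(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score


def toy_score(features):
    """Hypothetical evaluation: f5 and f7 help, every field has a small cost."""
    useful = {"f5": 0.6, "f7": 0.3}
    return sum(useful.get(f, 0.0) for f in features) - 0.01 * len(features)


fields = ["f1", "f3", "f5", "f7", "f9"]
selected, final_score = greedy_feature_selection(fields, toy_score)
print(selected, round(final_score, 2))
```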
  • 33. BigML, Inc 33 Advanced WhizzML workflows Stacked generalization
  • 34. BigML, Inc 34 Advanced WhizzML workflows Process stops when you reach the expected performance or the user-given iterations limit Randomized parameter optimization
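The loop behind randomized parameter optimization, with the two stopping conditions named on the slide, might look like the sketch below. It is an assumption-laden illustration rather than the WhizzML script: `evaluate` stands in for training and evaluating a model, and the parameter names in `space` are hypothetical.

```python
import random


def random_search(evaluate, parameter_space, max_iterations, target_score, seed=0):
    """Try random configurations until the target or the limit is reached."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    best_parameters, best_score = None, float("-inf")
    for _ in range(max_iterations):
        parameters = {
            name: rng.choice(options) for name, options in parameter_space.items()
        }
        current = evaluate(parameters)
        if current > best_score:
            best_parameters, best_score = parameters, current
        if best_score >= target_score:
            break  # expected performance reached
    return best_parameters, best_score


def toy_evaluate(parameters):
    """Hypothetical score that peaks at depth 4 with 10 trees."""
    return -abs(parameters["depth"] - 4) - abs(parameters["trees"] - 10) / 10.0


space = {"depth": [2, 4, 8], "trees": [5, 10, 20]}
best_parameters, best_score = random_search(
    toy_evaluate, space, max_iterations=50, target_score=0.0
)
print(best_parameters, best_score)
```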
  • 35. BigML, Inc 35 Advanced WhizzML workflows
  • 36. BigML, Inc 36 Advanced WhizzML workflows Boosting: The final model is an ensemble of models (diagram: T0/F0, T1/F1, T2/F2, …, T8/F8)
  • 37. BigML, Inc 37 Advanced WhizzML workflows Script it once, for everybody anywhere Publish scripts in the gallery Add scripts to your menus