This example is based on the Wine Quality dataset from the
University of California Irvine Machine Learning Repository:
• Current cloud providers (Microsoft, Amazon, Google, …)
have an interest in selling computing power as an API
• Machine learning takes a lot of computing power
They have an interest in making it the next buzzword.
They made cloud a buzzword, and they can do it again.
However, this time we won’t use any APIs, but an open source tool called Accord.NET.
A case for machine learning
We have some existing sample data.
We want to estimate a variable that is known in the sample data
but not in real life.
We expect that the real results will follow the sample data.
Randomize the order of the sample data rows and split them into two sets:
1) Training set
• Used to find the correct model.
2) Model evaluation set
• Used to verify that the model works on data outside the training samples
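The shuffle-and-split step can be sketched as follows. This is a minimal illustration in plain Python, not the Accord.NET API; the function name `split_rows`, the 70/30 ratio and the fixed seed are assumptions made for the example.

```python
import random


def split_rows(rows, training_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into two sets.

    training_fraction and seed are illustrative defaults,
    not values taken from the slides.
    """
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: ten row indices instead of real wine rows.
rows = list(range(10))
training_set, evaluation_set = split_rows(rows)
```

Every row ends up in exactly one of the two sets, so the evaluation set only contains rows the model has never seen during training.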
A sample dataset:
Each row has just the measured parameters of a wine and
a quality score from 0 to 10 voted by people.
Can we estimate the quality of an unlisted wine based
on the features we know?
fixed    volatile  citric  residual             free sulfur  total sulfur
acidity  acidity   acid    sugar     chlorides  dioxide      dioxide       density  pH    sulphates  alcohol  quality
7.4      0.7       0       1.9       0.076      11           34            0.9978   3.51  0.56       9.4      5
7.3      0.65      0       1.2       0.065      15           21            0.9946   3.39  0.47       10       7
Creating a linear regression over
one feature is relatively simple.
y = k x + b
The dataset has a large number of
wines with different alcohol levels.
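For a single feature, the slope k and intercept b of y = k x + b have a closed-form least-squares solution. The sketch below is plain Python, not Accord.NET, and the (alcohol, quality) pairs are made-up illustration values, not rows from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = k * x + b with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


# Hypothetical alcohol levels and quality scores, for illustration only.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5.0, 5.5, 6.0, 6.5]
k, b = fit_line(alcohol, quality)
```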
But the dataset also has 10 other
features, so how do we make a
regression over all 11 variables
combined? Does it take forever…?
y = k1 x1 + k2 x2 + … + kn xn + b
Original picture from: http://brandewinder.com/2016/08/06/gradient-boosting-part-1/
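It does not take forever: the coefficients k1 … kn and b can be solved in one step from the normal equations. The following is a minimal plain-Python sketch of that idea, assuming the features are not linearly dependent; it is not how Accord.NET is implemented, and the helper names and sample rows are made up for the example.

```python
def solve(a, rhs):
    """Solve a * x = rhs by Gauss-Jordan elimination with partial pivoting.

    Assumes the system is not singular.
    """
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(rows, ys):
    """Return [k1, ..., kn, b] minimizing the squared error."""
    xs = [list(row) + [1.0] for row in rows]  # constant column carries b
    n = len(xs[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(n)]
    return solve(xtx, xty)


# Synthetic rows generated from y = 2 * x1 + 3 * x2 + 1.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
ys = [2 * x1 + 3 * x2 + 1 for x1, x2 in rows]
k1, k2, b = fit_linear(rows, ys)
```

With 11 features the same code solves a 12 x 12 system, which is still instant.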
Accord.NET cancer example
Age  Smokes  Had cancer
55   0       FALSE
28   0       FALSE
65   1       FALSE
46   0       TRUE
86   1       TRUE
56   1       TRUE
85   0       FALSE
33   0       FALSE
21   1       FALSE
42   1       TRUE
Feature Odds ratio
Calculation y(x0, x1) =
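The yes/no outcome and the odds-ratio column suggest logistic regression. The sketch below fits one to the table above with plain gradient descent in Python. It is an illustration under assumptions, not the Accord.NET sample itself: age is scaled by 1/100, and the learning rate and step count are arbitrary choices.

```python
import math

# (age, smokes, had_cancer) rows copied from the table above.
data = [(55, 0, 0), (28, 0, 0), (65, 1, 0), (46, 0, 1), (86, 1, 1),
        (56, 1, 1), (85, 0, 0), (33, 0, 0), (21, 1, 0), (42, 1, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, age, smokes):
    """Estimated probability of cancer; w = [intercept, age weight, smoking weight]."""
    return sigmoid(w[0] + w[1] * (age / 100.0) + w[2] * smokes)


def log_loss(w):
    """Mean negative log-likelihood of the table under weights w."""
    total = 0.0
    for age, smokes, y in data:
        p = predict(w, age, smokes)
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(data)


def train(steps=5000, rate=0.1):
    """Batch gradient descent; steps and rate are illustrative choices."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for age, smokes, y in data:
            err = predict(w, age, smokes) - y
            grad[0] += err
            grad[1] += err * (age / 100.0)
            grad[2] += err * smokes
        w = [wi - rate * g / len(data) for wi, g in zip(w, grad)]
    return w


weights = train()
# Odds ratio of a feature = exp(its weight). Because of the scaling,
# the age ratio here is per 100 years.
odds_ratios = [math.exp(w) for w in weights]
```

With only ten rows the fitted numbers carry little meaning; the point is the shape of the calculation.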
Instead of combining slopes, create a combination
of decision nodes: estimate a few (discrete) categories
based on a combination of decision nodes.
What method should I choose?
[Figure: decision tree with the nodes "pH > 3.5" and "Alcohol > 10.6"]
Manual example and theory:
The figure has just 2 stumps, but real-life AI can
generate huge trees.
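A tree of decision nodes can be stored as nested data and walked from the root to a leaf. In the sketch below only the two tests (pH > 3.5 and alcohol > 10.6) come from the figure. How the nodes are nested and the leaf labels "good" and "average" are hypothetical, chosen only to make the example runnable.

```python
# Hypothetical tree: the nesting and the leaf labels are assumptions;
# only the two threshold tests appear in the original figure.
tree = {
    "feature": "pH", "threshold": 3.5,
    "yes": {
        "feature": "alcohol", "threshold": 10.6,
        "yes": "good",
        "no": "average",
    },
    "no": "average",
}


def classify(node, wine):
    """Follow the decision nodes until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        branch = "yes" if wine[node["feature"]] > node["threshold"] else "no"
        node = node[branch]
    return node


label = classify(tree, {"pH": 3.6, "alcohol": 12.0})
```

A learned tree has the same shape, just with many more nodes, so the same walker works unchanged.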
Use-case: Quality for our event’s wine from Alko
Data from Alko analysis laboratory, wine entry L2BIBS34016:
In Finnish                      In English
Alk-%            12,01          Alcohol           12.01 %
Sokeri           3,5 g/l        Sugar             3.5 g/l
Haihtuvat hapot  0,5 g/l        Volatile acidity  0.5 g/l
Kokonaisrikki    96 mg/l        Total sulfur      96 mg/l
Vapaa rikki      36 mg/l        Free sulfur       36 mg/l
Sitruunahappo    0,045 g/l      Citric acid       0.045 g/l
• This wine is from Chile, while the sample data (the UCI Wine Quality dataset) is from Portugal,
so our algorithm has to be able to work outside the dataset.
• Parameter mismatch, two options:
1) Convert the parameters
2) Remove the parameter from the learning process
Measure the error and its effect on model quality.
We don’t have    Mean in the dataset
Fixed acidity    8.32 g/l
Chlorides        0.087 g/l
Density          0.9967 g/cm³
Sulphates        0.66 g/l

Extra data       Known
Total acids      4.62 g/l
(Alko provided the data I asked for by email)
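One way to read the tables above is to build the model input from the Alko values and fall back to the dataset mean where a value is missing. The sketch below does that in plain Python. Mapping "Sugar" to residual sugar and "Total/Free sulfur" to the dataset's sulfur dioxide columns is an assumption (it is exactly the parameter mismatch noted above), and `build_features` is a hypothetical helper name.

```python
# The 11 input features of the Wine Quality dataset.
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]

# Values from the Alko laboratory table (name mapping is an assumption).
alko = {
    "alcohol": 12.01,
    "residual sugar": 3.5,
    "volatile acidity": 0.5,
    "total sulfur dioxide": 96,
    "free sulfur dioxide": 36,
    "citric acid": 0.045,
}

# Dataset means for the parameters we don't have.
dataset_means = {
    "fixed acidity": 8.32,
    "chlorides": 0.087,
    "density": 0.9967,
    "sulphates": 0.66,
}


def build_features(known, fallback):
    """Take known values first, then fallback means; report what is still missing."""
    values, unresolved = {}, []
    for name in FEATURES:
        if name in known:
            values[name] = known[name]
        elif name in fallback:
            values[name] = fallback[name]
        else:
            unresolved.append(name)
    return values, unresolved


values, unresolved = build_features(alko, dataset_means)
```

Any feature left in `unresolved` has to be handled with option 2, removing it from the learning process, before the model can score this wine.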