Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and distributions, Variance, Distribution properties and arithmetic, Samples/CLT, Basic machine learning algorithms, Linear regression, SVM, Naive Bayes
3. • d_m^y: ordered sets of size m of cost values for y ∈ Y
• f: X → Y: a function that is optimized
• P(d_m^y | f, m, a): the conditional probability of getting d_m^y by m times running algorithm a on the function f

Theorem: For any pair of algorithms a1 and a2:
Σ_f P(d_m^y | f, m, a1) = Σ_f P(d_m^y | f, m, a2)

All algorithms are equal
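The theorem can be checked by brute force on a tiny search space. The sketch below (Python; the three-point domain and the two fixed visiting orders are assumptions chosen for illustration) enumerates every possible function f and confirms that, summed over all f, two different deterministic search algorithms produce identical distributions of cost sequences:

```python
from itertools import product
from collections import Counter

X = [0, 1, 2]  # a tiny search space
Y = [0, 1]     # possible cost values

def run(order, f):
    """Run a deterministic search that visits points in a fixed order
    and record the resulting cost sequence d_m^y."""
    return tuple(f[x] for x in order)

# Enumerate every possible objective function f: X -> Y (2^3 = 8 of them).
functions = [dict(zip(X, ys)) for ys in product(Y, repeat=len(X))]

a1 = [0, 1, 2]  # algorithm 1: left-to-right sweep
a2 = [2, 0, 1]  # algorithm 2: a different fixed visiting order

dist_a1 = Counter(run(a1, f) for f in functions)
dist_a2 = Counter(run(a2, f) for f in functions)

# Summed over all functions, both algorithms see the same distribution
# of cost sequences -- the No Free Lunch theorem in miniature.
assert dist_a1 == dist_a2
```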
4. "If an algorithm does particularly well on average for one class of problems, then it must do worse on average over the remaining problems."
David H. Wolpert and William G. Macready: No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation, 1(1):67-82
5. There is no "one way" to do data analysis
• But there are some standard techniques that often perform well
• Many factors influence the suitable techniques:
• Data
• Problem to be solved
• Available resources
• …
⇒ A broad portfolio of data analysis techniques is required
6. Category | Techniques Covered | Problem to be solved
Association Rules | Apriori | Relationships between items
Clustering | K-Means, DBSCAN | Grouping of similar items; identification of structures
Classification | k-Nearest Neighbors, Decision Trees, Random Forests, Logistic Regression, Naive Bayes, Support Vector Machines, Neural Networks | Assignment of labels to objects
Regression | Linear Regression, Ridge, Lasso | Relationship between outcome and inputs
Time Series Analysis | ARMA | Identification of temporal structures; forecasting of temporal processes
Text Mining | Bag-of-Words, Stemming/Lemmatization, TF-IDF | Analysis of textual data
8. • Definition after Tom M. Mitchell [2]:
• "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
• Relation to the data analysis techniques:
• Experience E: our data
• Task T: clustering/association mining/classification/…
• Performance measure P: depends on the task
9. • How would you describe this picture with general concepts?
• Blue background
• Has a fin
• Oval body
• Black top, white bottom
10. • X is the object space
• φ is the feature map, φ: X → F
• F is the feature space: F = {φ(x), x ∈ X}
• Example:
• Five-dimensional feature space with dimensions hasFin, shape, colorTop, colorBottom, background
• φ("whalepicture") = (true, oval, black, white, blue)
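A feature map like this can be sketched as a simple lookup table (Python; the dimension ordering follows the slide, the second entry mirrors the bear example used later, and the code itself is only illustrative):

```python
# A toy feature map phi: X -> F over the five dimensions
# (hasFin, shape, colorTop, colorBottom, background).
FEATURES = {
    "whalepicture": (True, "oval", "black", "white", "blue"),
    "bearpicture": (False, "rectangle", "brown", "brown", "green"),
}

def phi(obj):
    """Map an object from the object space X to its feature vector in F."""
    return FEATURES[obj]

print(phi("whalepicture"))  # (True, 'oval', 'black', 'white', 'blue')
```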
11. • Stevens' levels of measurement

Scale | Property | Allowed Operations | Example
Nominal | Classification or membership | =, ≠ | Color as "black", "white" and "blue"
Ordinal | Comparison or levels | =, ≠, >, < | Size in "small", "medium", and "large"
Interval | Differences or affinities | =, ≠, >, <, +, - | Dates, temperatures, discrete numeric values
Ratio | Magnitudes or amounts | =, ≠, >, <, +, -, *, / | Size in cm, duration in seconds, continuous numeric values

Nominal and ordinal scales are categorical.
12. • Many algorithms can only work with numeric features
• Encode categorical features as binary numeric features
• Example: x ∈ {small, medium, large}
• Encode as three variables x_small, x_medium, x_large
• x_small = 1 if x = small, 0 otherwise; analogously for the other variables
• Can also use one variable less; the remaining case is then encoded by all zeros
• This is called one-hot encoding
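The encoding can be sketched in a few lines of Python (plain lists, no libraries; the helper name `one_hot` is ours):

```python
def one_hot(value, categories):
    """Encode a categorical value as binary indicator features,
    one per category."""
    return [1 if value == category else 0 for category in categories]

categories = ["small", "medium", "large"]
print(one_hot("small", categories))   # [1, 0, 0]
print(one_hot("medium", categories))  # [0, 1, 0]

# Dropping one category still identifies every case:
# "small" is then encoded as all zeros.
print(one_hot("small", categories[1:]))  # [0, 0]
```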
13. • Instances of objects described by their features
• Supervised learning if the value of interest is known
⇒ Classification, regression
• Otherwise unsupervised learning
⇒ Clustering, association rule mining

φ(o):
hasFin | shape | colorTop | colorBottom | background | value of interest
true | oval | black | white | blue | whale
false | rectangle | brown | brown | green | bear
… | … | … | … | … | …
14. • Data for the evaluation of analysis results
• Same distribution as the training data
• Training data ≠ test data
• Evaluate generalization
• Avoid overfitting
• Overfitted analysis results are only valid on the training data and do not work on unseen data
• Test data is often difficult to obtain ("And where do I get the test data?")
15. • Data not used for training at all
• Commonly used hold-out data sizes:
• 50% of all data
• 33% of all data
• 25% of all data in case a validation set is used
• Example:
• Nine months of customer transactions available
• First six months as training data
• Last three months as test data
Depends a lot on the available data!
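A positional hold-out split, as in the nine-months example above, can be sketched like this (Python; the 25% test fraction is just one of the common choices listed):

```python
def holdout_split(data, test_fraction=0.25):
    """Split data by position: the first part becomes training data,
    the last part becomes test data (e.g. first months vs. last months)."""
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

months = list(range(1, 13))  # twelve months of stand-in data
train, test = holdout_split(months, test_fraction=0.25)
print(train)  # months 1-9 for training
print(test)   # months 10-12 for testing
```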
16. • Create k partitions of the available data
• Use one partition for testing, all others for training
• Estimate performance by averaging over the iterations

[Figure: the available data is split into five partitions; in iteration i (i = 1, …, 5), partition i serves as test data and the remaining four partitions as training data.]
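The partitioning scheme above can be sketched as follows (Python; index-based, so it works for any dataset that can be addressed by position):

```python
def k_fold_splits(n, k):
    """Split indices 0..n-1 into k consecutive partitions and yield
    (test, train) index lists, holding out one partition per iteration."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield test, train

for test, train in k_fold_splits(10, 5):
    print("test:", test, "train:", train)
```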
18. • No generic algorithm for all problems
• Objects are described by features
• Features are used for learning about objects
• Data is usually split into different sets for different purposes
52. A normal distribution shows the probability density for a population of continuous data (for example, height in cm for all NBA players).
In other words, it shows how likely it is that any player from the NBA is of a certain height.
Most players are around the mean/average height; fewer are much taller or much shorter.
A normal distribution is symmetrical on both sides of the mean.
You might also see this referred to as a Gaussian distribution!
53. The spread of the values in our population is measured using a metric called standard deviation.
In addition, we also use the Empirical Rule, which tells us that...
• 68% of the values will fall within 1 standard deviation above and below the mean
• 95% of the values will fall within 2 standard deviations above and below the mean
• 99.7% of the values will fall within 3 standard deviations above and below the mean
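These three percentages follow directly from the standard normal CDF, which Python's standard library can evaluate via the error function (no third-party libraries needed):

```python
from math import erf, sqrt

def within_k_sd(k):
    """P(|X - mu| <= k * sigma) for a normal distribution, using the
    standard normal CDF Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    return phi(k) - phi(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {within_k_sd(k):.4f}")
# within 1 sd: 0.6827, within 2 sd: 0.9545, within 3 sd: 0.9973
```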
55. Just like a normal distribution, a t-distribution is symmetrical around the mean, and its breadth is based on the deviation within the data.
While a normal distribution works with a population, a t-distribution is designed for situations where the sample size is small.
The shape of the t-distribution becomes broader as the sample size decreases, to take into account the extra uncertainty we are faced with.
The shape of a t-distribution relates to the number of degrees of freedom, which is calculated as the sample size minus one.
As the sample size, and thus the degrees of freedom, get larger, the t-distribution tends towards a normal distribution, as with a larger sample we're more certain when estimating the true population statistics.
57. A Binomial Distribution can end up looking a lot like the shape of a normal distribution.
The main difference is that instead of plotting continuous data, it plots a distribution of two possible discrete outcomes, for example, the results from flipping a coin.
Imagine flipping a coin 10 times, and from those 10 flips, noting down how many were "Heads". It could be any number between 0 and 10.
Now imagine repeating that task 1,000 times...
If the coin we are using is indeed fair (not biased to heads or tails), then the distribution of outcomes should start to take a familiar bell-like shape.
In the vast majority of cases, we get 4, 5, or 6 "Heads" from each set of 10 flips, and the likelihood of getting more extreme results is much rarer!
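Rather than simulating, the exact probabilities for the coin-flip experiment can be computed from the binomial probability mass function (Python standard library only):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with
    success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of k heads in 10 fair coin flips.
for k in range(11):
    print(k, round(binom_pmf(k, 10, 0.5), 4))
# k = 4, 5, 6 together cover about 66% of all outcomes, matching the
# observation that mid-range counts dominate.
```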
58. The Bernoulli Distribution is a special case of the Binomial Distribution.
It considers only two possible outcomes: success or failure, true or false.
It's a really simple distribution, but worth knowing!
As an example, consider the probability of rolling a 6 with a standard die.
If we roll a die many, many times, we should end up with a probability of rolling a 6 of 1 out of every 6 times (or 16.7%), and thus a probability of not rolling a 6 (in other words, rolling a 1, 2, 3, 4 or 5) of 5 times out of 6 (or 83.3%)!
59. A Uniform Distribution is a distribution in which all events are equally likely to occur.
As an example, consider the results from rolling a die many, many times.
We're looking at which number we got on each roll and tallying these up.
If we roll the die enough times (and the die is fair), we should end up with a completely uniform probability where the chance of getting any outcome is exactly the same.
60. A Poisson Distribution is a discrete distribution similar to the Binomial Distribution (in that we're plotting the probability of whole-numbered outcomes).
Unlike the other distributions we have seen, however, this one is not symmetrical; it is instead bounded between 0 and infinity.
The Poisson distribution describes the number of events or outcomes that occur during some fixed interval.
Most commonly this is a time interval, as in our example, where we are plotting the distribution of sales per hour in a shop.
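The probabilities for the sales-per-hour example can be computed directly from the Poisson pmf (Python; the average rate of 3 sales per hour is an assumed figure for illustration):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events in an interval whose average event count is lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Distribution of sales per hour for a shop averaging 3 sales per hour.
for k in range(9):
    print(k, round(poisson_pmf(k, 3), 4))
```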
65. In Statistics, a Hypothesis Test is used to assess and understand the plausibility, or likelihood, of some assumed viewpoint (a hypothesis), based upon data.
In other words, we are using Statistics to test or investigate ideas.
Let's say we're the coach of an NBA Basketball team; we might find ourselves with the below dilemma.
The New Sensation: Games Played: 2, Shooting Rate: 60%
The Current Star: Games Played: 102, Shooting Rate: 49%
My new player seems to be a better shooter... but she's only played 2 games. I need more confidence before making a big change! Let's test it!
66. There are many, many different types of hypothesis test, each of which is appropriate for different types of data and/or for dealing with different scenarios and comparisons.
Here we will cover several commonly used tests...
• One Sample T-Test
• Independent Samples T-Test
• Paired Samples T-Test
• Chi-Squared Test of Independence
67. A One Sample T-Test looks to assess differences between a sample and the entire population from which that sample is drawn.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Is the average height (cm) of my team higher than that of the entire NBA?
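The t statistic for this question can be computed with the standard library alone (the heights below are made-up numbers, and the NBA-wide mean of 200 cm is an assumed value; looking up the p-value for the statistic would normally be done with a t-distribution table or a stats library):

```python
from statistics import mean, stdev

def one_sample_t(sample, population_mean):
    """One-sample t statistic: t = (x_bar - mu) / (s / sqrt(n)),
    where s is the sample standard deviation."""
    n = len(sample)
    return (mean(sample) - population_mean) / (stdev(sample) / n**0.5)

team_heights = [198, 205, 210, 202, 199, 208, 203]  # hypothetical team (cm)
t = one_sample_t(team_heights, population_mean=200)
print(round(t, 3))  # positive t: the team mean is above the assumed NBA mean
```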
68. An Independent Samples T-Test looks to assess differences between one sample and another, separate sample.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Is the average height (cm) of my team higher than that of our rival team?
69. A Paired Samples T-Test looks to assess differences between a sample and that same sample at another point in time.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Has the average jumping height (cm) for my team increased after our 4-week fitness programme?
70. The Chi-Square Test for Independence is used to determine if there is a significant relationship between two categorical variables.
It compares the actual and expected frequencies to determine if there is a dependence between the two variables.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Is the 3-point shooting % of my newly signed player higher than that of my current star player?
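The chi-square statistic compares observed counts with the counts expected under independence. A minimal sketch (Python; the made/missed counts are invented to match the shooting rates in the earlier coach example):

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table of observed counts.
    Expected counts are derived from row and column totals, assuming
    the two variables are independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Rows: players; columns: [made, missed] 3-point attempts (hypothetical).
table = [[12, 8],    # new player: 60% on 20 attempts
         [50, 52]]   # current star: ~49% on 102 attempts
print(round(chi_square(table), 3))
```

A small statistic here would mean the difference could easily be chance on so few attempts; the p-value would then come from the chi-square distribution with (rows - 1) * (cols - 1) degrees of freedom.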
71. The outcome of a hypothesis test (in other words, whether or not we support our initial assumption) is influenced by our Acceptance Criteria.
This Acceptance Criteria is often based upon a p-value.
So, let's talk about what a p-value is, and what a p-value is not!
A p-value essentially helps us assess whether the results of some finding or test we have conducted are likely to be ordinary, or likely to be strange.
72. • In other words, for the results we've got, do we think we'd get similar results if we ran many more tests, or might our results be something of a rarity? With an Acceptance Criteria of 0.05, we are essentially saying that we want a likelihood of 5% (or lower) that our result happened by chance.
• A p-value is not the probability that either hypothesis is true; it is the probability of seeing a result at least as extreme as the one we observed, if the Null Hypothesis were true.
• A p-value does not tell us how different two samples are. Two samples with the same difference in means, but larger/smaller sample sizes, will get different p-values. A p-value is instead telling us how confident we can be that the samples really are different.
73. It is important to lay down our hypotheses/assumptions and a required level of confidence before running the test itself.
Generally, we specify 3 things: the Null Hypothesis, the Alternate Hypothesis, and our Acceptance Criteria...
• Null Hypothesis (H0): The Null Hypothesis is where we state our initial viewpoint. In statistics, and specifically hypothesis testing, our initial viewpoint is always that the result is purely by chance, or that there is no relationship or association between two outcomes or groups.
• Alternate Hypothesis (HA): The Alternate Hypothesis is essentially the opposite viewpoint to the Null Hypothesis: that the result is not by chance, or that there is a relationship between two outcomes or groups.
• Acceptance Criteria: The Acceptance Criteria is the threshold we set that determines whether there is enough evidence to reject the Null Hypothesis. This is often set to a p-value of 0.05, but it does not have to be. If we want more certainty, we can set this to a lower value.