Dr. D. Sugumar
Associate Professor/ECE
• Overview
• Foundational Concepts
• Summary
• $d_m^y$: ordered sets of size $m$ of cost values $y \in Y$
• $f: X \to Y$: a function that is optimized
• $P(d_m^y \mid f, m, a)$: the conditional probability of getting $d_m^y$ by running algorithm $a$ $m$ times on the function $f$
Theorem: For any pair of algorithms $a_1$ and $a_2$,
$$\sum_f P(d_m^y \mid f, m, a_1) = \sum_f P(d_m^y \mid f, m, a_2)$$
Averaged over all possible functions, all algorithms are equal
“if an algorithm does
particularly well on average
for one class of problems then
it must do worse on average
over the remaining problems”
David H. Wolpert and William G. Macready: No Free Lunch Theorems for
Optimization, IEEE Transactions on Evolutionary Computation, 1(1):67-82, 1997
→ There is no “one way” to do data analysis
• But there are some standard techniques that often perform well
• Many factors influence the suitable techniques
  • Data
  • Problem to be solved
  • Available resources
  • …
→ Broad portfolio of data analysis techniques required
| Category | Techniques covered | Problem to be solved |
|---|---|---|
| Association Rules | Apriori | Relationships between items |
| Clustering | K-Means Clustering, DBSCAN | Grouping of similar items; identification of structures |
| Classification | K-Nearest Neighbor, Decision Trees, Random Forests, Logistic Regression, Naive Bayes, Support Vector Machines, Neural Networks | Assignment of labels to objects |
| Regression | Linear Regression, Ridge, Lasso | Relationship between outcome and inputs |
| Time Series Analysis | ARMA | Identification of temporal structures; forecasting of temporal processes |
| Text Mining | Bag-of-Words, Stemming/Lemmatization, TF-IDF | Analysis of textual data |
• Overview
• Foundational Concepts
• Summary
• Definition after Tom M. Mitchell [2]:
  • A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
• Relation to the data analysis techniques
  • Experience $E$: our data
  • Task $T$: clustering/association mining/classification/…
  • Performance measure $P$: depends on the task
• How would you describe this picture with general concepts?
  • Blue background
  • Has a fin
  • Oval body
  • Black top, white bottom
• $O$ is the object space
• $\phi$ is the feature map
• $\mathcal{F}$ is the feature space
• $\mathcal{F} = \{\phi(o) : o \in O\}$
• Example:
  • Five-dimensional space with dimensions as above
  • $\phi(\text{"whalepicture"}) = (\text{true}, \text{oval}, \text{black}, \text{white}, \text{blue})$
Feature map $\phi: O \to \mathcal{F}$ (object → features):
  hasFin = true
  shape = oval
  colorTop = black
  colorBottom = white
  background = blue
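A minimal sketch of the feature map above, assuming the picture is already available as a hand-written description (a Python dict). The names `Features`, `feature_map` and the `filename` entry are hypothetical; they only illustrate that the feature map keeps the five chosen dimensions and ignores everything else about the object.

```python
from typing import NamedTuple


class Features(NamedTuple):
    """One point in the five-dimensional feature space."""
    has_fin: bool
    shape: str
    color_top: str
    color_bottom: str
    background: str


def feature_map(obj: dict) -> Features:
    """Hypothetical feature map: object description -> feature vector."""
    return Features(
        has_fin=obj["hasFin"],
        shape=obj["shape"],
        color_top=obj["colorTop"],
        color_bottom=obj["colorBottom"],
        background=obj["background"],
    )


# The object: a picture, represented here by a hand-written description
# plus a detail that the feature map deliberately ignores.
whale_picture = {
    "hasFin": True,
    "shape": "oval",
    "colorTop": "black",
    "colorBottom": "white",
    "background": "blue",
    "filename": "whale.jpg",  # assumed extra detail, not a feature
}

whale_features = feature_map(whale_picture)
```

Two different pictures can map to the same feature vector, which is why the choice of features limits what can be learned about the objects.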
• Stevens' levels of measurement

| Scale | Property | Allowed operations | Example |
|---|---|---|---|
| Nominal | Classification or membership | =, ≠ | Color as “black”, “white” and “blue” |
| Ordinal | Comparison or levels | =, ≠, >, < | Size in “small”, “medium”, and “large” |
| Interval | Differences or affinities | =, ≠, >, <, +, − | Dates, temperatures, discrete numeric values |
| Ratio | Magnitudes or amounts | =, ≠, >, <, +, −, ⋅, / | Size in cm, duration in seconds, continuous numeric values |

Nominal and ordinal scales are categorical.
• Many algorithms can only work with numeric features
• Encode categorical features as binary numeric features
• Example: $x \in \{\text{small}, \text{medium}, \text{large}\}$
  • Encode as three variables $x_{small}$, $x_{medium}$, $x_{large}$
  • $x_{small} = \begin{cases} 1 & \text{if } x = \text{small} \\ 0 & \text{otherwise} \end{cases}$, …
  • Can also use one variable less; the remaining case is then encoded by all zeros
• This is called One-Hot-Encoding
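The encoding above can be sketched in a few lines of plain Python. The function name `one_hot` and the `drop_last` flag are assumptions made for this illustration; `drop_last=True` corresponds to the variant with one variable less, where the last category is encoded by all zeros.

```python
def one_hot(value, categories, drop_last=False):
    """Encode one categorical value as a list of 0/1 indicator variables."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # With drop_last, the final category has no indicator of its own.
    used = categories[:-1] if drop_last else categories
    return [1 if value == category else 0 for category in used]


SIZES = ["small", "medium", "large"]

# Three indicator variables: x_small, x_medium, x_large
full = [one_hot(size, SIZES) for size in SIZES]

# Two indicator variables: "large" is encoded as all zeros
reduced = [one_hot(size, SIZES, drop_last=True) for size in SIZES]
```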
• Instances of objects described by their features
• Supervised learning if the value of interest is known
  → Classification, regression
• Otherwise unsupervised learning
  → Clustering, Association Rule Mining

The first five columns are the features $\phi(o)$:

| hasFin | shape | colorTop | colorBottom | background | value of interest |
|---|---|---|---|---|---|
| true | oval | black | black | blue | whale |
| false | rectangle | brown | brown | green | bear |
| … | … | … | … | … | … |
• Data for the evaluation of analysis results
  • Same distribution as training data
  • Training data ≠ Test data
• Evaluate generalization
• Avoid overfitting
  • Overfitted analysis results are only valid on the training data
  • They differ or do not work on unseen data
• Test data often difficult to obtain
And where do I get the test data?
• Data not used for training at all
• Commonly used hold-out data sizes
  • 50% of all data
  • 33% of all data
  • 25% of all data in case a validation set is used
• Example:
  • Nine months of customer transactions available
  • First six months as training data
  • Last three months as test data
Depends a lot on available data!
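The nine-month example can be sketched as a chronological hold-out split. The function name `holdout_split` and the month labels are hypothetical, and the sketch assumes the records are already sorted by time, so the most recent part becomes the test data.

```python
def holdout_split(records, test_fraction=1 / 3):
    """Split chronologically ordered records into (training, test) data."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(len(records) * test_fraction)
    cut = len(records) - n_test
    # Everything before the cut is used for training, the rest is held out.
    return records[:cut], records[cut:]


# Placeholder labels standing in for nine months of customer transactions.
months = [f"month_{i}" for i in range(1, 10)]
train, test = holdout_split(months)
```

Splitting by time rather than at random keeps future transactions out of the training data, which is one reasonable choice for this kind of example.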
• Create $k$ partitions of available data
• One partition for testing, all others for training
• Estimate performance by averaging over the iterations

Example with $k = 5$ (available data split into Partitions 1 to 5):

| Iteration | Test data | Training data |
|---|---|---|
| Iteration 1 | Partition 1 | Partitions 2, 3, 4, 5 |
| Iteration 2 | Partition 2 | Partitions 1, 3, 4, 5 |
| Iteration 3 | Partition 3 | Partitions 1, 2, 4, 5 |
| Iteration 4 | Partition 4 | Partitions 1, 2, 3, 5 |
| Iteration 5 | Partition 5 | Partitions 1, 2, 3, 4 |
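The procedure above can be sketched as follows. The names `k_fold_indices`, `cross_validate` and the `evaluate` callback are assumptions for illustration; the sketch partitions the data in its given order without shuffling, which is a simplification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    if not 2 <= k <= n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so partition sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average a user-supplied evaluate(train, test) score over k iterations."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 5))
```

Each sample is used for testing exactly once and for training in the other iterations, so no data is wasted when test data is hard to obtain.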
• Overview
• Foundational Concepts
• Summary
• No generic algorithm for all problems
• Objects are described by features
• Features are used for learning about objects
• Data usually split into different sets for different purposes
A normal distribution shows the probability density for a population of continuous data (for example,
height in cm for all NBA players).
In other words, it shows how likely it is that any player from the NBA is of a certain height.
Most players are around the mean/average height; fewer are much taller or much shorter.
A normal distribution is symmetrical on both sides of the mean.
You might also see this referred to as a Gaussian Distribution!
The spread of the values in our population is measured using a metric called the standard deviation.
In addition, we also use the Empirical Rule, which tells us that for normally distributed data...
• About 68% of the values will fall within 1 standard deviation above and below the mean
• About 95% of the values will fall within 2 standard deviations above and below the mean
• About 99.7% of the values will fall within 3 standard deviations above and below the mean
https://www.kaggle.com/code/ahmedmohamedmahrous/all-distributions-in-statistics/notebook
https://www.skillsyouneed.com/num/statistical-distributions.html
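The percentages in the Empirical Rule can be checked numerically. This is a small sketch using the error function from Python's standard library; the function name `within_k_sd` is an assumption for illustration.

```python
import math


def within_k_sd(k):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Coverage for 1, 2 and 3 standard deviations
coverage = {k: within_k_sd(k) for k in (1, 2, 3)}

for k, probability in coverage.items():
    print(f"within {k} standard deviation(s): {probability:.1%}")
```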
Just like a normal distribution, a t-distribution is symmetrical around the mean, and its breadth
depends on the variation within the data.
While a normal distribution describes a population, a t-distribution is designed for situations
where the sample size is small.
The shape of the t-distribution becomes broader as the sample size decreases, to take into
account the extra uncertainty we are faced with.
The shape of a t-distribution is determined by the number of degrees of freedom, which (for a
single sample) is calculated as the sample size minus one.
As the sample size, and thus the degrees of freedom, get larger, the t-distribution tends towards
a normal distribution, because with a larger sample we're more certain when estimating the true
population statistics.
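To illustrate how the t-distribution broadens for small samples and approaches the normal distribution for large ones, the following sketch evaluates both density functions with the standard library only. The function names and the chosen degrees of freedom (2, 10, 100) are assumptions for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Density far from the mean (x = 3): larger for fewer degrees of freedom,
# meaning extreme values are considered more plausible with small samples.
tail_density = {df: t_pdf(3.0, df) for df in (2, 10, 100)}
```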
A Binomial Distribution can end up looking a lot like the shape of a normal
distribution.
The main difference is that instead of plotting continuous data, it plots
the number of successes in a fixed number of trials that each have two possible
discrete outcomes, for example the results from flipping a coin.
Imagine flipping a coin 10 times, and from those 10 flips, noting down how
many were "Heads".
It could be any number between 0 and 10.
Now imagine repeating that task 1,000 times...
If the coin we are using is indeed fair (not biased to heads or tails) then the
distribution of outcomes should start to look symmetrical and bell-shaped, centred on 5 "heads".
In about two-thirds of cases, we get 4, 5, or 6 "heads" from each set of 10
flips, and more extreme results are much rarer!
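The coin example can be computed exactly instead of simulated. This is a short sketch using the binomial probability formula; the names `binomial_pmf`, `heads_pmf` and `p_4_to_6` are assumptions for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of 0, 1, ..., 10 heads in 10 flips of a fair coin
heads_pmf = [binomial_pmf(k, 10, 0.5) for k in range(11)]

# Probability of getting 4, 5 or 6 heads
p_4_to_6 = sum(heads_pmf[4:7])
```

For a fair coin the probabilities are symmetric around 5 heads, and 4, 5 or 6 heads together account for 672 of the 1,024 equally likely flip sequences.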
The Bernoulli Distribution is a special case of the
Binomial Distribution with a single trial.
It considers only two possible outcomes, success or
failure, true or false.
It's a really simple distribution, but worth knowing!
In this example we're looking at the probability of
rolling a 6 with a standard die.
If we roll a fair die many, many times, we should roll
a 6 about 1 out of every 6 times (16.7%),
and thus not roll a 6 (in other words roll a
1, 2, 3, 4 or 5) about 5 times out of 6 (83.3%).
A Uniform Distribution is a distribution in which all outcomes
are equally likely to occur.
Here, we're looking at the results from rolling a die many,
many times.
We're looking at which number we got on each roll and
tallying these up.
If we roll the die enough times (and the die is fair) we
should end up with a close to uniform distribution of tallies,
because the chance of getting any outcome is exactly the same.
A Poisson Distribution is a discrete distribution similar to the
Binomial Distribution (in that we're plotting the probability of
whole-numbered outcomes).
Unlike the normal and t-distributions, however, this one is
not symmetrical; it is instead bounded below by 0 with no upper limit.
The Poisson distribution describes the number of events or
outcomes that occur during some fixed interval.
Most commonly this is a time interval, as in our example,
where we are plotting the distribution of sales per hour in a
shop.
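A small sketch of the sales-per-hour example using the Poisson probability formula. The average rate of 3 sales per hour is an assumed value chosen only for illustration, as are the names `poisson_pmf`, `SALES_PER_HOUR` and `sales_pmf`.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval with the given average rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


SALES_PER_HOUR = 3.0  # assumed average rate, not taken from the slides

# Probability of 0, 1, ..., 20 sales in one hour
sales_pmf = [poisson_pmf(k, SALES_PER_HOUR) for k in range(21)]
```

The probabilities start at zero sales and have a long tail towards larger counts, which is the asymmetry described above.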
In Statistics, a Hypothesis Test is used to assess and understand the plausibility, or likelihood,
of some assumed viewpoint (a hypothesis), based upon data.
In other words, we are using Statistics to test or investigate ideas.
Let's say we're the coach of an NBA Basketball team; we might find ourselves with the dilemma below.
The New Sensation >> Games Played: 2, Shooting Rate: 60%
The Current Star >> Games Played: 102, Shooting Rate: 49%
My new player seems to be a better shooter...
but she's only played 2 games.
I need more confidence before making a big change! Let's test it!
There are many, many different types of hypothesis test, each of which is appropriate for
different types of data and/or different scenarios and comparisons.
Here we will cover several commonly used tests...
• One Sample T-Test
• Independent Samples T-Test
• Paired Samples T-Test
• Chi-Squared Test of Independence
A One Sample T-Test looks to assess the difference between the mean of a sample and the mean of
the entire population from which that sample is drawn.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Is the average height (cm) of my team higher than that of the entire NBA?
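A minimal sketch of the test statistic behind a One Sample T-Test, using only the standard library. The player heights and the NBA-wide mean of 199 cm are made-up values for illustration, and the function name `one_sample_t` is hypothetical. Turning the statistic into a p-value additionally requires the t-distribution with the returned degrees of freedom, which is not shown here.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    # Standard error of the sample mean, based on the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - population_mean) / standard_error
    return t, n - 1


team_heights_cm = [198, 201, 203, 206, 208]  # made-up sample
NBA_MEAN_CM = 199  # assumed population mean, for illustration only

t_stat, df = one_sample_t(team_heights_cm, NBA_MEAN_CM)
```

A positive statistic means the team's mean height is above the assumed population mean; how large it must be depends on the Acceptance Criteria and the degrees of freedom.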
An Independent Samples T-Test looks to assess the difference between the mean of one sample and the mean of another, independent sample.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Is the average height (cm) of my team higher than that of our rival team?
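The comparison of two teams can be sketched with the test statistic below. It assumes Welch's form of the test, which does not require equal variances in both samples; the slides do not say which variant is meant. The heights and the name `welch_t` are made up for illustration.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    # Squared standard errors of the two sample means.
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    difference = statistics.mean(sample_a) - statistics.mean(sample_b)
    return difference / math.sqrt(se2_a + se2_b)


my_team_cm = [198, 201, 203, 206, 208]     # made-up sample
rival_team_cm = [196, 199, 202, 204, 205]  # made-up sample

t_stat_teams = welch_t(my_team_cm, rival_team_cm)
```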
A Paired Samples T-Test looks to assess the difference between the mean of a sample and the mean of
that same sample at another point in time.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Has the average jumping height (cm) for my team increased after our 4-week fitness programme?
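Because both measurements come from the same players, a Paired Samples T-Test works on the per-player differences. The sketch below uses made-up jumping heights and the hypothetical name `paired_t`.

```python
import math
import statistics


def paired_t(before, after):
    """Return the t statistic and degrees of freedom for paired measurements."""
    if len(before) != len(after):
        raise ValueError("both measurements must cover the same players")
    # One difference per player: positive means an improvement.
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1


jump_before_cm = [50, 52, 54, 56]  # made-up values before the programme
jump_after_cm = [52, 55, 55, 60]   # made-up values after the programme

t_stat_jump, df_jump = paired_t(jump_before_cm, jump_after_cm)
```

This is the one-sample statistic applied to the differences, with a hypothesised mean difference of zero.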
The Chi-Square Test for Independence is used to determine if there is a significant relationship
between two categorical variables.
It compares the observed frequencies with the frequencies expected if the two variables were
independent to determine if there is a dependence between them.
If we were the coach of an NBA Basketball team, this might help us with the following question...
Is the 3-point shooting % of my newly signed player different from that of my current star player?
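A sketch of the chi-square statistic for a table of observed counts, without any continuity correction. The shot counts and the name `chi_square_statistic` are made up for illustration: rows are the two players, columns are made and missed 3-point shots.

```python
def chi_square_statistic(observed):
    """Return the chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Frequency expected if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Made-up counts: [made, missed] for the new player and the current star
shots = [[12, 8], [49, 51]]
chi2_shots, df_shots = chi_square_statistic(shots)
```

The larger the statistic, the further the observed counts are from what independence would predict.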
The outcome of a hypothesis test (in other words, whether or not we reject our initial
assumption) is determined by our Acceptance Criteria.
This Acceptance Criteria is often based upon a p-value.
So, let's talk about what a p-value is, and what a p-value is not!
A p-value essentially helps us assess whether the results of some finding or test we have
conducted are likely to be ordinary or likely to be strange.
• In other words, if there were really no effect, would we expect to see results like ours often
if we ran many more tests, or would results like ours be something of a rarity? With an Acceptance
Criteria of 0.05, we are essentially saying that results at least as extreme as ours should occur
by chance alone no more than 5% of the time.
• A p-value is not the probability that a hypothesis is true; it is the probability of seeing a
result at least as extreme as ours if there were really no effect and we were to sample many times.
• A p-value does not tell us how different two samples are. Two samples with the
same difference in means, but larger/smaller sample sizes, will get different p-values. A p-value
instead tells us how surprising the observed difference would be if there were truly no difference
(in other words, how strong the evidence is that they are different).
It is important to lay down our hypotheses/assumptions and a required level of confidence before
running the test itself.
Generally, we specify 3 things: the Null Hypothesis, the Alternate Hypothesis, and our Acceptance
Criteria...
• Null Hypothesis (H0): The Null Hypothesis is where we state our initial viewpoint. In statistics, and
specifically hypothesis testing, our initial viewpoint is always that the result is purely by chance or that
there is no relationship or association between two outcomes or groups.
• Alternate Hypothesis (HA): The Alternate Hypothesis is essentially the opposite viewpoint to the Null
Hypothesis: that the result is not by chance, or that there is a relationship between two outcomes or
groups.
• Acceptance Criteria: The Acceptance Criteria is the threshold we set that will determine if
there is enough evidence to reject the Null Hypothesis. This is often set to a p-value of 0.05 but it
does not have to be. If we want more certainty, we can set this to a lower value.
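The three ingredients can be tied together in a small worked example. It assumes a coin-flip scenario that is not in the slides: the Null Hypothesis is that the coin is fair, the Alternate Hypothesis is that it favours heads, and we observe 8 heads in 10 flips. The names `binomial_p_value` and `ACCEPTANCE_CRITERIA` are hypothetical.

```python
import math


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: probability of at least this many successes if the Null Hypothesis is true."""
    return sum(
        math.comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


ACCEPTANCE_CRITERIA = 0.05  # threshold fixed before looking at the data

# Assumed observation: 8 heads in 10 flips
p_value = binomial_p_value(successes=8, trials=10)

# Reject the Null Hypothesis only if the result would be rare under it.
reject_null = p_value <= ACCEPTANCE_CRITERIA
```

Here 56 of the 1,024 equally likely flip sequences contain 8 or more heads, so the p-value is slightly above 0.05 and the Null Hypothesis is not rejected; 9 heads out of 10 would fall below the threshold.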
03-Data-Analysis-Final.pdf
03-Data-Analysis-Final.pdf

  • 1. Dr. D. Sugumar Associate Professor/ECE
  • 2. β€’ Overview β€’ Foundational Concepts β€’ Summary
  • 3. β€’ π‘‘π‘š 𝑦 ordered sets of size π‘š of cost values for 𝑦 ∈ π‘Œ β€’ 𝑓: 𝑋 β†’ π‘Œ a function that is optimized β€’ 𝑃(π‘‘π‘š 𝑦 |𝑓, π‘š, π‘Ž) the conditional probability of getting π‘‘π‘š 𝑦 by π‘š times running algorithm π‘Ž on the function 𝑓 Theorem: For any pair of algorithms π‘Ž1 and π‘Ž2 σ𝑓 𝑃 π‘‘π‘š 𝑦 𝑓, π‘š, π‘Ž1 = σ𝑓 𝑃(π‘‘π‘š 𝑦 |𝑓, π‘š, π‘Ž2) All algorithms are equal
  • 4. β€œif an algorithm does particularly well on average for one class of problems then it must do worse on average over the remaining problems” David H. Wolpert and William G. Macready: No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation, 1(1):67-82
  • 5. → There is no “one way” to do data analysis • But there are some standard techniques that often perform well • Many factors influence which techniques are suitable: data, problem to be solved, available resources, … → A broad portfolio of data analysis techniques is required
  • 6. Category | Techniques covered | Problem to be solved
       Association Rules | Apriori | Relationships between items
       Clustering | K-Means Clustering, DBSCAN | Grouping of similar items, identification of structures
       Classification | K-nearest Neighbor, Decision Trees, Random Forests, Logistic Regression, Naive Bayes, Support Vector Machines, Neural Networks | Assignment of labels to objects
       Regression | Linear Regression, Ridge, Lasso | Relationship between outcome and inputs
       Time Series Analysis | ARMA | Identification of temporal structures, forecasting of temporal processes
       Text Mining | Bag-of-Words, Stemming/Lemmatization, TF-IDF | Analysis of textual data
  • 7. • Overview • Foundational Concepts • Summary
  • 8. • Definition after Tom M. Mitchell [2]: a computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$. • Relation to the data analysis techniques: • Experience $E$: our data • Task $T$: clustering/association mining/classification/… • Performance measure $P$: depends on the task
  • 9. • How would you describe this picture with general concepts? • Blue background • Has a fin • Oval body • Black top, white bottom
  • 10. • $O$ is the object space • $\phi$ is the feature map, $\phi: O \to \mathcal{F}$ • $\mathcal{F}$ is the feature space, $\mathcal{F} = \{\phi(o) : o \in O\}$ • Example: a five-dimensional feature space with the dimensions hasFin, shape, colorTop, colorBottom, background • $\phi(\text{"whalepicture"}) = (\text{true}, \text{oval}, \text{black}, \text{white}, \text{blue})$, i.e. hasFin = true, shape = oval, colorTop = black, colorBottom = white, background = blue
  • 11. • Stevens' levels of measurement
        Scale | Property | Allowed operations | Example
        Nominal | Classification or membership | =, ≠ | Color as “black”, “white” and “blue”
        Ordinal | Comparison or levels | =, ≠, >, < | Size as “small”, “medium”, and “large”
        Interval | Differences or affinities | =, ≠, >, <, +, − | Dates, temperatures, discrete numeric values
        Ratio | Magnitudes or amounts | =, ≠, >, <, +, −, ⋅, / | Size in cm, duration in seconds, continuous numeric values
        The nominal and ordinal scales are categorical.
  • 12. • Many algorithms can only work with numeric features • Encode categorical features as binary numeric features • Example: $x \in \{\text{small}, \text{medium}, \text{large}\}$ • Encode as three variables $x_{small}$, $x_{medium}$, $x_{large}$ • $x_{small} = \begin{cases} 1 & \text{if } x = \text{small} \\ 0 & \text{otherwise} \end{cases}$, … • Can also use one variable less; the remaining case is then encoded by all zeros • This is called one-hot encoding
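The encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `one_hot` and the `drop_last` flag are assumptions introduced here, and `drop_last=True` corresponds to the "one variable less" variant where the last category is encoded by all zeros.

```python
def one_hot(value, categories, drop_last=False):
    """Encode `value` as a list of 0/1 indicators, one per category.

    `one_hot` and `drop_last` are hypothetical names used for illustration.
    With drop_last=True the last category gets no indicator of its own and
    is represented by all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    used = categories[:-1] if drop_last else categories
    return [1 if value == category else 0 for category in used]


SIZES = ["small", "medium", "large"]

print(one_hot("medium", SIZES))                 # three indicator variables
print(one_hot("large", SIZES, drop_last=True))  # remaining case: all zeros
```

Dropping one variable avoids redundant columns, since the last indicator can always be derived from the others.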
  • 13. • Instances of objects are described by their features • Supervised learning if the value of interest is known → classification, regression • Otherwise unsupervised learning → clustering, association rule mining
        hasFin | shape | colorTop | colorBottom | background | value of interest
        true | oval | black | black | blue | whale
        false | rectangle | brown | brown | green | bear
        … | … | … | … | … | …
        The columns hasFin to background are the features $\phi(o)$.
  • 14. • Test data: data for the evaluation of analysis results • Same distribution as training data • Training data ≠ test data • Evaluate generalization • Avoid overfitting, i.e. analysis results that are only valid on the training data and are different or do not work on unseen data • Test data is often difficult to obtain • And where do I get the test data?
  • 15. • Hold-out data: data not used for training at all • Commonly used hold-out data sizes: 50% of all data, 33% of all data, or 25% of all data in case a validation set is used • Example: nine months of customer transactions are available; the first six months are used as training data and the last three months as test data • Depends a lot on the available data!
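The customer-transaction example can be sketched as a time-based hold-out split. This is only an illustration under stated assumptions: the record layout (a dict with a `month` field numbered 1 to 9), the `amount` values, and the function name `holdout_split_by_month` are all invented here and do not come from the slides.

```python
def holdout_split_by_month(records, last_training_month):
    """Split records into training and test data by month.

    Hypothetical helper: records up to and including `last_training_month`
    are used for training, all later records are held out for testing.
    """
    train = [r for r in records if r["month"] <= last_training_month]
    test = [r for r in records if r["month"] > last_training_month]
    return train, test


# Made-up data: one transaction per month for nine months.
transactions = [{"month": m, "amount": 10.0 * m} for m in range(1, 10)]

train, test = holdout_split_by_month(transactions, last_training_month=6)
print(len(train), "training records,", len(test), "test records")
```

Splitting by time rather than at random keeps the test data strictly later than the training data, which matches how the results would be used on future transactions.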
  • 16. • Create $k$ partitions of the available data • In each iteration, use one partition for testing and all others for training • Estimate performance by averaging over the iterations • [Figure: the available data is split into Partitions 1 to 5; in Iteration $i$ (Iterations 1 to 5), Partition $i$ is the test data and the remaining four partitions are the training data]
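The procedure on this slide is commonly called k-fold cross-validation. A minimal sketch follows, assuming contiguous partitions over item indices; the names `k_fold_indices` and `cross_validate`, and the `evaluate` callback, are hypothetical and stand in for whatever training and scoring routine is actually used.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    # Spread any remainder over the first partitions so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_items, k, evaluate):
    """Average the score returned by `evaluate(train, test)` over all iterations."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)


for iteration, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Iteration {iteration}: test={test} train={train}")
```

In practice the data is often shuffled before partitioning; that step is left out here to keep the sketch short.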
  • 17. • Overview • Foundational Concepts • Summary
  • 18. • No generic algorithm for all problems • Objects are described by features • Features are used for learning about objects • Data is usually split into different sets for different purposes
  • 52. A normal distribution shows the probability density for a population of continuous data (for example, height in cm for all NBA players). In other words, it shows how likely it is that any player from the NBA is of a certain height. Most players are around the mean (average) height; fewer are much taller or much shorter. A normal distribution is symmetrical on both sides of the mean. You might also see this referred to as a Gaussian distribution!
  • 53. The spread of the values in our population is measured using a metric called the standard deviation. In addition, we use the Empirical Rule, which tells us that, for a normal distribution... • approximately 68% of the values will fall within 1 standard deviation above or below the mean • approximately 95% of the values will fall within 2 standard deviations above or below the mean • approximately 99.7% of the values will fall within 3 standard deviations above or below the mean
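The three percentages of the Empirical Rule can be checked numerically with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is an assumption made for this sketch, and the mean and standard deviation used in the second call are made-up height values for illustration only.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, stdev=1.0):
    """Share of a normal population lying within k standard deviations of the mean."""
    dist = NormalDist(mean, stdev)
    return dist.cdf(mean + k * stdev) - dist.cdf(mean - k * stdev)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")

# The shares do not depend on the particular mean or standard deviation,
# e.g. hypothetical heights with mean 200 cm and standard deviation 8 cm:
print(f"{share_within(2, mean=200.0, stdev=8.0):.1%}")
```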
  • 55. Just like a normal distribution, a t-distribution is symmetrical around the mean, and its breadth depends on the variation within the data. While a normal distribution works with a population, a t-distribution is designed for situations where the sample size is small. The shape of the t-distribution becomes broader as the sample size decreases, to take into account the extra uncertainty we are faced with. The shape of a t-distribution depends on the number of degrees of freedom, which is calculated as the sample size minus one. As the sample size, and thus the degrees of freedom, gets larger, the t-distribution tends towards a normal distribution, because with a larger sample we're more certain when estimating the true population statistics.
  • 57. A Binomial Distribution can end up looking a lot like the shape of a normal distribution. The main difference is that, instead of describing continuous data, it describes how often one of two possible discrete outcomes occurs in a fixed number of trials, for example the results from flipping a coin. Imagine flipping a coin 10 times and, from those 10 flips, noting down how many were "Heads". It could be any number between 0 and 10. Now imagine repeating that task 1,000 times... If the coin we are using is indeed fair (not biased to heads or tails), then the distribution of outcomes should start to look like the plot above. In most cases we get 4, 5, or 6 "heads" from each set of 10 flips, and more extreme results are much rarer!
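The coin-flip example can be made concrete by computing the binomial probabilities directly. A minimal sketch, assuming a fair coin (p = 0.5) and 10 flips as in the text; the function name `binomial_pmf` is introduced here for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # 10 flips of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for heads, probability in enumerate(pmf):
    print(f"{heads:2d} heads: {probability:.4f}")

# Probability of getting 4, 5, or 6 heads in one set of 10 flips.
middle = sum(pmf[4:7])
print(f"P(4 to 6 heads) = {middle:.3f}")
```

Under these assumptions 4, 5, or 6 heads together account for roughly two thirds of all sets of 10 flips, while 0 or 10 heads each occur about once in a thousand sets.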
  • 58. The Bernoulli Distribution is a special case of the Binomial Distribution with a single trial. It considers only two possible outcomes: success or failure, true or false. It's a really simple distribution, but worth knowing! In the example below we're looking at the probability of rolling a 6 with a standard die. If we roll a die many, many times, we should roll a 6 about 1 out of every 6 times (16.7%), and thus not roll a 6, in other words roll a 1, 2, 3, 4 or 5, about 5 times out of 6 (83.3%)!
  • 59. A Uniform Distribution is a distribution in which all events are equally likely to occur. Below, we're looking at the results from rolling a die many, many times. We're looking at which number we got on each roll and tallying these up. If we roll the die enough times (and the die is fair), we should end up with a distribution that is very close to uniform, where the chance of getting any outcome is the same.
  • 60. A Poisson Distribution is a discrete distribution similar to the Binomial Distribution (in that we're plotting the probability of whole-numbered outcomes). Unlike the symmetrical distributions we have seen, however, this one is not symmetrical; it is instead bounded between 0 and infinity. The Poisson distribution describes the number of events or outcomes that occur during some fixed interval. Most commonly this is a time interval, as in our example below, where we are plotting the distribution of sales per hour in a shop.
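The sales-per-hour example can be sketched with the Poisson probability mass function. The average rate of 3 sales per hour is a made-up value, since the slide does not state one, and `poisson_pmf` is a name introduced for this illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval with an average of `rate` events."""
    return rate**k * exp(-rate) / factorial(k)


SALES_PER_HOUR = 3.0  # hypothetical average rate, not taken from the slides

for sales in range(9):
    print(f"{sales} sales in an hour: {poisson_pmf(sales, SALES_PER_HOUR):.4f}")
```

The probabilities start at 0 sales, peak near the average rate, and tail off slowly towards larger counts, which is the asymmetry described above.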
  • 65. In Statistics, a Hypothesis Test is used to assess and understand the plausibility, or likelihood, of some assumed viewpoint (a hypothesis), based upon data. In other words, we are using Statistics to test or investigate ideas. Let's say we're the coach of an NBA Basketball team. We might find ourselves with the dilemma below. • The New Sensation: Games Played: 2, Shooting Rate: 60% • The Current Star: Games Played: 102, Shooting Rate: 49% • My new player seems to be a better shooter... but she's only played 2 games. I need more confidence before making a big change! Let's test it!
  • 66. There are many, many different types of hypothesis test, each of which is appropriate for different types of data and/or for dealing with different scenarios and comparisons. Here we will cover several commonly used tests... • One Sample T-Test • Independent Samples T-Test • Paired Samples T-Test • Chi-Squared Test of Independence
  • 67. A One Sample T-Test looks to assess differences between the mean of a sample and the mean of the entire population from which that sample is drawn. If we were the coach of an NBA Basketball team, this might help us with the following question... Is the average height (cm) of my team higher than that of the entire NBA?
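The test statistic behind a One Sample T-Test can be sketched with the standard library alone. The heights and the league-wide mean below are invented numbers for illustration, and `one_sample_t` is a hypothetical helper name. The sketch only computes the t statistic and the degrees of freedom; turning these into a p-value needs the t-distribution, for example from a table or a statistics library such as SciPy.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, population_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev() is the sample standard deviation
    t = (mean(sample) - population_mean) / standard_error
    return t, n - 1


# Made-up data: heights (cm) of five players and a hypothetical NBA-wide mean.
team_heights = [198, 201, 204, 207, 210]
nba_mean_height = 200

t_statistic, degrees_of_freedom = one_sample_t(team_heights, nba_mean_height)
print(f"t = {t_statistic:.3f} with {degrees_of_freedom} degrees of freedom")
```

A larger absolute t statistic means the sample mean lies further from the population mean relative to the uncertainty in the sample.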
  • 68. An Independent Samples T-Test looks to assess differences between two independent samples. If we were the coach of an NBA Basketball team, this might help us with the following question... Is the average height (cm) of my team higher than that of our rival team?
  • 69. A Paired Samples T-Test looks to assess differences between a sample and that same sample at another point in time. If we were the coach of an NBA Basketball team, this might help us with the following question... Has the average jumping height (cm) for my team increased after our 4-week fitness programme?
  • 70. The Chi-Square Test for Independence is used to determine if there is a significant relationship between two categorical variables. It compares the actual and expected frequencies to determine if there is a dependence between the two variables. If we were the coach of an NBA Basketball team, this might help us with the following question... Is the 3-point shooting % of my newly signed player different from that of my current star player?
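The comparison of actual and expected frequencies can be sketched as follows. The shot counts are made up for illustration, `chi_square_statistic` is a hypothetical helper name, and the sketch assumes every row and column total is greater than zero. It computes the plain Pearson chi-square statistic without any continuity correction, which library routines may apply by default for 2x2 tables.

```python
def chi_square_statistic(observed):
    """Return the Pearson chi-square statistic and degrees of freedom for a contingency table.

    `observed` is a list of rows with observed counts. Assumes all row and
    column totals are positive.
    """
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, actual in enumerate(row):
            # Frequency expected if the two variables were independent.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (actual - expected) ** 2 / expected

    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, degrees_of_freedom


# Made-up 3-point attempts: rows are players, columns are (made, missed).
shots = [
    [10, 20],  # newly signed player
    [20, 10],  # current star player
]

statistic, dof = chi_square_statistic(shots)
print(f"chi-square = {statistic:.3f} with {dof} degree(s) of freedom")
```

A statistic of zero means the observed counts match the expected counts exactly; larger values indicate a stronger departure from independence.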
  • 71. The outcome of a hypothesis test (in other words, whether or not we reject our initial assumption) is influenced by our Acceptance Criteria. This Acceptance Criteria is often based upon a p-value. So, let's talk about what a p-value is, and what a p-value is not! A p-value essentially helps us assess whether the results of some finding or test we have conducted are likely to be ordinary, or likely to be strange, if our initial assumption were true.
  • 72. • In other words: if our initial assumption were true, would results like the ones we've got be common, or would they be something of a rarity? With an Acceptance Criteria of 0.05, we are essentially saying that we will only reject our initial assumption if results at least as extreme as ours would happen by chance 5% of the time or less. • A p-value is not the probability that our initial assumption is true; it is the probability of seeing a result at least as extreme as ours if that assumption were true and we were to sample many times. • A p-value does not tell us how different two samples are. Two pairs of samples with the same difference in means, but different sample sizes, will get different p-values. A p-value instead tells us how strong the evidence is that they are different.
  • 73. It is important to lay down our hypotheses/assumptions and a required level of confidence before running the test itself. Generally, we specify 3 things: the Null Hypothesis, the Alternate Hypothesis, and our Acceptance Criteria... • Null Hypothesis (H0): The Null Hypothesis is where we state our initial viewpoint. In statistics, and specifically hypothesis testing, our initial viewpoint is always that the result is purely by chance or that there is no relationship or association between two outcomes or groups. • Alternate Hypothesis (HA): The Alternate Hypothesis is essentially the opposite viewpoint to the Null Hypothesis: that the result is not by chance, or that there is a relationship between two outcomes or groups. • Acceptance Criteria: The Acceptance Criteria is the threshold we set that will determine if there is enough evidence to reject the Null Hypothesis in favour of the Alternate Hypothesis. This is often set to a p-value of 0.05, but it does not have to be. If we want more certainty, we can set this to a lower value.