SlideShare a Scribd company logo
1 of 15
10 Must-Know Statistical Concepts for
Data Scientists
https://oke.io/vzCQ
Statistics is a building block of data science
Data science is an interdisciplinary field. One of the building blocks of
data science is statistics.Without a decent level of statistics knowledge,
it would be highly difficult to understand or interpret the data.
Statistics helps us explain the data. We use statistics to inferresults
about a population based on a sample drawn from that population.
Furthermore,machine learning and statistics have plenty of overlaps.
Long story short, one needs to study and learn statistics and its
concepts to become a data scientist.In this article, I will try to explain
10 fundamental statistical concepts.
https://oke.io/vzCQ
1. Population and sample
Population is all elements in a group. For example, college students in
US is a population that includes all of the college students in US. 25-
year-old people in Europe is a population that includes all of the people
that fits the description.
It is not always feasible or possible to do analysis on population
because we cannot collect all the data of a population. Therefore,we
use samples.
Sample is a subset of a population. For example, 1000 college students
in US is a subset of “college students in US” population.
https://oke.io/vzCQ
2. Normal distribution
Probability distribution is a function that shows the probabilities of the
outcomes of an event or experiment.Consider a feature (i.e. column) in
a dataframe. This feature is a variable and its probability distribution
function shows the likelihoodof the values it can take.
Probability distribution functions are quite useful in predictive
analytics or machine learning. We can make predictions about a
population based on the probability distribution function of a sample
from that population.
Normal (Gaussian) distribution is a probability distribution function
that looks like a bell.
The followingfigure illustrates the shape of a typical normal
distribution curve which was created it by usinga random sample
returned by numpy.random.randn function of NumPy.
A typical normal distribution curve
The peak of the curve indicates the most likely value the variable can
take. As we move away from the peak the probability of the values
decrease.
The followingis a more formal representation of normal distribution.
The percentages indicate the percentage of data that falls in that
region. As we move away from the mean, we start to see more extreme
values with less probability to be observed.
M.W. Toews via Wikipedia
https://oke.io/vzCQ
3. Measures of central tendency
Central tendency is the central (or typical) value of a probability
distribution.The most common measures of central tendency are
mean, median, and mode.
 Mean is the average of the values in series.
 Median is the value in the middle when values are sorted in
ascending or descending order.
 Mode is the value that appears most often.
https://oke.io/vzCQ
4. Variance and standard deviation
Variance is a measure of the variation among values. It is calculated by
adding up squared differences of each value and the mean and then
dividing the sum by the number of samples.
Standard deviation is a measure of how spread out the values are. To
be more specific, it is the square root of variance.
Note: Mean, median, mode, variance, and standard deviation are basic
descriptive statistics that help to explain a variable.
https://oke.io/vzCQ
5. Covariance and correlation
Covariance is a quantitative measure that represents how much the
variations of two variables match each other.To be more specific,
covariance compares two variables in terms of the deviations from
their mean (or expected) value.
The figure below shows some values of the random variables X and Y.
The orange dot represents the mean of these variables. The values
change similarly with respect to the mean value of the variables. Thus,
there is positive covariance between X and Y.
The formulafor covariance of two random variables:
where E is the expected value and µ is the mean.
Note: The covariance of a variable with itself is the variance of that
variable.
Correlation is a normalization of covariance by the standard deviation
of each variable.
where σ is the standard deviation.
This normalization cancels out the units and the correlation value is
always between 0 and 1. Please note that this is the absolute value.In
case of a negative correlation between two variables, the correlation is
between 0 and -1. If we are comparing the relationshipamong three or
more variables, it is better to use correlation because the value ranges
or unit may cause false assumptions.
https://oke.io/vzCQ
6. Central limit theorem
In many fields including natural and social sciences,when the
distribution of a random variable is unknown,normal distribution is
used.
Central limit theorem (CLT) justifies why normal distribution can be
used in such cases.According to the CLT, as we take more samples
from a distribution, the sample averages will tend towards a normal
distribution regardless of the population distribution.
Consider a case that we need to learn the distribution of the heights of
all 20-year-old people in a country.It is almost impossible and, of
course not practical, to collect this data. So, we take samples of 20-
year-old people across the country and calculate the average height of
the people in samples. CLT states that as we take more samples from
the population, samplingdistribution will get close to a normal
distribution.
Why is it so important to have a normal distribution? Normal
distribution is described in terms of mean and standard deviation
which can easily be calculated.And, if we know the mean and standard
deviation of a normal distribution,we can compute pretty much
everythingabout it.
7. P-value
https://oke.io/vzCQ
P-value is a measure of the likelihoodof a value that a random variable
takes. Considerwe have a random variable A and the value x. The p-
value of x is the probability that A takes the value x or any value that
has the same or less chance to be observed. The figure below shows the
probability distribution of A. It is highly likely to observe a value
around 10. As the values get higheror lower, the probabilities decrease.
Probability distribution of A
We have anotherrandom variable B and want to see if B is greater than
A. The average sample means obtained from B is 12.5 . The p value for
12.5 is the green area in the graph below. The green area indicates the
probability of getting 12.5 or a more extreme value (higherthan 12.5 in
our case).
Let’s say the p value is 0.11 but how do we interpret it? A p value of 0.11
means that we are 89% sure of the results.In other words, there is 11%
chance that the results are due to random chance. Similarly,a p value
of 0.05 means that there is 5% chance that the results are due to
random chance.
Note: Lower p values show more certainty in the result.
If the average of sample means from the random variable B turns out
to be 15 which is a more extreme value, the p value will be lowerthan
0.11.
https://oke.io/vzCQ
8. Expected value of random variables
The expected value of a random variable is the weightedaverage of all
possible values of the variable. The weight here means the probability
of the random variable taking a specificvalue.
The expected value is calculateddifferently for discrete and continuous
random variables.
 Discrete random variables take finitely many or countably
infinitely many values.The number of rainy days in a year is a
discrete random variable.
 Continuous random variables take uncountably infinitely
many values.For instance,the time it takes from your home
to the office is a continuous random variable. Depending on
how you measure it (minutes, seconds,nanoseconds,and so
on), it takes uncountably infinitelymany values.
The formulafor the expected value of a discrete random variable is:
The expected value of a continuous random variable is calculatedwith
the same logic but usingdifferent methods.Since continuous random
variables can take uncountably infinitely many values,we cannot talk
about a variable taking a specific value.We rather focus on value
ranges.
In order to calculate the probability of value ranges, probability density
functions (PDF) are used. PDF is a function that specifies the
probability of a random variable taking value within a particular range.
https://oke.io/vzCQ
9. Conditional probability
Probability simply means the likelihood of an event to occur and
always takes a value between 0 and 1 (0 and 1 inclusive).The
probability of event A is denoted as p(A) and calculatedas the number
of the desired outcome divided by the number of all outcomes.For
example, when you roll a die, the probability of getting a number less
than three is 2 / 6. The number of desired outcomes is 2 (1 and 2); the
numberof total outcomes is 6.
Conditional probability is the likelihoodof an event A to occur given
that anotherevent that has a relation with event A has already
occurred.
Suppose that we have 6 blue balls and 4 yellows placed in two boxes as
seen below. I ask you to randomly pick a ball. The probability of getting
a blue ball is 6 / 10 = 0,6. What if I ask you to pick a ball from box A?
The probability of picking a blue ball clearly decreases.The condition
here is to pick from box A which clearly changes the probability of the
event (picking a blue ball). The probability of event A given that event
B has occurred is denoted as p(A|B).
https://oke.io/vzCQ
10. Bayes’ theorem
According to Bayes’ theorem,probability of event A given that event B
has already occurred can be calculatedusing the probabilities of event
A and event B and probability of event B given that A has already
occurred.
Bayes’ theorem is so fundamental and ubiquitous that a field called
“bayesian statistics”exists.In bayesian statistics,the probability of an
event or hypothesis as evidence comes into play. Therefore,prior
probabilities and posterior probabilities differ depending on the
evidence.
Naive bayes algorithm is structuredby combining bayes’ theorem and
some naive assumptions.Naive bayes algorithm assumes that features
are independent of each other and there is no correlation between
features.
Conclusion
We have covered some basic yet fundamental statistical concepts.If
you are working or plan to work in the field of data science,you are
likely to encounterthese concepts.
There is, of course, much more to learn about statistics.Once you
understand the basics, you can steadily build your way up to advanced
topics.
Thank you for reading. Please let me know if you have any feedback.
https://oke.io/vzCQ

More Related Content

Similar to 10 Must-Know Statistical Concepts for Data Scientists.docx

Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data sciencepujashri1975
 
F ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docx
F ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docxF ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docx
F ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docxmecklenburgstrelitzh
 
Normal and standard normal distribution
Normal and standard normal distributionNormal and standard normal distribution
Normal and standard normal distributionAvjinder (Avi) Kaler
 
Topic Learning TeamNumber of Pages 2 (Double Spaced)Num.docx
Topic Learning TeamNumber of Pages 2 (Double Spaced)Num.docxTopic Learning TeamNumber of Pages 2 (Double Spaced)Num.docx
Topic Learning TeamNumber of Pages 2 (Double Spaced)Num.docxAASTHA76
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of datadrasifk
 
Bio-statistics definitions and misconceptions
Bio-statistics definitions and misconceptionsBio-statistics definitions and misconceptions
Bio-statistics definitions and misconceptionsQussai Abbas
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionAntony Raj
 
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docxPage 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docxkarlhennesey
 
Statistical Significance Tests.pptx
Statistical Significance Tests.pptxStatistical Significance Tests.pptx
Statistical Significance Tests.pptxAldofChrist
 
Types of Statistics
Types of Statistics Types of Statistics
Types of Statistics Rupak Roy
 
Lecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxLecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxshakirRahman10
 
5_lectureslides.pptx
5_lectureslides.pptx5_lectureslides.pptx
5_lectureslides.pptxsuchita74
 
In the last column we discussed the use of pooling to get a be
In the last column we discussed the use of pooling to get a beIn the last column we discussed the use of pooling to get a be
In the last column we discussed the use of pooling to get a beMalikPinckney86
 
Math class 8 data handling
Math class 8 data handling Math class 8 data handling
Math class 8 data handling HimakshiKava
 

Similar to 10 Must-Know Statistical Concepts for Data Scientists.docx (20)

Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
 
F ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docx
F ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docxF ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docx
F ProjHOSPITAL INPATIENT P & L20162017Variance Variance Per DC 20.docx
 
2-20-04.ppt
2-20-04.ppt2-20-04.ppt
2-20-04.ppt
 
Normal and standard normal distribution
Normal and standard normal distributionNormal and standard normal distribution
Normal and standard normal distribution
 
Topic Learning TeamNumber of Pages 2 (Double Spaced)Num.docx
Topic Learning TeamNumber of Pages 2 (Double Spaced)Num.docxTopic Learning TeamNumber of Pages 2 (Double Spaced)Num.docx
Topic Learning TeamNumber of Pages 2 (Double Spaced)Num.docx
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of data
 
Machine learning session1
Machine learning   session1Machine learning   session1
Machine learning session1
 
Data science
Data scienceData science
Data science
 
Bio-statistics definitions and misconceptions
Bio-statistics definitions and misconceptionsBio-statistics definitions and misconceptions
Bio-statistics definitions and misconceptions
 
Fonaments d estadistica
Fonaments d estadisticaFonaments d estadistica
Fonaments d estadistica
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docxPage 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
 
Statistical Significance Tests.pptx
Statistical Significance Tests.pptxStatistical Significance Tests.pptx
Statistical Significance Tests.pptx
 
Types of Statistics
Types of Statistics Types of Statistics
Types of Statistics
 
Inorganic CHEMISTRY
Inorganic CHEMISTRYInorganic CHEMISTRY
Inorganic CHEMISTRY
 
Statistics excellent
Statistics excellentStatistics excellent
Statistics excellent
 
Lecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptxLecture 5 Sampling distribution of sample mean.pptx
Lecture 5 Sampling distribution of sample mean.pptx
 
5_lectureslides.pptx
5_lectureslides.pptx5_lectureslides.pptx
5_lectureslides.pptx
 
In the last column we discussed the use of pooling to get a be
In the last column we discussed the use of pooling to get a beIn the last column we discussed the use of pooling to get a be
In the last column we discussed the use of pooling to get a be
 
Math class 8 data handling
Math class 8 data handling Math class 8 data handling
Math class 8 data handling
 

Recently uploaded

Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 

Recently uploaded (20)

Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 

10 Must-Know Statistical Concepts for Data Scientists.docx

  • 1. 10 Must-Know Statistical Concepts for Data Scientists https://oke.io/vzCQ Statistics is a building block of data science Data science is an interdisciplinary field. One of the building blocks of data science is statistics.Without a decent level of statistics knowledge, it would be highly difficult to understand or interpret the data. Statistics helps us explain the data. We use statistics to inferresults about a population based on a sample drawn from that population. Furthermore,machine learning and statistics have plenty of overlaps.
  • 2. Long story short, one needs to study and learn statistics and its concepts to become a data scientist.In this article, I will try to explain 10 fundamental statistical concepts. https://oke.io/vzCQ 1. Population and sample Population is all elements in a group. For example, college students in US is a population that includes all of the college students in US. 25- year-old people in Europe is a population that includes all of the people that fits the description. It is not always feasible or possible to do analysis on population because we cannot collect all the data of a population. Therefore,we use samples. Sample is a subset of a population. For example, 1000 college students in US is a subset of “college students in US” population. https://oke.io/vzCQ 2. Normal distribution Probability distribution is a function that shows the probabilities of the outcomes of an event or experiment.Consider a feature (i.e. column) in a dataframe. This feature is a variable and its probability distribution function shows the likelihoodof the values it can take.
  • 3. Probability distribution functions are quite useful in predictive analytics or machine learning. We can make predictions about a population based on the probability distribution function of a sample from that population. Normal (Gaussian) distribution is a probability distribution function that looks like a bell. The followingfigure illustrates the shape of a typical normal distribution curve which was created it by usinga random sample returned by numpy.random.randn function of NumPy. A typical normal distribution curve The peak of the curve indicates the most likely value the variable can take. As we move away from the peak the probability of the values decrease.
  • 4. The followingis a more formal representation of normal distribution. The percentages indicate the percentage of data that falls in that region. As we move away from the mean, we start to see more extreme values with less probability to be observed. M.W. Toews via Wikipedia https://oke.io/vzCQ 3. Measures of central tendency Central tendency is the central (or typical) value of a probability distribution.The most common measures of central tendency are mean, median, and mode.  Mean is the average of the values in series.  Median is the value in the middle when values are sorted in ascending or descending order.  Mode is the value that appears most often.
  • 5. https://oke.io/vzCQ 4. Variance and standard deviation Variance is a measure of the variation among values. It is calculated by adding up squared differences of each value and the mean and then dividing the sum by the number of samples. Standard deviation is a measure of how spread out the values are. To be more specific, it is the square root of variance. Note: Mean, median, mode, variance, and standard deviation are basic descriptive statistics that help to explain a variable. https://oke.io/vzCQ 5. Covariance and correlation
  • 6. Covariance is a quantitative measure that represents how much the variations of two variables match each other.To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) value. The figure below shows some values of the random variables X and Y. The orange dot represents the mean of these variables. The values change similarly with respect to the mean value of the variables. Thus, there is positive covariance between X and Y. The formulafor covariance of two random variables:
  • 7. where E is the expected value and µ is the mean. Note: The covariance of a variable with itself is the variance of that variable. Correlation is a normalization of covariance by the standard deviation of each variable. where σ is the standard deviation. This normalization cancels out the units and the correlation value is always between 0 and 1. Please note that this is the absolute value.In case of a negative correlation between two variables, the correlation is between 0 and -1. If we are comparing the relationshipamong three or more variables, it is better to use correlation because the value ranges or unit may cause false assumptions. https://oke.io/vzCQ 6. Central limit theorem In many fields including natural and social sciences,when the distribution of a random variable is unknown,normal distribution is used.
  • 8. Central limit theorem (CLT) justifies why normal distribution can be used in such cases.According to the CLT, as we take more samples from a distribution, the sample averages will tend towards a normal distribution regardless of the population distribution. Consider a case that we need to learn the distribution of the heights of all 20-year-old people in a country.It is almost impossible and, of course not practical, to collect this data. So, we take samples of 20- year-old people across the country and calculate the average height of the people in samples. CLT states that as we take more samples from the population, samplingdistribution will get close to a normal distribution. Why is it so important to have a normal distribution? Normal distribution is described in terms of mean and standard deviation which can easily be calculated.And, if we know the mean and standard deviation of a normal distribution,we can compute pretty much everythingabout it. 7. P-value https://oke.io/vzCQ P-value is a measure of the likelihoodof a value that a random variable takes. Considerwe have a random variable A and the value x. The p- value of x is the probability that A takes the value x or any value that has the same or less chance to be observed. The figure below shows the
  • 9. probability distribution of A. It is highly likely to observe a value around 10. As the values get higheror lower, the probabilities decrease. Probability distribution of A We have anotherrandom variable B and want to see if B is greater than A. The average sample means obtained from B is 12.5 . The p value for 12.5 is the green area in the graph below. The green area indicates the probability of getting 12.5 or a more extreme value (higherthan 12.5 in our case).
  • 10. Let’s say the p value is 0.11 but how do we interpret it? A p value of 0.11 means that we are 89% sure of the results.In other words, there is 11% chance that the results are due to random chance. Similarly,a p value of 0.05 means that there is 5% chance that the results are due to random chance. Note: Lower p values show more certainty in the result. If the average of sample means from the random variable B turns out to be 15 which is a more extreme value, the p value will be lowerthan 0.11.
  • 11. https://oke.io/vzCQ 8. Expected value of random variables The expected value of a random variable is the weightedaverage of all possible values of the variable. The weight here means the probability of the random variable taking a specificvalue. The expected value is calculateddifferently for discrete and continuous random variables.  Discrete random variables take finitely many or countably infinitely many values.The number of rainy days in a year is a discrete random variable.  Continuous random variables take uncountably infinitely many values.For instance,the time it takes from your home to the office is a continuous random variable. Depending on
  • 12. how you measure it (minutes, seconds,nanoseconds,and so on), it takes uncountably infinitelymany values. The formulafor the expected value of a discrete random variable is: The expected value of a continuous random variable is calculatedwith the same logic but usingdifferent methods.Since continuous random variables can take uncountably infinitely many values,we cannot talk about a variable taking a specific value.We rather focus on value ranges. In order to calculate the probability of value ranges, probability density functions (PDF) are used. PDF is a function that specifies the probability of a random variable taking value within a particular range.
  • 13. https://oke.io/vzCQ 9. Conditional probability Probability simply means the likelihood of an event to occur and always takes a value between 0 and 1 (0 and 1 inclusive).The probability of event A is denoted as p(A) and calculatedas the number of the desired outcome divided by the number of all outcomes.For example, when you roll a die, the probability of getting a number less than three is 2 / 6. The number of desired outcomes is 2 (1 and 2); the numberof total outcomes is 6. Conditional probability is the likelihoodof an event A to occur given that anotherevent that has a relation with event A has already occurred. Suppose that we have 6 blue balls and 4 yellows placed in two boxes as seen below. I ask you to randomly pick a ball. The probability of getting a blue ball is 6 / 10 = 0,6. What if I ask you to pick a ball from box A?
  • 14. The probability of picking a blue ball clearly decreases.The condition here is to pick from box A which clearly changes the probability of the event (picking a blue ball). The probability of event A given that event B has occurred is denoted as p(A|B). https://oke.io/vzCQ 10. Bayes’ theorem According to Bayes’ theorem,probability of event A given that event B has already occurred can be calculatedusing the probabilities of event A and event B and probability of event B given that A has already occurred. Bayes’ theorem is so fundamental and ubiquitous that a field called “bayesian statistics”exists.In bayesian statistics,the probability of an event or hypothesis as evidence comes into play. Therefore,prior
  • 15. probabilities and posterior probabilities differ depending on the evidence. Naive bayes algorithm is structuredby combining bayes’ theorem and some naive assumptions.Naive bayes algorithm assumes that features are independent of each other and there is no correlation between features. Conclusion We have covered some basic yet fundamental statistical concepts.If you are working or plan to work in the field of data science,you are likely to encounterthese concepts. There is, of course, much more to learn about statistics.Once you understand the basics, you can steadily build your way up to advanced topics. Thank you for reading. Please let me know if you have any feedback. https://oke.io/vzCQ