1.0 Descriptive Statistics
Overview and Descriptive Statistics introduction
• Statistical concepts and methods are critical to
understanding the world around us.
• Statistics is the science of collecting, organizing, presenting,
analyzing, and interpreting data to assist in making more
effective decisions
– It allows us to make informed decisions based upon data in the
presence of uncertainty and variation
• Statistical methods have broad applications
– Evaluating research
– Making predictions
– Understanding variability among components
manufactured by a certain process
Numerical Summaries of Data
• Data summaries and displays are essential to good statistical thinking
• It is useful to describe data features numerically
• Characterizing the location or central tendency in the data is an example of a
numerical summary
• Data are often a sample of observations that have been selected from some
larger population of observations
• The population is the collection of all individuals or objects under
consideration in a statistical study
• Obtaining information on the entire population (a census) is often
impractical
• Instead, we use a sample, or subset (part), of the population to obtain
information
• A parameter is a descriptive measure for a population
• e.g. population mean, µ, population standard deviation, σ
• A statistic is a descriptive measure for a sample
• e.g. About 33% of people in our sample were between 30 and 40 years old.
• Used to estimate population parameters
– e.g. sample mean, x̄, sample standard deviation, s
Practice:
1. Determine whether each group is best described as a sample or a population:
a) The participants in a study of a new cholesterol drug.
b) All the people living in Doha.
2. Determine whether each situation is a sample or a census:
a) The government of Qatar collects information from all residents about
their income.
b) The government of Qatar asks 1000 residents of Doha where they like to
shop.
c) Data about the effectiveness of a new vaccine that has been given to
volunteers.
d) The types of cars driven by all residents of Doha.
Measures of Center (Mean, Median, Mode)
- Sample mean
The location or central tendency in the data can be characterized by the
arithmetic average, or sample mean. It is calculated by adding up all values in the
data set, then dividing the sum by the number of values in the data set.
Mathematically, the formula is written as
\[ \bar{X} = \frac{\sum X}{n} \]
\(\bar{X}\) This is read as X bar. Sample mean.
\(n\) Sample size.
\(X\) Random variable to denote the items in a sample.
- Population Mean:
\[ \mu = \frac{\sum X}{N} \]
\(\mu\) Greek letter mu, always used to denote the population mean.
\(N\) Population size. Capital N is used for population size.
\(X\) Random variable to denote the items in a population.
➢A different symbol is used for a measure when the data come from a sample (X bar) rather than a population (mu).
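As a minimal sketch of the two formulas above: the arithmetic is identical for a sample and a population, and only the symbols differ. The function name `arithmetic_mean` and the data values are hypothetical, chosen purely for illustration.

```python
def arithmetic_mean(values):
    """Sum of the values divided by how many there are (X-bar or mu)."""
    return sum(values) / len(values)

# Hypothetical data, chosen only for illustration.
sample = [4, 8, 15, 16, 23, 42]

x_bar = arithmetic_mean(sample)
print(x_bar)  # 18.0
```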
Example 1:
- The Median
➢It is the middle value once the data have been sorted into ascending or descending
order.
➢If we have an odd number of data points, it is the middle value.
3, 5, 8, 12, 17, 18, 19
➢If we have an even number of data points the median is the mean of the middle
two points. Therefore, the median does not have to be one of the data points.
12, 8, 5, 3, 21, 18, 19, 17
i. We first have to order the numbers
ii. 3, 5, 8, 12, 17, 18, 19, 21
iii. The median is \( \frac{12 + 17}{2} = 14.5 \)
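The odd and even cases above can be sketched in a few lines. The function name `median` is an illustrative choice; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [3, 5, 8, 12, 17, 18, 19]
even_data = [12, 8, 5, 3, 21, 18, 19, 17]

print(median(odd_data))   # 12
print(median(even_data))  # 14.5
```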
The Mode:
➢This is the value in the data set with the greatest frequency. It is possible to have
more than one mode in a data set.
Example: Consider the following data set. What is the mode?
12, 13, 14, 15, 12, 4, 2, 6, 14, 13, 15, 12, 4, 12
It is helpful to sort the list first: 2 4 4 6 12 12 12 12 13 13 14 14 15 15
The mode is 12.
Example: Consider the following data set. What is the mode?
12, 13, 14, 15, 12, 4, 2, 15, 6, 14, 13, 15, 12, 4, 12, 15
It is helpful to sort the list first: 2 4 4 6 12 12 12 12 13 13 14 14 15 15 15 15
The modes are 12 and 15.
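The two mode examples can be checked with a short counting sketch. The helper name `modes` is hypothetical; it returns every value tied for the greatest frequency, so a data set with two modes yields both.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the greatest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

first = [12, 13, 14, 15, 12, 4, 2, 6, 14, 13, 15, 12, 4, 12]
second = [12, 13, 14, 15, 12, 4, 2, 15, 6, 14, 13, 15, 12, 4, 12, 15]

print(modes(first))   # [12]
print(modes(second))  # [12, 15]
```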
Roles of the mean, median and mode
➢The mean, median and mode are each useful, depending on the purpose. Choosing
the best measure of central tendency depends on the type of data
you have.
Examples:
1) Suppose the annual salaries of a sample of accountants are $62,900, $61,600, $62,500,
$60,800, and $1,200,000.
• The mean salary is $289,560. All accountants have an income between
$60,000 and $63,000 except for the last, with a $1.2 million salary. This one salary
pulls the mean far above the other four values, so the mean is not a representative
average for this group of workers.
• For data containing one or two very large or very small values, the mean may
not be representative. The center for such data can be better represented by the
median.
$60,800, $61,600, $62,500, $62,900, $1,200,000
The median, $62,500, would be a more representative value for the average salary.
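The salary example can be reproduced directly. This is only a sketch of the arithmetic; the variable names and the cut-off used to drop the extreme salary are illustrative choices.

```python
from statistics import median

salaries = [62_900, 61_600, 62_500, 60_800, 1_200_000]

mean_salary = sum(salaries) / len(salaries)
median_salary = median(salaries)
print(mean_salary)    # 289560.0
print(median_salary)  # 62500

# Dropping the one extreme salary shows how strongly it pulls the mean.
typical = [s for s in salaries if s < 1_000_000]
mean_typical = sum(typical) / len(typical)
print(mean_typical)   # 61950.0
```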
Measures of Variability
• Measures of center only give partial information about a
distribution
• Consider the following three samples:
A 8 9 10 11 12
B 4 8 10 12 16
C 4 7 10 13 16
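A quick sketch shows why the center alone is not enough for these three samples: all share the same mean, yet their spread differs. The sample standard deviation used here is the measure introduced in the next section; the dictionary layout is just an illustrative choice.

```python
from statistics import mean, stdev

samples = {
    "A": [8, 9, 10, 11, 12],
    "B": [4, 8, 10, 12, 16],
    "C": [4, 7, 10, 13, 16],
}

summary = {}
for name, data in samples.items():
    summary[name] = {
        "mean": mean(data),
        "range": max(data) - min(data),
        "s": round(stdev(data), 3),
    }
    print(name, summary[name])
```

Samples B and C have the same range but slightly different standard deviations, which is one reason the range alone can hide differences in variability.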
Standard Deviation and Variance
➢ The standard deviation is the most often used and the most important measure of variability.
➢ It can help us predict how data points are distributed about the mean.
➢ Variance: The Variance is the square of the standard deviation.
➢ There are several notations for standard deviation. We will use
“s” for a sample standard deviation and a lowercase
sigma, “σ”, for a population standard deviation.
POPULATION STANDARD DEVIATION “σ”
➢ The population standard deviation for ungrouped data is the square root of the arithmetic
mean of the squared deviations from the population mean.
\[ \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
SAMPLE STANDARD DEVIATION “s”
➢ The sample standard deviation uses different notation and a slightly different formula.
\[ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
Note the use of X-bar (\(\bar{X}\)) rather than mu, and that the denominator is (n-1) rather than N.
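The two definitions can be written out almost word for word. This is a minimal sketch; the function names are hypothetical, and the data are sample B from the Measures of Variability section. The standard library's `statistics.pstdev` and `statistics.stdev` compute the same quantities.

```python
import math
from statistics import pstdev, stdev

def population_sd(values):
    """sigma: square root of the mean squared deviation from mu (divide by N)."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def sample_sd(values):
    """s: deviations are taken from X-bar and the divisor is n - 1."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))

data = [4, 8, 10, 12, 16]  # sample B from the Measures of Variability section

print(round(population_sd(data), 3))  # 4.0
print(round(sample_sd(data), 3))      # 4.472
```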
* We can use the TI-84 calculator to help us calculate the standard deviation for the data
in Example 1:
The table displays the quantities needed for calculating the sample variance and sample
standard deviation.
The numerator of s² is
**The prior calculation is definitional and tedious. A shortcut can be derived that involves just
two sums, as follows:
We calculate the sample variance and standard deviation in the previous example using the
shortcut method.
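The shortcut formula itself did not survive in this copy of the slides. A commonly used computational form that needs only the sum of x and the sum of x squared is s² = (Σx² − (Σx)²/n) / (n − 1); the sketch below assumes that this is the shortcut meant and checks it against the definition. The function names are hypothetical, and the Example 2 ages are used because the Example 1 data are not available here.

```python
def variance_definition(values):
    """Sample variance from the definition: squared deviations over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def variance_shortcut(values):
    """Assumed shortcut: needs only the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

ages = [34, 32, 19, 22, 24, 32, 25, 23]  # the Example 2 data

print(round(variance_definition(ages), 2))  # 30.55
print(round(variance_shortcut(ages), 2))    # 30.55
```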
Example 2.
A sample of ages was taken from students in an EFL course. The data set was as follows,
34, 32, 19, 22, 24, 32, 25, 23
Find the mean, variance and standard deviation.
x     X̄         x − X̄     (x − X̄)²
34    26.375    7.625     58.141
32    26.375    5.625     31.641
19    26.375    -7.375    54.391
22    26.375    -4.375    19.141
24    26.375    -2.375    5.641
32    26.375    5.625     31.641
25    26.375    -1.375    1.891
23    26.375    -3.375    11.391
SUM = 213.875
Mean:
\[ \bar{X} = \frac{\sum X}{n} = \frac{211}{8} = 26.375 \]
Variance (s²):
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{213.875}{7} \approx 30.55 \]
Standard deviation (s):
\[ s = \sqrt{\frac{213.875}{7}} \approx 5.53 \]
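The hand calculation for Example 2 can be confirmed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor. This is only a check of the arithmetic, not part of the original worked example.

```python
from statistics import mean, variance, stdev

ages = [34, 32, 19, 22, 24, 32, 25, 23]

x_bar = mean(ages)
sum_sq_dev = sum((x - x_bar) ** 2 for x in ages)

print(x_bar)                     # 26.375
print(sum_sq_dev)                # 213.875
print(round(variance(ages), 2))  # 30.55
print(round(stdev(ages), 2))     # 5.53
```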
– s = 0 only if all the observations are the same; otherwise
– s > 0 if there is variability in the data
– s increases with the amount of variation in the data. It can be roughly
interpreted as the “typical” deviation of an observation from the
mean.
• s is not resistant to outliers
• When the sample variance is calculated with the quantity n − 1 in the denominator,
the quantity n − 1 is called the degrees of freedom
• Origin of term:
• There are n deviations from x̄ in the sample
• The sum of the deviations is zero
• n − 1 of the deviations can be freely determined, but the nth deviation is fixed to
maintain the zero sum
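The zero-sum constraint behind the degrees of freedom can be seen numerically. The sketch below uses the Example 2 ages; the variable names are illustrative.

```python
ages = [34, 32, 19, 22, 24, 32, 25, 23]  # the Example 2 data
x_bar = sum(ages) / len(ages)
deviations = [x - x_bar for x in ages]

total = sum(deviations)
print(total)  # 0.0

# Knowing any n - 1 deviations pins down the remaining one.
known = deviations[:-1]
implied_last = -sum(known)
print(implied_last == deviations[-1])  # True
```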
Sample Range:
In addition to the sample variance and sample standard deviation, the sample
range, the difference between the largest and smallest observations, is a useful
measure of variability.
In Example 2: Range = 34 – 19 = 15
Example:
Find the range of the data below: -9, 7, 5, 4, 1, 8, 4, 5, 3, 3
Range =
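A one-line helper is enough to check a range calculation. The function name `sample_range` is a hypothetical choice; the two data sets are the Example 2 ages and the practice data above.

```python
def sample_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

print(sample_range([34, 32, 19, 22, 24, 32, 25, 23]))  # 15
print(sample_range([-9, 7, 5, 4, 1, 8, 4, 5, 3, 3]))   # 17
```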
Frequency Distributions and Histograms:
• A frequency distribution is a compact summary of data
• To construct one, we must divide the range of the data into intervals, which are
usually called class intervals, cells, or bins
• Choosing the number of bins approximately equal to the square root of the
number of observations often works well in practice
• After choosing the number of bins, we choose the class width (interval), which
can be evaluated as follows:
Class width = Range / number of bins
Then we can find the frequency in each class by counting the number of
observations that fall in each class.
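The construction steps above can be sketched as a small function. The name `frequency_distribution`, the decision to round the square root to the nearest integer, and the convention that each bin includes its lower limit (with the last bin also including its upper limit) are assumptions made for illustration.

```python
import math

def frequency_distribution(values, num_bins=None):
    """Return (lower, upper, count) for equal-width bins covering the data.

    Each bin includes its lower limit; the last bin also includes its upper
    limit so that the maximum is counted.
    """
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(len(values))))
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:  # all observations identical: a single degenerate bin
        return [(low, high, len(values))]
    counts = [0] * num_bins
    for x in values:
        index = min(int((x - low) / width), num_bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(num_bins)]

ages = [34, 32, 19, 22, 24, 32, 25, 23]  # the Example 2 data
result = frequency_distribution(ages)
for lower, upper, count in result:
    print(f"{lower:g} to {upper:g}: {count}")
```

With eight observations the square-root rule suggests three bins, each of width 15 / 3 = 5.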
Example 3:
The data below are the compressive strengths in pounds per square inch (psi)
of 80 specimens of a new aluminum-lithium alloy undergoing evaluation as a
possible material for aircraft structural elements.
Because the data set contains 80 observations, we suspect that about eight to
nine bins will provide a satisfactory frequency distribution. The largest and
smallest data values are 245 and 76, respectively, so the bins must cover a range
of at least 245 - 76 = 169. If we want the lower limit for the first bin to begin
slightly below the smallest data value and the upper limit for the last bin to be
slightly above the largest data value, we might start the frequency distribution at
70 and end it at 250.
Class width = Range / number of bins
= 169 / 9 ≈ 18.8 (we can round it up to 20, which gives nine bins from 70 to 250)
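The bin arithmetic for Example 3 can be sketched without the raw data, since only the sample size and the extreme values are needed. The variable names are illustrative; the width of 20 and the starting point of 70 are the rounded choices described above.

```python
import math

n = 80
smallest, largest = 76, 245

num_bins = round(math.sqrt(n))     # sqrt(80) is about 8.94, so 9 bins
data_range = largest - smallest    # 169
raw_width = data_range / num_bins  # about 18.78

# Round the width up to a convenient value and start just below the minimum.
width = 20
start = 70
edges = [start + i * width for i in range(num_bins + 1)]

print(num_bins, data_range, round(raw_width, 2))  # 9 169 18.78
print(edges)  # [70, 90, 110, 130, 150, 170, 190, 210, 230, 250]
```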
The second row of the table contains a relative frequency distribution. The
relative frequencies are found by dividing the observed frequency in each bin
by the total number of observations. The last row of the table expresses the
relative frequencies on a cumulative basis. Frequency distributions are often
easier to interpret than tables of data. For example, it is very easy to see that
most of the specimens have compressive strengths between 130 and 190 psi
and that 97.5 percent of the specimens fall below 230 psi.
Histogram:
• A histogram is a visual display of the frequency distribution
• Provides a visual impression of the shape and distribution of the
measurements and information about the central tendency and scatter
or dispersion in the data
• Discrete variables: the frequencies are the count of the number of
observations for each possible value
• Continuous variables: define bins (ranges/classes) and count the
number of observations that fall in each bin
• Can also plot the relative frequency, the proportion/fraction of times
that a value occurs (or that values fall inside a bin):
relative frequency = frequency / total number of observations
Example: Data on the number of classes taken by each of 50 university students:
1 4 4 5 6 5 5 6 4 5 5 4 2 2 1 3
2 3 1 3 5 3 4 4 3 4 2 5 5 4 3 2
3 4 4 5 2 5 5 6 5 3 6 4 5 5 4 2
5 6
Frequency and relative frequency distribution
Number of Classes   Frequency   Relative Frequency
1                   3           3/50 = 0.06
2                   7           7/50 = 0.14
3                   8           8/50 = 0.16
4                   12          12/50 = 0.24
5                   15          15/50 = 0.30
6                   5           5/50 = 0.10
Total               50          1.00
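The table can be rebuilt from the raw data with a counter. This is a sketch; the cumulative column is an addition for illustration and is not part of the table above.

```python
from collections import Counter

classes = [
    1, 4, 4, 5, 6, 5, 5, 6, 4, 5, 5, 4, 2, 2, 1, 3,
    2, 3, 1, 3, 5, 3, 4, 4, 3, 4, 2, 5, 5, 4, 3, 2,
    3, 4, 4, 5, 2, 5, 5, 6, 5, 3, 6, 4, 5, 5, 4, 2,
    5, 6,
]

counts = Counter(classes)
total = len(classes)

cumulative = 0.0
table = []
for value in sorted(counts):
    relative = counts[value] / total
    cumulative += relative
    table.append((value, counts[value], relative, cumulative))
    print(f"{value}  {counts[value]:>2}  {relative:.2f}  {cumulative:.2f}")
```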
[Figure: histograms of three distribution shapes. Negatively skewed: Median > Mean. Symmetric (bell shape): Median = Mean. Positively skewed: Median < Mean.]
Histograms are the most common method for graphically displaying data and determining
the shape of the distribution and the existence of outliers.
– For symmetric distributions, one half of the distribution is a mirror
image of the other.
– Skewed distributions: Negative/Left-skewed, Positive/Right-skewed
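Comparing the mean with the median gives a rough numerical hint about skewness. The sketch below is only a heuristic and can mislead for small or unusual data sets, so a histogram remains the better check. The function name `rough_shape` and its return strings are hypothetical; the data are the accountant salaries and sample A from earlier sections.

```python
from statistics import mean, median

def rough_shape(values):
    """Compare mean and median as a rough (not definitive) skewness indicator."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive (right) skew suggested"
    if m < med:
        return "negative (left) skew suggested"
    return "roughly symmetric"

salaries = [62_900, 61_600, 62_500, 60_800, 1_200_000]
symmetric = [8, 9, 10, 11, 12]  # sample A

print(rough_shape(salaries))   # positive (right) skew suggested
print(rough_shape(symmetric))  # roughly symmetric
```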
Outlier:
Boxplots
• The box plot is a graphical display that simultaneously describes several
important features of a data set, such as center, spread, departure from
symmetry, and identification of unusual observations or outliers
• Sometimes called box-and-whisker plots
• Displays three quartiles
• A line, or whisker, extends from each end of the box
Description of the Box Plot
• The pth percentile is the number that divides the bottom p% of the
data from the top (100-p)%
• Q1: 25th percentile (lower quartile/fourth)
• Q3: 75th percentile (upper quartile/fourth)
• Median = Q2 → 50th percentile
• Five number summary: Minimum, Q1, Median, Q3, Maximum
Calculation of Q1 and Q3 (by hand)*
• Q1: median of the bottom half of the (ordered) data set
• Q3: median of the top half of the (ordered) data set
• The fourth spread, denoted fs and also called the interquartile range
(IQR), is a resistant measure of spread: IQR = Q3 − Q1
• The IQR can be used to identify observations that may be potential
outliers:
– Mild: more than 1.5 × IQR from closest quartile
– Extreme: more than 3 × IQR from closest quartile
• The boxplot is a graphical representation of the five number
summary.
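The by-hand quartile rule above can be sketched as follows. The slides do not say what to do with the middle value when n is odd; this sketch assumes it is left out of both halves, which is one common convention, and software packages may use other quartile rules. The function name is hypothetical and the data are the Example 2 ages.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule.

    Assumption: when n is odd, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return min(ordered), median(lower), median(ordered), median(upper), max(ordered)

ages = [34, 32, 19, 22, 24, 32, 25, 23]  # the Example 2 data
minimum, q1, q2, q3, maximum = five_number_summary(ages)
iqr = q3 - q1

print(minimum, q1, q2, q3, maximum)  # 19 22.5 24.5 32.0 34
print(iqr)                           # 9.5
```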
The box plot and five number summary for Example 3.
Comparative box plots of a quality index at three plants
• Boxplots can be useful for comparing the distributions of
two (or more) groups.
The graph shows the comparative box plots for a manufacturing quality index
on semiconductor devices at three manufacturing plants. Inspection of this
display reveals that there is too much variability at plant 2 and that plants 2 and
3 need to raise their quality index performance.
Practice:
Measurements of total nitrogen loads from a particular Chesapeake Bay
location
• Five number summary (Min, Q1, Q2, Q3, Max):
9.69, 44.075, 92.17, 175.145, 1529.35
Raw data:
9.69 13.16 17.09 18.12 23.70 24.07 24.29 26.43
30.75 31.54 35.07 36.99 40.32 42.51 45.64 48.22
49.98 50.06 55.02 57.00 58.41 61.31 64.25 65.24
66.14 67.68 81.40 90.80 92.17 92.42 100.82 101.94
103.61 106.28 106.80 108.69 114.61 120.86 124.54 143.27
143.75 149.64 167.79 182.50 192.55 193.53 271.57 292.61
312.45 352.09 371.47 444.68 460.86 563.92 690.11 826.54
1529.35
IQR =
Mild outlier limits:
Upper limit = Q3 + (1.5 × IQR) =
Lower limit = Q1 − (1.5 × IQR) =
Mild outliers:
Extreme outlier limits:
Upper limit = Q3 + (3 × IQR) =
Lower limit = Q1 − (3 × IQR) =
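One way to check an answer to this practice problem is the sketch below. It assumes the median-of-halves rule with the middle value excluded when n is odd, which reproduces the Q1 and Q3 given in the five number summary. The variable names are illustrative.

```python
from statistics import median

nitrogen = [
    9.69, 13.16, 17.09, 18.12, 23.70, 24.07, 24.29, 26.43,
    30.75, 31.54, 35.07, 36.99, 40.32, 42.51, 45.64, 48.22,
    49.98, 50.06, 55.02, 57.00, 58.41, 61.31, 64.25, 65.24,
    66.14, 67.68, 81.40, 90.80, 92.17, 92.42, 100.82, 101.94,
    103.61, 106.28, 106.80, 108.69, 114.61, 120.86, 124.54, 143.27,
    143.75, 149.64, 167.79, 182.50, 192.55, 193.53, 271.57, 292.61,
    312.45, 352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54,
    1529.35,
]

ordered = sorted(nitrogen)
half = len(ordered) // 2  # n = 57, so the median is left out of both halves
q1 = median(ordered[:half])
q3 = median(ordered[half + 1:])
iqr = q3 - q1

mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

extreme = [x for x in ordered if x < extreme_low or x > extreme_high]
mild = [x for x in ordered
        if (x < mild_low or x > mild_high) and x not in extreme]

print(round(iqr, 3), round(mild_high, 3), round(extreme_high, 3))
print(mild)     # [444.68, 460.86, 563.92]
print(extreme)  # [690.11, 826.54, 1529.35]
```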
Time Sequence Plot:
• A time series or time sequence is a data set in which the observations are recorded in
the order in which they occur.
• A time series plot is a graph in which the vertical axis denotes the observed value of
the variable and the horizontal axis denotes time
Scatter Diagrams:
• Multivariate: each observation consists of measurements of several variables
• The scatter diagram is a useful way to graphically display the potential relationship
between one variable (e.g. quality) and one of the other variables
• When two or more variables exist, the matrix of scatter diagrams may be useful in
looking at all of the pairwise relationships between the variables in the sample
• The sample correlation coefficient is a quantitative measure of the strength of the linear
relationship between two random variables x and y
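The slides do not give a formula for the sample correlation coefficient. The sketch below assumes the usual Pearson form, r = S_xy / sqrt(S_xx · S_yy), where each S term is a sum of products of deviations from the means. The function name and the paired data are hypothetical, chosen only for illustration.

```python
import math

def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient r for paired observations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # exact increasing line
y_down = [10, 8, 6, 4, 2]  # exact decreasing line
y_noisy = [2, 1, 4, 3, 5]

print(round(sample_correlation(x, y_up), 3))     # 1.0
print(round(sample_correlation(x, y_down), 3))   # -1.0
print(round(sample_correlation(x, y_noisy), 3))  # 0.8
```

Values of r near +1 or −1 indicate a strong linear relationship, while values near 0 indicate a weak one.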
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf

More Related Content

Similar to 1.0 Descriptive statistics.pdf

Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics Bahzad5
 
Chapter 4 MMW.pdf
Chapter 4 MMW.pdfChapter 4 MMW.pdf
Chapter 4 MMW.pdfRaRaRamirez
 
CABT Math 8 measures of central tendency and dispersion
CABT Math 8   measures of central tendency and dispersionCABT Math 8   measures of central tendency and dispersion
CABT Math 8 measures of central tendency and dispersionGilbert Joseph Abueg
 
1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.ppt1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.pptMuntazirMehdi43
 
Statistics final seminar
Statistics final seminarStatistics final seminar
Statistics final seminarTejas Jagtap
 
2-L2 Presentation of data.pptx
2-L2 Presentation of data.pptx2-L2 Presentation of data.pptx
2-L2 Presentation of data.pptxssuser03ba7c
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdfAmanuelDina
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mininghktripathy
 
Descriptions of data statistics for research
Descriptions of data   statistics for researchDescriptions of data   statistics for research
Descriptions of data statistics for researchHarve Abella
 
Biostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBiostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBatizemaryam
 
Measures of dispersion qt pgdm 1st trisemester
Measures of dispersion qt pgdm 1st trisemester Measures of dispersion qt pgdm 1st trisemester
Measures of dispersion qt pgdm 1st trisemester Karan Kukreja
 
Topic-1-Review-of-Basic-Statistics.pptx
Topic-1-Review-of-Basic-Statistics.pptxTopic-1-Review-of-Basic-Statistics.pptx
Topic-1-Review-of-Basic-Statistics.pptxJohnLester81
 
SAMPLING MEAN DEFINITION The term sampling mean is.docx
SAMPLING MEAN  DEFINITION  The term sampling mean is.docxSAMPLING MEAN  DEFINITION  The term sampling mean is.docx
SAMPLING MEAN DEFINITION The term sampling mean is.docxagnesdcarey33086
 
4. six sigma descriptive statistics
4. six sigma descriptive statistics4. six sigma descriptive statistics
4. six sigma descriptive statisticsHakeem-Ur- Rehman
 
Measures of Dispersion.pptx
Measures of Dispersion.pptxMeasures of Dispersion.pptx
Measures of Dispersion.pptxVanmala Buchke
 
Data Management_new.pptx
Data Management_new.pptxData Management_new.pptx
Data Management_new.pptxDharenOla3
 
Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Maria Perkins
 

Similar to 1.0 Descriptive statistics.pdf (20)

Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
 
Chapter 4 MMW.pdf
Chapter 4 MMW.pdfChapter 4 MMW.pdf
Chapter 4 MMW.pdf
 
CABT Math 8 measures of central tendency and dispersion
CABT Math 8   measures of central tendency and dispersionCABT Math 8   measures of central tendency and dispersion
CABT Math 8 measures of central tendency and dispersion
 
1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.ppt1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.ppt
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
Statistics final seminar
Statistics final seminarStatistics final seminar
Statistics final seminar
 
5.DATA SUMMERISATION.ppt
5.DATA SUMMERISATION.ppt5.DATA SUMMERISATION.ppt
5.DATA SUMMERISATION.ppt
 
2-L2 Presentation of data.pptx
2-L2 Presentation of data.pptx2-L2 Presentation of data.pptx
2-L2 Presentation of data.pptx
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mining
 
Descriptions of data statistics for research
Descriptions of data   statistics for researchDescriptions of data   statistics for research
Descriptions of data statistics for research
 
Biostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBiostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacy
 
Measures of dispersion qt pgdm 1st trisemester
Measures of dispersion qt pgdm 1st trisemester Measures of dispersion qt pgdm 1st trisemester
Measures of dispersion qt pgdm 1st trisemester
 
Topic-1-Review-of-Basic-Statistics.pptx
Topic-1-Review-of-Basic-Statistics.pptxTopic-1-Review-of-Basic-Statistics.pptx
Topic-1-Review-of-Basic-Statistics.pptx
 
SAMPLING MEAN DEFINITION The term sampling mean is.docx
SAMPLING MEAN  DEFINITION  The term sampling mean is.docxSAMPLING MEAN  DEFINITION  The term sampling mean is.docx
SAMPLING MEAN DEFINITION The term sampling mean is.docx
 
4. six sigma descriptive statistics
4. six sigma descriptive statistics4. six sigma descriptive statistics
4. six sigma descriptive statistics
 
Measures of Dispersion.pptx
Measures of Dispersion.pptxMeasures of Dispersion.pptx
Measures of Dispersion.pptx
 
Data Management_new.pptx
Data Management_new.pptxData Management_new.pptx
Data Management_new.pptx
 
Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615
 

Recently uploaded

HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 

Recently uploaded (20)

HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 

1.0 Descriptive statistics.pdf

  • 1.
  • 3. Overview and Descriptive Statistics introduction • Statistical concepts and methods are critical to understanding the world around us. • The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions – Allow us to make informed decisions based upon data in the presence of uncertainty and variation • Statistical methods have broad applications – Evaluating research – Making predictions – Understanding variability among components manufactured by a certain process
  • 4. Numerical Summaries of Data • Data summaries and displays are essential to good statistical thinking • It is useful to describe data features numerically • Characterizing the location or central tendency in the data is an example of a numerical summary • Data are often a sample of observations that have been selected from some larger population of observations • The population is the collection of all individuals or objects under consideration in a statistical study • Obtaining information on entire population (census) often impractical • Instead use a sample or subset (part) of the population to obtain information
  • 5. • parameter is a descriptive measure for a population • e.g. population mean, µ, population standard deviation, σ A statistic is a descriptive measure for a sample • e.g. About 33% of people in our sample were between 30 and 40 years old. • Used to estimate population parameters – e.g. sample mean, x̄, sample standard deviation, s Practice: 1.Determine whether each group is best described as a sample or a population: a)The participants in a study of a new cholesterol drug. b)All the people living in Doha.
  • 6. 2.Determine whether each situation is a sample or a census: a)The government of Qatar collects information from all residents about their income. b)The government of Qatar asks 1000 residents of Doha where they like to shop. c)Data about the effectiveness of a new vaccine that has been given to volunteers. d)The types of cars driven by all residents of Doha. Measures of Center ( Mean , Median, Mode) - Sample mean The location or central tendency in the data can be characterized by the arithmetic average or the sample mean. Calculated by adding up all values in the
  • 7. data set, then dividing it by the number of values that are in the data set. Mathematically, the formula is written as =  X X n X This is read as X bar. Sample mean. n Sample size. X Random variable to denote the items in a sample. - Population Mean: X N  =   Greek letter mu, always used to denote population mean. N Population size. Capital N is used for population size. X Random variable to denote the items in a population. ➢A different symbol is used for measure if the data is from a sample.
  • 8. Example 1: - The Median ➢It is the middle value once the data have been sorted into ascending or descending order. ➢If we have an odd number of data points, it is the middle value. 3, 5, 8, 12, 17, 18, 19
  • 9. ➢If we have an even number of data points the median is the mean of the middle two points. Therefore, the median does not have to be one of the data points. 12, 8, 5, 3, 21, 18, 19, 17 i. We first have to order the numbers ii. 3, 5, 8, 12, 17, 18, 19, 21 iii. The median is 12 17 14.5 2 + = The Mode: ➢This is the value in the data set with the greatest frequency. It is possible to vae mopre than one mode in a data set. Example: Consider the following data sets. What is the mode? 12, 13, 14, 15, 12, 4, 2, 6, 14, 13, 15, 12, 4, 12 It is helpful to sort the list first.:2 4 4 6 12 12 12 12 13 13 14 14 15 15
  • 10. Mode is 12 Example: Consider the following data sets. What is the mode? 12, 13, 14, 15, 12, 4, 2, 15, 6, 14, 13, 15, 12, 4, 12, 15 It is helpful to sort the list first.:2 4 4 6 12 12 12 12 13 13 14 14 15 15 15 15 Mode is 12 15 and Roles of the mean, median and mode ➢The mean, median and mode are useful measures depending on what it is being used for. Choosing the best measure of central tendency depends on the type of data you have. Examples: 1)Suppose the annual salaries of a sample of Accountants $62,900, $61,600, $62,500, $60,800, and $1,200,000.
  • 11. • The mean salary is $289,560. All of the accountants have an income between $60,000 and $63,000 except the last, whose salary is $1.2 million. This one salary pulls the mean upward, so the mean is clearly not a representative average for this group of workers.
     • For data containing one or two very large or very small values, the mean may not be representative. The center of such data is better represented by the median: $60,800, $61,600, $62,500, $62,900, $1,200,000. The median, $62,500, is a more representative value for the average salary.
     Measures of Variability
     • Measures of center give only partial information about a distribution
     • Consider the following three samples:
  • 12. A: 8, 9, 10, 11, 12
       B: 4, 8, 10, 12, 16
       C: 4, 7, 10, 13, 16
     Standard Deviation and Variance
     ➢ The standard deviation is the most often used and the most important measure of variability.
     ➢ It can help us predict how data points are distributed about the mean.
     ➢ Variance: The variance is the square of the standard deviation.
     ➢ There are several abbreviations for standard deviation. We will use “s” for a sample standard deviation and a lowercase sigma “σ” for a population standard deviation.
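A quick numerical check, which is not part of the original slides, suggests why samples A, B and C make a useful illustration: they share the same mean and median but differ in spread. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# The three samples listed on the slide.
samples = {
    "A": [8, 9, 10, 11, 12],
    "B": [4, 8, 10, 12, 16],
    "C": [4, 7, 10, 13, 16],
}

summary = {}
for name, values in samples.items():
    summary[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # statistics.stdev uses the n - 1 denominator (sample standard deviation).
        "s": statistics.stdev(values),
    }
    print(name, summary[name])
```

All three samples are centered at 10, so a measure of center alone cannot tell them apart; the sample standard deviation grows from A to B to C.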
  • 13. POPULATION STANDARD DEVIATION “σ”
     ➢ The population standard deviation for ungrouped data is the square root of the arithmetic mean of the squared deviations from the population mean: \(\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}\)
     SAMPLE STANDARD DEVIATION “s”
     ➢ The sample standard deviation uses different notation and a slightly different formula: \(s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}\)
     Note the use of X-bar (\(\bar{X}\)) rather than mu, and that the denominator is (n - 1) rather than N.
     * We can use the TI-84 calculator to help us calculate the standard deviation for the data in Example 1:
  • 14. The table displays the quantities needed for calculating the sample variance and sample standard deviation. The numerator of \(s^2\) is
     ** The prior calculation is definitional and tedious. A shortcut can be derived that involves just two sums, as follows:
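The shortcut formula itself is missing from this transcript. The standard two-sum form of the sample variance, which is presumably what the slide showed, is given here as an assumption:

```latex
s^2 = \frac{\displaystyle\sum_{i=1}^{n} x_i^2 \;-\; \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}
```

It needs only the sum of the squared observations and the sum of the observations, and it is algebraically identical to the definitional formula because \(\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2\).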
  • 15. We calculate the sample variance and standard deviation in the previous example using the shortcut method.
     Example 2. A sample of ages was taken from students in an EFL course. The data set was as follows: 34, 32, 19, 22, 24, 32, 25, 23
  • 16. Find the mean, variance and standard deviation.
       x     X̄        x − X̄     (x − X̄)²
       34    26.375     7.625    58.141
       32    26.375     5.625    31.641
       19    26.375    -7.375    54.391
       22    26.375    -4.375    19.141
       24    26.375    -2.375     5.641
       32    26.375     5.625    31.641
       25    26.375    -1.375     1.891
       23    26.375    -3.375    11.391
                       SUM =    213.875
     Mean: \(\bar{X} = \frac{\sum X}{n} = \frac{211}{8} = 26.375\)
     Variance (\(s^2\)): \(s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{213.875}{7} = 30.55\)
  • 17. Standard deviation (s): \(s = \sqrt{\frac{213.875}{7}} = 5.53\)
       – s = 0 only if all the observations are the same; otherwise
       – s > 0 if there is variability in the data
       – s increases with the amount of variation in the data. It can be roughly interpreted as the “typical” deviation of an observation from the mean.
     • s is not resistant to outliers
     • When the sample variance is calculated with the quantity \(n - 1\) in the denominator, the quantity \(n - 1\) is called the degrees of freedom
     • Origin of term:
  • 18. • There are \(n\) deviations from \(\bar{x}\) in the sample
     • The sum of the deviations is zero
     • \(n - 1\) of the deviations can be freely determined, but the \(n\)th deviation is fixed to maintain the zero sum
     Sample Range: In addition to the sample variance and sample standard deviation, the sample range (the largest observation minus the smallest) is a useful measure of variability. In Example 2: Range = 34 - 19 = 15
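The Example 2 figures (mean, variance, standard deviation and range) can be checked with a few lines of plain Python. This is a sketch for verification only; the variable names are illustrative, and the two-sum line assumes the standard shortcut form of the variance.

```python
import math

ages = [34, 32, 19, 22, 24, 32, 25, 23]  # Example 2 data
n = len(ages)

mean = sum(ages) / n

# Definitional form: sum of squared deviations over n - 1.
sum_sq_dev = sum((x - mean) ** 2 for x in ages)
variance = sum_sq_dev / (n - 1)
std_dev = math.sqrt(variance)

# Two-sum shortcut: uses only sum(x^2) and sum(x).
shortcut_variance = (sum(x * x for x in ages) - sum(ages) ** 2 / n) / (n - 1)

data_range = max(ages) - min(ages)

print(f"mean = {mean}")
print(f"sum of squared deviations = {sum_sq_dev}")
print(f"variance = {variance:.2f}, shortcut = {shortcut_variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
print(f"range = {data_range}")
```

Both forms of the variance agree, which is the point of the shortcut: it avoids computing each deviation separately.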
  • 19. Example: Find the range of the data below: -9, 7, 5, 4, 1, 8, 4, 5, 3, 3
     Range =
     Frequency Distributions and Histograms:
     • A frequency distribution is a compact summary of data
     • To construct one, we must divide the range of the data into intervals, which are usually called class intervals, cells, or bins
     • Choosing the number of bins approximately equal to the square root of the number of observations often works well in practice
     • After choosing the number of bins, we choose the class width (interval), which can be evaluated as follows: Class width = Range / number of bins
     Then we can find the frequency in each class by counting the number of observations that fall in each class.
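The construction steps above can be sketched as a small function. This is a minimal illustration under stated assumptions, not a prescribed method: the function name `frequency_distribution` is hypothetical, the bins start exactly at the smallest value (the slides later suggest starting slightly below it), and the Example 2 ages are reused purely as sample input.

```python
import math

def frequency_distribution(data, num_bins=None):
    """Return (bins, counts) using equal-width bins.

    num_bins defaults to the square-root rule: about sqrt(n) bins.
    """
    n = len(data)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins  # class width = range / number of bins
    if width == 0:
        # All observations are identical: a single degenerate bin.
        return [(lo, hi)], [n]
    counts = [0] * num_bins
    for x in data:
        # The largest value would land one past the last bin, so clamp it.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    bins = [(lo + i * width, lo + (i + 1) * width) for i in range(num_bins)]
    return bins, counts

ages = [34, 32, 19, 22, 24, 32, 25, 23]
bins, counts = frequency_distribution(ages)
for (left, right), count in zip(bins, counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

With 8 observations the square-root rule gives 3 bins, and the range of 15 gives a class width of 5.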
  • 20. Example 3: The data below are the compressive strengths in pounds per square inch (psi) of 80 specimens of a new aluminum-lithium alloy undergoing evaluation as a possible material for aircraft structural elements. Because the data set contains 80 observations, and \(\sqrt{80} \approx 9\), we suspect that about eight to nine bins will provide a satisfactory frequency distribution. The largest and smallest data values are 245 and 76, respectively, so the bins must cover a range
  • 21. of at least 245 - 76 = 169. If we want the lower limit for the first bin to begin slightly below the smallest data value and the upper limit for the last bin to be slightly above the largest data value, we might start the frequency distribution at 70 and end it at 250. Class width = Range / number of bins = 169 / 9 ≈ 18.8 (which we can round up to 20). The second row of the table contains a relative frequency distribution. The relative frequencies are found by dividing the observed frequency in each bin by the total number of observations. The last row of the table expresses the
  • 22. relative frequencies on a cumulative basis. Frequency distributions are often easier to interpret than tables of data. For example, it is very easy to see that most of the specimens have compressive strengths between 130 and 190 psi and that 97.5 percent of the specimens fall below 230 psi.
     Histogram:
     • A histogram is a visual display of the frequency distribution
     • Provides a visual impression of the shape and distribution of the measurements and information about the central tendency and scatter or dispersion in the data
     • Discrete variables: the frequencies are the count of the number of observations for each possible value
     • Continuous variables: define bins (ranges/classes) and count the number of observations that fall in each bin
     • Can also plot the relative frequency or proportion/fraction of times
  • 23. that a value occurs (or values fall inside a bin): relative frequency = frequency / total number of observations
     Example: Number of classes a university student is taking
     Data:
       1 4 4 5 6 5 5 6 4 5
       5 4 2 2 1 3 2 3 1 3
       5 3 4 4 3 4 2 5 5 4
       3 2 3 4 4 5 2 5 5 6
       5 3 6 4 5 5 4 2 5 6
  • 24. Frequency and relative frequency distribution
       Number of Classes   Frequency   Relative Frequency
       1                   3           3/50 = 0.06
       2                   7           7/50 = 0.14
       3                   8           8/50 = 0.16
       4                   12          12/50 = 0.24
       5                   15          15/50 = 0.30
       6                   5           5/50 = 0.10
       Total               50          1.00
     Shape of a distribution (figure): negatively skewed (median > mean), symmetric bell shape (median = mean), positively skewed (median < mean)
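The frequency table for the number-of-classes example can be reproduced from the raw data with `collections.Counter`. The sketch below is for checking the table only; the variable names are illustrative.

```python
import statistics
from collections import Counter

# Number of classes each of the 50 students is taking (from the slide).
data = [
    1, 4, 4, 5, 6, 5, 5, 6, 4, 5,
    5, 4, 2, 2, 1, 3, 2, 3, 1, 3,
    5, 3, 4, 4, 3, 4, 2, 5, 5, 4,
    3, 2, 3, 4, 4, 5, 2, 5, 5, 6,
    5, 3, 6, 4, 5, 5, 4, 2, 5, 6,
]

n = len(data)
counts = Counter(data)

# One row per distinct value: (value, frequency, relative frequency).
table = [(value, counts[value], counts[value] / n) for value in sorted(counts)]

print("Classes  Frequency  Relative")
for value, freq, rel in table:
    print(f"{value:>7}  {freq:>9}  {rel:>8.2f}")

mean = statistics.mean(data)
median = statistics.median(data)
print(f"mean = {mean}, median = {median}")
```

For these data the median (4) is slightly larger than the mean (3.88), which is the relationship associated with a negatively skewed shape.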
  • 25. Histograms are the most common method for graphically displaying data and determining the shape of the distribution and the existence of outliers.
       – For symmetric distributions, one half of the distribution is a mirror image of the other.
       – Skewed distributions: Negative/Left-skewed, Positive/Right-skewed
     Outlier:
  • 26. Boxplots
     • The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of unusual observations or outliers
     • Sometimes called box-and-whisker plots
     • Displays the three quartiles
     • A line, or whisker, extends from each end of the box
     Description of the Box Plot
  • 27. • The pth percentile is the number that divides the bottom p% of the data from the top (100 - p)%
     • Q1: 25th percentile (lower quartile/fourth)
     • Q3: 75th percentile (upper quartile/fourth)
     • Median = Q2 → 50th percentile
     • Five number summary: Minimum, Q1, Median, Q3, Maximum
     Calculation of Q1 and Q3 (by hand)*
     • Q1: median of the bottom half of the (ordered) data set
     • Q3: median of the top half of the (ordered) data set
  • 28. • The fourth spread, denoted fs, or interquartile range (IQR), is a resistant measure of spread: IQR = Q3 - Q1
     • It can be used to identify observations that may be potential outliers:
       – Mild: more than 1.5 × IQR from the closest quartile
       – Extreme: more than 3 × IQR from the closest quartile
     • The boxplot is a graphical representation of the five number summary.
     The box plot and five number summary for Example 3.
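The by-hand quartile rule and the outlier criteria above can be sketched in code. This is an illustration under stated assumptions: the function names are hypothetical, the data need at least two observations, and when the number of observations is odd the median is left out of both halves (conventions differ on this point, so other software may report slightly different quartiles). The Example 2 ages are reused as sample input.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max using the 'median of each half' rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]        # bottom half (middle value excluded when n is odd)
    upper = xs[n - half:]    # top half
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def classify_outliers(data):
    """Return (iqr, mild, extreme) using the 1.5 x IQR and 3 x IQR rules."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    mild, extreme = [], []
    for x in sorted(data):
        # Distance beyond the closest quartile; negative inside the box.
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return iqr, mild, extreme

ages = [34, 32, 19, 22, 24, 32, 25, 23]
print("five number summary:", five_number_summary(ages))
iqr, mild, extreme = classify_outliers(ages)
print("IQR:", iqr, "mild outliers:", mild, "extreme outliers:", extreme)
```

For the ages, every observation lies within 1.5 × IQR of the box, so no potential outliers are flagged.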
  • 29. Comparative box plots of a quality index at three plants
     • Boxplots can be useful for comparing the distributions of two (or more) groups. The graph shows the comparative box plots for a manufacturing quality index on semiconductor devices at three manufacturing plants. Inspection of this
  • 30. display reveals that there is too much variability at plant 2 and that plants 2 and 3 need to raise their quality index performance.
     Practice: Measurement of total nitrogen loads from a particular Chesapeake Bay location
     Raw data:
         9.69    13.16    17.09    18.12    23.70    24.07    24.29    26.43    30.75    31.54
        35.07    36.99    40.32    42.51    45.64    48.22    49.98    50.06    55.02    57.00
        58.41    61.31    64.25    65.24    66.14    67.68    81.40    90.80    92.17    92.42
       100.82   101.94   103.61   106.28   106.80   108.69   114.61   120.86   124.54   143.27
       143.75   149.64   167.79   182.50   192.55   193.53   271.57   292.61   312.45   352.09
       371.47   444.68   460.86   563.92   690.11   826.54  1529.35
     • Five number summary: Min, Q1, Q2, Q3, Max
     • 9.69, 44.075, 92.17, 175.145, 1529.35
  • 31. IQR =
     Mild outlier limits:
       Upper limit = Q3 + (1.5 × IQR) =
       Lower limit = Q1 - (1.5 × IQR) =
     Mild outliers:
     Extreme outlier limits:
       Upper limit = Q3 + (3 × IQR) =
       Lower limit = Q1 - (3 × IQR) =
     Time Sequence Plot:
     • A time series or time sequence is a data set in which the observations are recorded in the order in which they occur.
     • A time series plot is a graph in which the vertical axis denotes the observed value of the variable and the horizontal axis denotes the time
  • 32. Scatter Diagrams:
     • Multivariate: each observation consists of measurements of several variables
     • The scatter diagram is a useful way to graphically display the potential relationship between quality and one of the other variables
     • When two or more variables exist, the matrix of scatter diagrams may be useful in looking at all of the pairwise relationships between the variables in the sample
     • The sample correlation coefficient is a quantitative measure of the strength of the linear relationship between two random variables x and y
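The slides do not give a formula for the sample correlation coefficient. The usual definition is \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\), where \(S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})\) and \(S_{xx}\), \(S_{yy}\) are the corresponding sums of squared deviations. The sketch below implements that definition; the function name and the paired measurements are hypothetical and serve only as an illustration.

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation coefficient r for paired data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired measurements, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = sample_correlation(x, y)
print(f"r = {r:.3f}")
```

Values of r near +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one; r always lies between -1 and +1.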