1.IntroDescriptiveDisplay-20222023WS.pdf
1. Welcome and Introduction
Introduction to Descriptive Statistics
Methods of Displaying Data
Prof. Dr. Constantinos Antoniou
Chair of Transportation Systems Engineering
c.antoniou@tum.de
Tuesday, October 18, 2022
Applied Statistics in Transport
2. Practical information - Lecturers
Prof. Dr. C. Antoniou
c.antoniou@tum.de
Mohamed Abouelela
mohamed.abouelela@tum.de
18/10/2022
Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 1
3. Course Topics
1. Introduction to descriptive statistics
2. Methods of displaying data
3. Probability theory and important distributions
4. Confidence intervals and sample sizes
5. Statistical testing/ hypothesis testing
6. Correlation and regression
4. Credits
• Some lectures rely on material from Prof. Haris N.
Koutsopoulos (Northeastern University), Prof. Petros
Vythoulkas, and the book Washington, Karlaftis and Mannering
(2003, 2009)
8. Uncertainty
• Values are not the same under the same conditions
– Peak traffic flows
– Annual rainfalls
– Steel yield strengths
– 911 emergency calls
– Number of people served at a bank window
• Variability
– Important implications for
• Decision making
• Design
• Operations
• Tools for studying and dealing with uncertainty
– Probability and statistics
9. Wait/Walk dilemma
• Waiting for a bus at a stop
– Duration of the wait may exceed the time to walk
to your destination
• 2008 "Year in Ideas", The New York Times Magazine
– Thompson, Clive (2008-12-13). "The Bus-Wait Formula"
• Wikipedia
– https://en.wikipedia.org/wiki/Wait/walk_dilemma
11. Explosion in Data Availability
"Information has gone from scarce to
superabundant. That brings huge new
benefits but also big headaches."
Economist, Feb. 2010
Source: TomTom
12. Challenges
Data can be very noisy
• Measurement errors
• Other sources
13. Big Data – the three (four, five, …) V’s
• Volume:
– Increasingly massive datasets hard to manage
– Large Hadron Collider experiment, 150 million
sensors delivering data 40 million times per
second.
• Variety:
– Data complexity is growing
– More types of data captured than ever before,
quantification of self etc.
14. Big Data – the three (four, five, …) V’s
• Velocity:
– Some data is arriving so rapidly it must be either processed
instantly or lost
– Whole subfield of ‘streaming data’
• Veracity?
• Value?
• …
15. Impact of Big Data
Big Data promises to revolutionize numerous
areas:
• Big science:
– Personalized genomics
– Meteorology
• Entertainment:
– Netflix recommender system, $1,000,000 challenge
to improve system
– Hit show ‘House of Cards’ designed based on
analysis
• …
17. Machine Learning
• Big Data sets are too large
for a human to analyze
• They require computers that can learn the
structure and patterns in the data to extract
meaningful insights and applications
• Machine learning and Big Data are
inextricably linked
• ML is hard to define: it contains elements of
Artificial Intelligence, Statistics, Computer
Science, Control Theory, Engineering
18. So what?
• How can data help plan, manage and operate transportation
systems?
19. Skills needed in data science
[National Institute of Standards and Technology (NIST)]
Source: NIST Big Data. "Draft NIST Big Data Interoperability Framework, Volume 1", 2014.
http://docplayer.net/7239072-Draft-nist-big-data-interoperability-framework-volume-1-definitions.html.
22. CRISP-DM Process
Model for Data Mining
Source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.198.5133&rep=rep1&type=pdf
24. The Human Centered KDD Process and the SEMMA
Methodological Steps
Source: Mariscal, Gonzalo, Oscar Marban,
and Covadonga Fernandez. "A survey of data
mining and knowledge discovery process
models and methodologies." The Knowledge
Engineering Review 25.02 (2010): 137-166
25. The SEMMA Model Development Process
Source: http://www.sas.com/content/dam/SAS/en_gb/doc/other1/events/sasforum/slides/manchester-day2/I.%20Brown%20Data%20Exploration%20and%20Visualisation%20in%20SAS%20EM_IB.pdf
26. Guide to Analytic Selection
(Booz Allen Hamilton)
Source: http://www.boozallen.com/insights/2015/12/data-science-field-guide-second-edition.
27. Degree of Intelligence in Data Analytics
Source: Adapted from: Davenport, Thomas H., and Jeanne G. Harris. Competing on analytics: The new
science of winning. Harvard Business Press, 2007
Analysis
28. Data preparation
• Our data has to follow our assumptions for x and y
• All sorts of little tasks
– Parse datasets
– Convert value types (e.g. numeric to nominal)
– Eliminate errors, (useless) outliers
– Obtain intermediate values (e.g. $x_{n+1} = f(x_1, x_2)$)
• Descriptive statistics
• This is where we spend MOST of the time! Some people say
90%...
29. Data analysts - the bad news
30. Data analysts - the “good” news
32. Probability
• Probability is a numerical measure of the likelihood that an event
will occur
• Probability values are always assigned on a scale from 0 to 1
• A probability near 0 indicates an event is very unlikely to occur
• A probability near 1 indicates an event is almost certain to occur
[Figure: probability scale from 0 to 1, with increasing likelihood of occurrence; at 0.5 the occurrence of the event is just as likely as it is unlikely]
33. Humans are very bad at understanding probabilities
[Figure: extreme daily market changes, labeled Black Monday (1987.10.19, -22.61%), Financial Crisis (2008.10.15, -7.87%; 2008.10.13 and 2008.10.28, +11.09% and +10.88%) and Great Depression]
34. We Fear Spectacular, Unlikely Events
• “After 9/11, 1.4 million people changed their holiday travel
plans to avoid flying. The vast majority chose to drive instead.
• But driving is far more dangerous than flying, and the decision
to switch caused roughly 1,000 additional auto fatalities,
according to two separate analyses comparing traffic patterns
in late 2001 to those the year before.
• In other words, 1,000 people who chose to drive wouldn't have
died had they flown instead.”
https://www.psychologytoday.com/articles/200801/10-ways-we-get-the-odds-wrong
35. We roll 6 (fair) dice
Which of the following sequences is most likely to occur?
• 1-1-1-1-1-1
• 1-2-3-4-5-6
• 4-3-6-5-4-2
• 2-3-6-4-5-1
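The question above can be checked with a short calculation. The sketch below assumes the six dice are read in a fixed order and are independent, so that each specific ordered sequence is one outcome; the function name `sequence_probability` is illustrative and not part of the lecture.

```python
from fractions import Fraction

def sequence_probability(sequence, sides=6):
    """Probability of observing one specific ordered sequence of fair-die rolls."""
    # Rolls are independent, so the per-roll probabilities (1/sides) multiply.
    return Fraction(1, sides) ** len(sequence)

sequences = [
    (1, 1, 1, 1, 1, 1),
    (1, 2, 3, 4, 5, 6),
    (4, 3, 6, 5, 4, 2),
    (2, 3, 6, 4, 5, 1),
]

for seq in sequences:
    p = sequence_probability(seq)
    print(seq, p, float(p))
```

Under these assumptions all four sequences share the same probability, 1/46656, which is why none of them is more likely than the others.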
36. Gambler’s fallacy (Monte Carlo, 1913)
• The gambler's fallacy, also known as the Monte Carlo fallacy or the fallacy
of the maturity of chances, is the mistaken belief that, if something
happens more frequently than normal during some period, it will happen
less frequently in the future
• In a game of roulette at the Monte Carlo Casino on August 18, 1913, when
the ball fell in black 26 times in a row (an extremely uncommon
occurrence, although no more or less common than any of the other
67,108,863 sequences of 26 red or black), gamblers lost millions of francs
betting against black, reasoning incorrectly that the streak was causing an
"imbalance" in the randomness of the wheel, and that it had to be
followed by a long streak of red.
https://en.wikipedia.org/wiki/Gambler%27s_fallacy
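The count of sequences quoted above can be reproduced in a few lines. This is a minimal sketch under a simplified model in which each spin is red or black with equal probability (the zero pocket of a real wheel is ignored, which is an assumption made here for illustration).

```python
from fractions import Fraction

STREAK = 26

# Number of distinct red/black sequences of length 26 (zero pocket ignored).
total_sequences = 2 ** STREAK
other_sequences = total_sequences - 1

# Under the simplified 50/50 model every specific sequence is equally likely.
p_specific_sequence = Fraction(1, total_sequences)

# Independence: the chance of black on the next spin does not depend on the streak.
p_black_next = Fraction(1, 2)

print(total_sequences)   # 67108864
print(other_sequences)   # 67108863
print(p_specific_sequence, p_black_next)
```

The point of the sketch is that 26 blacks in a row is one sequence among 2^26, and that the probability of the next colour is unchanged by whatever came before.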
37. Statistics
• The analysis of data for the purpose of reaching a decision or
gaining insight into the behavior of many phenomena
• Examples
– Weather
– Pollution/contamination
– Soil condition
– Traffic
– Marketing
– Design of facilities
38. Probability and Statistics
Probability is deductive
Probability: Given the
information in the box, what is
in your hand?
Statistics is inductive
Statistics: Given the information
in your hand, what is in the box?
40. Populations and Samples
A population is a well-defined
collection of objects or units of
interest
A subset of the population
is a sample
42. Reasons for Drawing a Sample
• Less time consuming
• Less costly to administer
• Less cumbersome and more practical
43. Types of Data
A variable is discrete if its set of possible values
constitutes a finite set or a countably infinite sequence
A variable is continuous if its set of possible values
consists of an entire interval on the number line
44. Statistics
Descriptive Statistics – summary and
description of collected data
Inferential Statistics – generalizing from
a sample to a population
45. Presenting Data
• For data to be useful, it must
be summarized
– Tables
– Graphs and charts
• Visualization
46. Visualization
• Floating car data in Stockholm
• Public transportation passenger movements in London
• Profiles of public transport users
50. Visualisation – increased needs
• Traditionally it was “easy” to
look at the model inputs and
outputs
• Interpretation and analysis
• To understand Big data we
need a lot of work and the
development of new
strategies
54. Virtual/augmented reality
CAVE (no need for glasses)
• LRZ Virtual Reality and
Visualisation Centre (V2C)
• LRZ Holobench
More accessible technologies
• Oculus Rift, etc.
• Upcoming versions will not require
a powerful computer
57. Correlation
• The relationship between things that happen or change
together (Merriam-Webster Dictionary)
The Redskins moved to Washington, DC in 1937. Since then, there
have been 19 presidential elections. In 17 of those, the following
applied:
"If the Redskins win their last home game before the election, the
party that won the previous election wins the next election and that
if the Redskins lose, the challenging party's candidate wins."
Source: https://en.wikipedia.org/wiki/Redskins_Rule
58. Correlation vs. Causality
• Correlation
– The relationship between things that happen or change together
(Merriam-Webster Dictionary)
• Causality
– The relationship between something that happens or exists and the
thing that causes it
59. Spurious regression examples (1)
http://www.tylervigen.com/spurious-correlations
[Figure: US spending on science, space, and technology correlates with suicides by hanging, strangulation and suffocation, 1999 to 2009. Axes: hanging suicides (4000 to 10000) and US spending on science ($15 billion to $30 billion). Source: tylervigen.com]
60. Spurious regression examples (2)
http://www.tylervigen.com/spurious-correlations
[Figure: number of people who drowned by falling into a pool correlates with films Nicolas Cage appeared in, 1999 to 2009. Axes: Nicolas Cage films (0 to 6) and swimming pool drownings (80 to 140). Source: tylervigen.com]
61. Spurious regression examples (3)
http://www.tylervigen.com/spurious-correlations
[Figure: per capita cheese consumption correlates with number of people who died by becoming tangled in their bedsheets, 2000 to 2009. Axes: bedsheet tanglings (200 to 800 deaths) and cheese consumed (28.5 lbs to 33 lbs). Source: tylervigen.com]
62. Spurious regression examples (4)
http://www.tylervigen.com/spurious-correlations
[Figure: age of Miss America correlates with murders by steam, hot vapours and hot objects, 1999 to 2009. Axes: murders by steam (2 to 8) and age of Miss America (18.75 to 25 yrs). Source: tylervigen.com]
63. Spurious regression examples (5)
http://www.tylervigen.com/spurious-correlations
[Figure: letters in winning word of Scripps National Spelling Bee correlates with number of people killed by venomous spiders, 1999 to 2009. Axes: deaths (0 to 15) and letters in the winning word (5 to 15). Source: tylervigen.com]
64. Bradford-Hill criteria
• The Bradford Hill criteria,
otherwise known as Hill's criteria
for causation, are a group of
minimal conditions necessary to
provide adequate evidence of a
causal relationship between an
incidence and a possible
consequence, established by the
English epidemiologist Sir Austin
Bradford Hill (1897–1991) in 1965
• Strength
• Consistency
• Specificity
• Temporality
• Biological gradient
• Plausibility
• Coherence
• Experiment
• Analogy
https://en.wikipedia.org/wiki/Bradford_Hill_criteria
66. Graphical Data Representation
• Stem-and-leaf displays
• Dotplots
• Histograms
67. Stem-and-Leaf Example
Observed values: 9, 11, 10, 15, 22, 9, 15, 16, 24
68. Stem-and-Leaf Example
• Select one or more leading digits for the stem
values. The trailing digits become the leaves
• List stem values in a vertical column
• Record the leaf for every observation
• Indicate the units for the stem and leaf on the
display
• Displays with between 5 and 20 stems are recommended
69. Stem-and-Leaf Example
Observed values: 9, 11, 10, 15, 22, 9, 15, 16, 24
0 | 9 9
1 | 0 1 5 5 6
2 | 2 4
Stem: tens digit Leaf: units digit
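The display above can be generated programmatically. The following is a minimal sketch, assuming non-negative integer data; the function `stem_and_leaf` and its `leaf_unit` parameter are illustrative names, not part of the lecture material.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a stem-and-leaf display (stem = leading digits, leaf = next digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = v // leaf_unit          # drop digits below the leaf unit
        stems[scaled // 10].append(scaled % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):   # keep empty stems so gaps stay visible
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

observed = [9, 11, 10, 15, 22, 9, 15, 16, 24]
for line in stem_and_leaf(observed):
    print(line)
```

With `leaf_unit=10`, four-digit values such as golf course lengths would be shown with the hundreds as stems and the tens digit as leaves (the units digit is truncated).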
70. Stem-and-Leaf Example
• 40 golf courses were sampled for their lengths. The lengths ranged
from 6433 to 7280
• 7051 6470 6526 6527 6583 6694 7209 6614 6790 6770 6700
6770 6713 6870 6873 6850 6900 6927 6464 6904 7005 6433
7280 6890 7131 6605 7169 7168 7105 7113 7165 7011 6506
7040 6798 7050 6745 7022 6435 6936
73. Stem-and-Leaf Displays
• Identify typical value
• Extent of spread about a value
• Presence of gaps
• Extent of symmetry
• Number and location of peaks
• Presence of outlying values
75. Dotplots
Observed values: 9, 10, 15, 22, 9, 15, 16, 24, 11
• Represent data with dots above horizontal measurement scale
• Attractive for small data sets and few distinct data values
[Figure: dotplot of the observed values on a horizontal scale from 5 to 25]
76. Histograms: Discrete Data
• Determine the frequency and relative
frequency for each value of x.
• Mark possible x values on a horizontal
scale. Above each value, draw a
rectangle whose height is the (relative)
frequency of that value.
77. Histograms: Discrete Variables
• Frequency of a value
– The number of times the value occurs in the data set
• Relative frequency
– The proportion of times the value occurs
• Sum of relative frequencies of all values in sample is 1
$\text{relative frequency of a value} = \dfrac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$
78. Choosing a suitable bin size
https://statistics.laerd.com/statistical-guides/understanding-histograms.php
80. Histograms
• Histograms are sensitive to number of intervals
– Try different values
– Sturges’ rule
– Scott’s rule
– Rule of thumb
$k = 1 + 3.322 \log_{10} n$
$k = 1.667\, n^{0.333}$
$\text{number of classes} \approx \sqrt{\text{number of observations}}$
[Figure: histogram of frequency against X, with the interval width marked]
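To see how such rules behave, the sketch below evaluates Sturges' rule (k = 1 + 3.322 log10 n) and the square-root rule of thumb for a few sample sizes. Rounding Sturges' value up and rounding the square root to the nearest integer are choices made here for illustration, and the function names are hypothetical.

```python
import math

def sturges_classes(n):
    """Sturges' rule: k = 1 + 3.322 * log10(n), rounded up to a whole class."""
    return math.ceil(1 + 3.322 * math.log10(n))

def sqrt_classes(n):
    """Rule of thumb: number of classes is roughly sqrt(number of observations)."""
    return round(math.sqrt(n))

for n in (40, 150, 1000):
    print(n, sturges_classes(n), sqrt_classes(n))
```

The two rules can disagree noticeably for larger samples, which is one reason to try several values and inspect the resulting histograms.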
82. Ex. Students were asked how many train trips
they made last week. x is the variable representing
the number of trips and the results are below.
Frequency Distribution
x #people Rel. Freq
0 12 0.08
1 42 0.28
2 57 0.38
3 24 0.16
4 9 0.06
5 4 0.03
6 2 0.01
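The relative frequencies in this example follow directly from the counts. A minimal sketch, using the counts from the train-trip survey (the variable names are illustrative):

```python
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}  # trips -> number of students

n = sum(counts.values())                      # total number of observations
rel_freq = {x: c / n for x, c in counts.items()}

for x, f in rel_freq.items():
    print(x, counts[x], round(f, 2))

print(n)
```

The relative frequencies sum to 1, as stated for discrete histograms earlier.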
83. Histograms
x Rel. Freq.
0 0.08
1 0.28
2 0.38
3 0.16
4 0.06
5 0.03
6 0.01
[Figure: histogram of the trip results, relative frequency (0 to 0.4) against number of trips (0 to 6)]
84. Histograms with Continuous Data: Equal
Class Widths
• Determine the class size
• Determine the frequency and relative
frequency for each class.
• Mark the class boundaries on a
horizontal measurement axis. Above
each class interval, draw a rectangle
whose height is the relative frequency.
85. Guidelines: Number of classes
– More classes in larger data sets
– Between 5 and 20
– Rule of thumb: $\text{number of classes} \approx \sqrt{\text{number of observations}}$
– Classes usually of equal width
86. Histogram Shapes
symmetric unimodal bimodal
positively skewed negatively skewed
87. Histograms with Qualitative Data
• Histograms can be used even with qualitative data
• Example: rating of K-12 education in California (survey data)
88. Boxplots
• Graphical representation of dispersion, skewness, outliers and other
prominent features in data using quartiles
• Construction
• Order the n observations in a data set from smallest to largest
• Separate the smallest half and the largest half (the median $\tilde{x}$ is included in both
halves if n is odd)
• Find the lower fourth (median of the smallest half of the data)
• Find the upper fourth (median of the largest half of the data)
• Find the fourth spread fs (a measure of the spread, resistant to outliers):
fs = upper fourth – lower fourth
[Figure: boxplot showing min, lower fourth, median, upper fourth and max, with fs spanning the box]
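The construction steps above can be sketched in code. This is an illustrative implementation of the fourths as described on the slide (the median is included in both halves when n is odd); the function name `fourths` is hypothetical, and the sample reuses the observed values from the stem-and-leaf example.

```python
from statistics import median

def fourths(data):
    """Return (lower fourth, median, upper fourth, fourth spread) as defined above."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2          # for odd n the median belongs to both halves
    lower_half = xs[:half]
    upper_half = xs[n - half:]
    lower, upper = median(lower_half), median(upper_half)
    return lower, median(xs), upper, upper - lower

sample = [9, 11, 10, 15, 22, 9, 15, 16, 24]   # values from the stem-and-leaf example
print(fourths(sample))
```

Note that fourths defined this way can differ slightly from the quartiles returned by statistical packages, which use a variety of interpolation conventions.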
89. Boxplot Example
Median = 90
upper fourth = 96.5
lower fourth = 72.5
[Figure: boxplot on a scale from 30 to 130, with the median and the lower and upper fourths marked]
90. Finding Outliers with Boxplots
Any observation farther than 1.5fs from
the closest fourth is an outlier.
An outlier is extreme if it is more than 3fs
from the nearest fourth, and it is mild
otherwise.
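The outlier rule can be written as a small classifier. The sketch below uses the fourths from the boxplot example (72.5 and 96.5, so fs = 24); the observations tested are hypothetical values chosen only to exercise each branch, and the function name `classify` is illustrative.

```python
def classify(value, lower_fourth, upper_fourth):
    """Label a value as 'not an outlier', 'mild outlier' or 'extreme outlier'."""
    fs = upper_fourth - lower_fourth
    # Distance beyond the closest fourth (zero if the value lies between the fourths).
    distance = max(lower_fourth - value, value - upper_fourth, 0)
    if distance > 3 * fs:
        return "extreme outlier"
    if distance > 1.5 * fs:
        return "mild outlier"
    return "not an outlier"

lower_fourth, upper_fourth = 72.5, 96.5   # fourths from the boxplot example
for value in (30, 90, 125, 140, 175):     # hypothetical observations
    print(value, classify(value, lower_fourth, upper_fourth))
```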
95. Types of Boxplots
• Whiskers representing
• the minimum and maximum of all of the data, or
• the lowest datum still within 1.5 fs of the lower quartile, and the highest datum still within 1.5 fs of the upper quartile (Tukey boxplot)
[Figure: boxplot with the lower fourth, median, upper fourth, Min (non outlier) and Max (non outlier) labeled]
96. NUMERICAL REPRESENTATION OF DATA
Measures of location
Measures of variability and dispersion
97. Measures of Location
Location
Mean Median Mode
Quartiles,
Percentiles
98. Summary Measures
Location
Mean Median Mode
Quartile
Geometric Mean
Summary Measures
Variation
Variance and standard Deviation
Coefficient of Variation
Range
99. Measures of Central Tendency
Central Tendency
Average Median Mode
Geometric Mean
100. Measures of Variation
Variation
Variance Standard deviation Coefficient of
Variation
Population
Variance
Sample
Variance
Population
Standard Deviation
Sample
Standard Deviation
Range
101. The Mean
The average (mean) of the numbers $x_1, x_2, \ldots, x_n$:
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
102. Mean Value and Frequency Distributions
x #people
0 12
1 42
2 57
3 24
4 9
5 4
6 2
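For a frequency distribution, the mean is obtained by weighting each value by how often it occurs. A minimal sketch using the table above (the variable names are illustrative):

```python
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}  # x -> number of people

n = sum(counts.values())
# Each value x contributes x * (its frequency) to the sum of all observations.
total = sum(x * c for x, c in counts.items())
mean = total / n

print(n, total, round(mean, 3))
```

This is the same computation as the ordinary mean, applied to 150 individual observations grouped by value.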
103. Median
The median is the middle value in a set of
data that is arranged in ascending order. For
an even number of data points, the median
is the average of the middle two.
105. Mean vs. median: does it really matter?
Credit Suisse Global Wealth Report 2015
https://www.credit-suisse.com/media/mediarelease-assets/pdf/2015/10/gwr-2015-global-press-release.pdf
108. Which is larger?
• Can we say something about which would be larger? Median or
mean?
– Without any other information!
• Why?
• Mean will in general be larger, because it can be inflated by
outliers.
• However, this is not always true, see e.g. the discussion in
https://ww2.amstat.org/publications/jse/v13n2/vonhippel.html
109. Other Measures of Location
• Median divides data set into two parts of equal
size
• Quartiles divide the data set into 4 equal parts
• Percentiles divide the data set into even finer
parts, e.g. the 99th percentile
110. Outliers
• Outliers: observations with extreme values
• Mean: sensitive to outliers
• Median: not sensitive to outliers
• Trimmed mean
• 10% trimmed: mean after eliminating the smallest 10% and largest
10% of values
111. Example: Impact of Outliers
Original data: 612, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 19299/20 = 965.0
• Median = 1009.5
Smallest value replaced by 0: 0, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 18687/20 = 934.35
• Median = 1009.5
112. Example: Impact of Outliers
Original data: 612, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 19299/20 = 965.0
• Median = 1009.5
Smallest value replaced by 0 and largest by 7201: 0, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 7201
• Mean = 24687/20 = 1234.4
• Median = 1009.5
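The figures in this example can be verified with the standard library, and the same data also illustrate the 10% trimmed mean mentioned earlier. This is a sketch; the helper `trimmed_mean` is an illustrative implementation, not a function from the lecture.

```python
from statistics import mean, median

original = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
            1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]
with_outliers = [0] + original[1:-1] + [7201]   # smallest -> 0, largest -> 7201

def trimmed_mean(data, proportion=0.10):
    """Mean after dropping the smallest and largest `proportion` of the values."""
    xs = sorted(data)
    k = int(len(xs) * proportion)      # number of values trimmed from each end
    return mean(xs[k:len(xs) - k])

for name, data in (("original", original), ("with outliers", with_outliers)):
    print(name, mean(data), median(data), trimmed_mean(data))
```

The mean moves from about 965 to about 1234 when the two extreme values are introduced, while the median and the 10% trimmed mean are unchanged, because the altered values sit in the tails that the median ignores and the trimming discards.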
114. Measures of Variability
Variability
Variance
Standard Deviation
Coefficient of
Variation
Population
Sample
Range
Boxplots (graphical representation)
115. Measures of Variability
• Range
– Let xmin be the minimum value in the sample, and xmax the
maximum. The range is defined as:
range = xmax - xmin
• Deviation from the mean
– Let xi be a value in the sample. Its deviation is defined as:
deviation = mean - xi
[Figure: seven points labeled 1 to 7, with a deviation marked]
116. Sample Variance
• Variance is a measure of the spread of the data
• The sample variance of the sample x1, x2, …, xn of n
values of X is
$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
s² has n – 1 degrees of freedom*
*This makes this sample variance an unbiased estimator for the population variance.
For more information, see appendix.
117. Standard Deviation
Standard deviation is a measure of the spread of the data
using the same units as the data.
The sample standard deviation is the square root of the
sample variance:
$s = \sqrt{s^2}$
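As a hedged illustration, the sample variance (with the n – 1 divisor) and the standard deviation can be computed by hand and compared with Python's `statistics` module. The sample reuses the observed values from the stem-and-leaf example, and the function name `sample_variance` is illustrative.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

sample = [9, 11, 10, 15, 22, 9, 15, 16, 24]   # values from the stem-and-leaf example
s2 = sample_variance(sample)
s = math.sqrt(s2)                             # standard deviation, same units as the data

print(round(s2, 3), round(s, 3))
# statistics.variance also uses the n - 1 definition, so the two should agree:
print(math.isclose(s2, statistics.variance(sample)))
```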
118. Example
• Mean =
• Variance =
x #
0 8
1 15
2 26
3 53
4 25
5 16
6 7
[Figure: bar chart of the frequencies (vertical axis from 0 to 60) against x from 0 to 6]
119. Example
• Mean = 2.993
• Variance =
x #
0 21
1 22
2 22
3 21
4 21
5 22
6 21
[Figure: bar chart of the frequencies (vertical axis from 19 to 23) against x from 0 to 6]
120. Example
Peaked distribution (vertical axis from 0 to 60):
• Mean = 2.987
• Variance = 2.094
• Standard deviation = 1.447
Nearly uniform distribution (vertical axis from 19 to 23):
• Mean = 2.993
• Variance = 4.00
• Standard deviation = 2.00
[Figure: the two bar charts from the previous two examples]
121. Comparing Standard Deviations
Data A: Mean = 15.5, s =
Data B: Mean = 15.5, s =
Data C: Mean = 15.5, s =
[Figure: three dotplots (Data A, Data B, Data C), each on a scale from 11 to 21]
123. Properties of Standard Deviations
Add 1 to all values. What happens to the mean? To the standard deviation?
[Figure: two panels of seven points labeled 1 to 7]
125. Properties of Standard Deviations
Multiply all values by 2. What happens to the mean? To the standard deviation?
[Figure: two panels of points, labeled 1 to 7 and 1 to 9]
126. Properties of s2
Let x1, x2,…, xn be any sample and c be any
nonzero constant.
1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$
2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$
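Both properties can be checked numerically. The sketch below uses an illustrative sample (the values 1 to 7) and c = 2; these are assumptions made for the demonstration only.

```python
import math
from statistics import variance

x = [1, 2, 3, 4, 5, 6, 7]   # illustrative sample
c = 2

shifted = [xi + c for xi in x]   # y_i = x_i + c
scaled = [c * xi for xi in x]    # y_i = c * x_i

print(variance(x), variance(shifted), variance(scaled))
```

Shifting every value leaves the variance unchanged, because the deviations from the mean are unchanged; scaling multiplies every deviation by c, so the variance is multiplied by c² and the standard deviation by |c|.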
127. Population Variance
• Finite population of size N
• Variance $\sigma^2 = \dfrac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
• μ: population mean
128. Coefficient of Variation, CV
Measures the standard deviation relative to the mean
When only a sample of data from a population is available,
the population CV can be estimated using the ratio of the
sample standard deviation to the sample mean: $CV = s / \bar{x}$
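A short sketch of the sample estimate of the CV follows. The data are hypothetical travel times, and the function name `coefficient_of_variation` is illustrative; the example also shows that the CV does not depend on the measurement unit.

```python
import math
from statistics import mean, stdev

def coefficient_of_variation(sample):
    """Estimate the CV as the sample standard deviation divided by the sample mean."""
    return stdev(sample) / mean(sample)

travel_times_min = [9, 11, 10, 15, 22, 9, 15, 16, 24]      # hypothetical values, in minutes
travel_times_sec = [60 * t for t in travel_times_min]      # same data in seconds

cv_min = coefficient_of_variation(travel_times_min)
cv_sec = coefficient_of_variation(travel_times_sec)
print(round(cv_min, 3), math.isclose(cv_min, cv_sec))
```

Because both the standard deviation and the mean scale with the unit, their ratio is dimensionless, which makes the CV convenient for comparing variability across data sets with different units or magnitudes (for positive-valued data with a mean well away from zero).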
129. You may also hear of “moments”
• 1st moment – sample mean
• 2nd moment – variance
• 3rd moment – skewness
• 4th moment – kurtosis
• …
130. DOS AND DON’TS OF VISUALIZATION
134. Visualizing Clusters with Parallel Coordinates
Source: http://vis.pku.edu.cn/wiki/project/hdvis
138. Heatmap vs. Parallel Coordinates
(for the Same Data)
Source: Gehlenborg, Nils, and Bang Wong. "Points of view: Heat maps." Nature Methods 9.3 (2012): 213-213.
140. Bad visualization examples (“chartjunk”)
141. What could we have done instead?
[Figure: two versions of the same plot of "Degree of doing it" (y-axis) for the categories "about to do it", "almost doing it", "frenching (which will lead to doing it)", and "totally doing it": a dot plot with y-axis ticks from 10 to 50, and a second version with the y-axis starting at 0]
142. And how did we do it?
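library(ggplot2)  # note: qplot() is deprecated in recent ggplot2 releases
# Values and their category labels ("\n" wraps long labels on the axis)
xx <- c(5, 18, 20, 57)
x1 <- c("almost\ndoing it", "frenching\n(which will\nlead to\ndoing it)",
        "totally\ndoing it", "about to\ndo it")
# Dot plot
qplot(x = x1, y = xx, xlab = "", ylab = "Degree of doing it")
ggsave("dotexample1.pdf", width = 6, height = 4)
# Same data with bars added; the bars make the y-axis start at 0
qplot(x = x1, y = xx, xlab = "", ylab = "Degree of doing it") +
  geom_bar(stat = "identity")
ggsave("barexample1.pdf", width = 6, height = 4)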
143. What is the (main) problem with this figure?
http://andrewgelman.com/2015/06/22/hey-whats-up-with-that-x-axis/
144. Fixed, shows a totally different trend
http://andrewgelman.com/2015/06/22/hey-whats-up-with-that-x-axis/
153. Why not use pies?
http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 23)
156. Avoid unnecessary graphic features
http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 29)
157. Tufte’s principles
• Keep a high data-ink ratio
• Remove chart junk
• Give graphical elements multiple functions
• Keep in mind the data density
• The term to search for is Information Visualization
159. Good Practices
• Clearly labeled
– Title – general subject
– Label all variables
– Units should be specified
• Identify Source of Data
• Date Data
160. Some of the things to keep in mind
• Figure may (will) be printed in black and white and/or
photocopied
• More people are colorblind than we often think!
• Projectors are not as bright as your screen
• …
• Less is more!
163. Motivation for $n - 1$ in $s^2$
• To explain the rationale for the divisor $n - 1$ in $s^2$, note first that
whereas $s^2$ measures sample variability, there is a measure of
variability in the population called the population variance.
• We will use $\sigma^2$ (the square of the lowercase Greek letter sigma)
to denote the population variance and $\sigma$ to denote the
population standard deviation (the square root of $\sigma^2$).
164. Motivation for $s^2$
• When the population is finite and consists of $N$ values,
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• which is the average of all squared deviations from the population
mean (for the population, the divisor is $N$ and not $N - 1$).
• Just as $\bar{x}$ will be used to make inferences about the population
mean $\mu$, we should define the sample variance so that it can be
used to make inferences about $\sigma^2$. Now note that $\sigma^2$ involves
squared deviations about the population mean $\mu$.
165. Motivation for $s^2$
• If we actually knew the value of $\mu$, then we could define the sample
variance as the average squared deviation of the sample $x_i$'s about
$\mu$.
• However, the value of $\mu$ is almost never known, so the sum of
squared deviations about $\bar{x}$ must be used.
• But the $x_i$'s tend to be closer to their average $\bar{x}$ than to the
population average $\mu$, so to compensate for this the divisor $n - 1$ is
used rather than $n$.
166. Motivation for $s^2$
• In other words, if we used a divisor $n$ in the sample variance, then
the resulting quantity would tend to underestimate $\sigma^2$ (produce
estimated values that are too small on the average), whereas
dividing by the slightly smaller $n - 1$ corrects this underestimation.
• It is customary to refer to $s^2$ as being based on $n - 1$ degrees of
freedom (df). This terminology reflects the fact that although $s^2$ is
based on the $n$ quantities
$x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$,
these sum to 0, so specifying the values of
any $n - 1$ of the quantities determines the remaining value.
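The underestimation can be checked exactly on a small case. The Python sketch below assumes a hypothetical population of the values 1 to 7 and independent draws (sampling with replacement). It enumerates every ordered sample of size 2 and averages the two candidate estimators; the function name `variance` and its `divisor` argument are illustrative.

```python
from itertools import product

def variance(sample, divisor):
    """Sum of squared deviations about the sample mean, over a chosen divisor."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / divisor

# Hypothetical finite population (values chosen for illustration)
population = [1, 2, 3, 4, 5, 6, 7]
n = 2

# Population variance: divisor N
sigma2 = variance(population, len(population))

# All ordered samples of size n drawn with replacement (7^2 = 49 samples)
samples = list(product(population, repeat=n))

# Average of each estimator over all possible samples
avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Under these assumptions the divisor n - 1 averages to the population variance, while the divisor n averages to (n - 1)/n times it, which is half of it for n = 2.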
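167. Motivation for $s^2$
• For example, if $n = 4$ and three of the deviations $x_i - \bar{x}$ are given,
then the fourth is automatically determined because the four deviations sum to 0,
so only three of the four values of $x_i - \bar{x}$ are freely determined (3 df).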
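A small Python sketch of the degrees-of-freedom argument, using a made-up sample of four values (not the numbers from the original slide): the deviations from the sample mean sum to 0, so the fourth deviation can be recovered from the other three.

```python
# Hypothetical sample with n = 4 (values are made up for illustration)
sample = [12, 7, 9, 16]
mean = sum(sample) / len(sample)

# Deviations from the sample mean always sum to 0
deviations = [x - mean for x in sample]
total = sum(deviations)

# Knowing any three deviations fixes the fourth
known = deviations[:3]
recovered_last = -sum(known)

print(deviations, total, recovered_last)
```

Only three of the four deviations can vary freely, which is why $s^2$ for this sample is said to have 3 degrees of freedom.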