Welcome and Introduction
Introduction to Descriptive Statistics
Methods of Displaying Data
Prof. Dr. Constantinos Antoniou
Chair of Transportation Systems Engineering
c.antoniou@tum.de
Tuesday, October 18, 2022
Applied Statistics in Transport
Prof. Dr. C. Antoniou
c.antoniou@tum.de
Practical information - Lecturers
Mohamed Abouelela
mohamed.abouelela@tum.de
18/10/2022
Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 1
Course Topics
1. Introduction to descriptive statistics
2. Methods of displaying data
3. Probability theory and important distributions
4. Confidence intervals and sample sizes
5. Statistical testing/ hypothesis testing
6. Correlation and regression
Credits
• Some lectures rely on material from Prof. Haris N.
Koutsopoulos (Northeastern University), Prof. Petros
Vythoulkas, and the book Washington, Karlaftis and Mannering
(2003, 2009)
BACKGROUND AND MOTIVATION
Why Study
Probability
and Statistics?
https://xkcd.com/936/
Uncertainty
• Values are not the same under the same conditions
– Peak traffic flows
– Annual rainfalls
– Steel yield strengths
– 911 emergency calls
– Number of people served at a bank window
• Variability
– Important implications for
• Decision making
• Design
• Operations
• Tools for studying and dealing with uncertainty
– Probability and statistics
Wait/Walk dilemma
• Waiting for a bus at a stop
– Duration of the wait may exceed the time to walk
to your destination
• 2008 "Year in Ideas", The New York Times Magazine
– Thompson, Clive (2008-12-13). "The Bus-Wait Formula"
• Wikipedia
– https://en.wikipedia.org/wiki/Wait/walk_dilemma
Source: http://www.pindropsecurity.com/data-science-how-do-we-get-started-part-one/
Explosion in Data Availability
Information has gone from scarce to
superabundant. That brings huge new
benefits but also big headaches.
Economist, Feb. 2010
Explosion in Data Availability
Source: TomTom
Challenges
Data can be very noisy
• Measurement errors
• Other sources
Big Data – the three (four, five, …) V’s
• Volume:
– Increasingly massive datasets hard to manage
– Large Hadron Collider experiment, 150 million
sensors delivering data 40 million times per
second.
• Variety:
– Data complexity is growing
– More types of data captured than ever before,
quantification of self etc.
Big Data – the three (four, five, …) V’s
• Velocity:
– Some data is arriving so rapidly it must be either processed
instantly or lost
– Whole subfield of ‘streaming data’
• Veracity?
• Value?
• …
Impact of Big Data
Big Data promises to revolutionize numerous
areas:
• Big science:
– Personalized genomics
– Meteorology
• Entertainment:
– Netflix recommender system, $1,000,000 challenge
to improve system
– Hit show ‘House of Cards’ designed based on
analysis
• …
Machine Learning
• Big Data sets are too large
for a human to analyze
• They require computers that can learn the
structure and patterns in the data to extract
meaningful insights and applications
• Machine learning and Big Data are
inextricably linked
• ML is hard to define: it contains elements of
Artificial Intelligence, Statistics, Computer
Science, Control Theory, Engineering
So what?
• How can data help plan, manage and operate transportation
systems?
Skills needed in data science
[National Institute of Standards and Technology (NIST)]
Source: NIST Big Data. "Draft NIST Big Data Interoperability Framework, Volume 1", 2014.
http://docplayer.net/7239072-Draft-nist-big-data-interoperability-framework-volume-1-definitions.html.
Data Science
Source: https://en.wikipedia.org/wiki/Data_science
CRISP-DM Process
Model for Data Mining
Source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.198.5133&rep=rep1&type=pdf
Source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.198.5133&rep=rep1&type=pdf
CRISP-DM Tasks and their Outputs
The Human Centered KDD Process and the SEMMA
Methodological Steps
Source: Mariscal, Gonzalo, Oscar Marban,
and Covadonga Fernandez. "A survey of data
mining and knowledge discovery process
models and methodologies." The Knowledge
Engineering Review 25.02 (2010): 137-166
The SEMMA Model Development Process
Source: http://www.sas.com/content/dam/SAS/en_gb/doc/other1/events/sasforum/slides/manchester-day2/I.%20Brown%20Data%20Exploration%20and%20Visualisation%20in%20SAS%20EM_IB.pdf
Guide to Analytic Selection
(Booz Allen Hamilton)
Source: http://www.boozallen.com/insights/2015/12/data-science-field-guide-second-edition.
Degree of Intelligence in Data Analytics
Source: Adapted from: Davenport, Thomas H., and Jeanne G. Harris. Competing on analytics: The new
science of winning. Harvard Business Press, 2007
Analysis
Data preparation
• Our data has to follow our assumptions for x and y
• All sorts of little tasks
– Parse datasets
– Convert value types (e.g. numeric to nominal)
– Eliminate errors, (useless) outliers
– Obtain intermediate values (e.g. $x_{n+1} = f(x_1, x_2)$)
• Descriptive statistics
• This is where we spend MOST of the time! Some people say
90%...
Data analysts - the bad news
Data analysts - the “good” news
PROBABILITY VS. STATISTICS
Probability
• Probability is a numerical measure of the likelihood that an event
will occur
• Probability values are always assigned on a scale from 0 to 1
• A probability near 0 indicates an event is very unlikely to occur
• A probability near 1 indicates an event is almost certain to occur
[Figure: probability scale from 0 to 1 showing increasing likelihood of occurrence; at a probability of 0.5 the occurrence of the event is just as likely as it is unlikely]
Humans are very bad at understanding probabilities
[Figure: extreme one-day percentage changes. Labels: Black Monday, 1987.10.19: -22.61%; Financial Crisis, 2008.10.15: -7.87%; Financial Crisis, 2008.10.13 and 2008.10.28: +11.09% and +10.88%; Great Depression (values not recoverable)]
We Fear Spectacular, Unlikely Events
• “After 9/11, 1.4 million people changed their holiday travel
plans to avoid flying. The vast majority chose to drive instead.
• But driving is far more dangerous than flying, and the decision
to switch caused roughly 1,000 additional auto fatalities,
according to two separate analyses comparing traffic patterns
in late 2001 to those the year before.
• In other words, 1,000 people who chose to drive wouldn't have
died had they flown instead.”
https://www.psychologytoday.com/articles/200801/10-ways-we-get-the-odds-wrong
We roll 6 (fair) dice
Which of the following sequences is most likely to occur?
• 1-1-1-1-1-1
• 1-2-3-4-5-6
• 4-3-6-5-4-2
• 2-3-6-4-5-1
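A quick check of the question above can be sketched in Python. This is a minimal sketch, assuming the four options are ordered sequences of six independent rolls of a fair die; the function name `sequence_probability` is introduced here for illustration only.

```python
from fractions import Fraction


def sequence_probability(sequence, sides=6):
    """Probability of one specific ordered sequence of independent fair-die rolls."""
    # Each roll matches its required face with probability 1/sides,
    # and independent rolls multiply.
    return Fraction(1, sides) ** len(sequence)


sequences = [
    (1, 1, 1, 1, 1, 1),
    (1, 2, 3, 4, 5, 6),
    (4, 3, 6, 5, 4, 2),
    (2, 3, 6, 4, 5, 1),
]
probabilities = [sequence_probability(s) for s in sequences]

for s, p in zip(sequences, probabilities):
    print(s, p)
```

Under these assumptions the probability depends only on the number of rolls, not on which faces are requested, so the four values can be compared directly.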
Gambler’s fallacy (Monte Carlo, 1913)
• The gambler's fallacy, also known as the Monte Carlo fallacy or the fallacy
of the maturity of chances, is the mistaken belief that, if something
happens more frequently than normal during some period, it will happen
less frequently in the future
• In a game of roulette at the Monte Carlo Casino on August 18, 1913, when
the ball fell in black 26 times in a row (an extremely uncommon
occurrence, although no more or less common than any of the other
67,108,863 sequences of 26 red or black), gamblers lost millions of francs
betting against black, reasoning incorrectly that the streak was causing an
"imbalance" in the randomness of the wheel, and that it had to be
followed by a long streak of red.
https://en.wikipedia.org/wiki/Gambler%27s_fallacy
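The count of "other" sequences quoted above can be reproduced with a few lines of Python. This is only a sketch that follows the quoted simplification of two outcomes per spin (red or black), so the green zero pocket of a real wheel is ignored; the variable names are illustrative.

```python
from fractions import Fraction

n_spins = 26

# Two possible outcomes per spin (red or black) under the quoted simplification.
n_sequences = 2 ** n_spins
other_sequences = n_sequences - 1  # every sequence except "26 times black"

# With independent spins and equally likely colours, each specific
# sequence of 26 outcomes has the same probability.
probability_of_any_one_sequence = Fraction(1, n_sequences)

print(other_sequences)
print(probability_of_any_one_sequence)
```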
Statistics
• The analysis of data for the purpose of reaching a decision or
gaining insight into the behavior of many phenomena
• Examples
– Weather
– Pollution/contamination
– Soil condition
– Traffic
– Marketing
– Design of facilities
Probability and Statistics
Probability is deductive
Probability: Given the
information in the box, what is
in your hand?
Statistics: Given the information
in your hand, what is in the box?
Statistics is inductive
Inferential Statistics
[Figure: probability reasons from the population to the sample; inferential statistics reasons from the sample to the population]
Populations and Samples
A population is a well-defined
collection of objects or units of
interest
A subset of the population
is a sample
Reasons for Drawing a Sample
• Less time consuming
• Less costly to administer
• Less cumbersome and more practical
Types of Data
A variable is discrete if its set of possible values
constitutes a finite set or an infinite sequence
A variable is continuous if its set of possible values
consists of an entire interval on a number line
Statistics
Descriptive Statistics – summary and
description of collected data
Inferential Statistics – generalizing from
a sample to a population
Presenting Data
• For data to be useful, it must
be summarized
– Tables
– Graphs and charts
• Visualization
Visualization
• Floating car data in Stockholm
• Public transportation passenger movements in London
• Profiles of public transport users
Visualization: Floating car data (FCD) in Stockholm
Visualization: Public Transport Use in London
http://jaygordon.net
Visualisation – increased needs
• Traditionally it was “easy” to
look at the model inputs and
outputs
• Interpretation and analysis
• To understand Big data we
need a lot of work and the
development of new
strategies
MBTA visualisation
Simple multivariate visualisations
Interesting multivariate visualisations
Virtual/augmented reality
CAVE (no need for glasses)
• LRZ Virtual Reality and
Visualisation Centre (V2C)
• LRZ Holobench
More accessible technologies
• Oculus Rift, etc.
• Upcoming versions will not require
a powerful computer
CORRELATION VS. CAUSATION
Correlation
• The relationship between things that happen or change
together (Merriam-Webster Dictionary)
The Redskins moved to Washington, DC in 1937. Since then, there
have been 19 presidential elections. In 17 of those, the following
applied:
"If the Redskins win their last home game before the election, the
party that won the previous election wins the next election and that
if the Redskins lose, the challenging party's candidate wins."
Source: https://en.wikipedia.org/wiki/Redskins_Rule
Correlation vs. Causality
• Correlation
– The relationship between things that happen or change together
(Merriam-Webster Dictionary)
• Causality
– The relationship between something that happens or exists and the
thing that causes it
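To make "change together" concrete, the sketch below computes the Pearson correlation coefficient for two series that share nothing except an upward trend. The numbers are invented for illustration only and do not come from the lecture; `pearson_r`, `series_a` and `series_b` are hypothetical names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data: two quantities that both happen to grow over ten periods.
series_a = [10 + 2 * t + (1 if t % 2 else -1) for t in range(10)]
series_b = [500 + 30 * t + (5 if t % 3 else -5) for t in range(10)]

r = pearson_r(series_a, series_b)
print(round(r, 3))
```

The coefficient comes out close to 1 even though neither series was built from the other, which is why a high correlation on its own says nothing about causality.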
Spurious regression examples (1)
http://www.tylervigen.com/spurious-correlations
US spending on science, space, and technology
correlates with
Suicides by hanging, strangulation and suffocation
[Figure: two time series, 1999 to 2009. Axes: hanging suicides (4000 to 10000) and US spending on science ($15 billion to $30 billion). Source: tylervigen.com]
Spurious regression examples (2)
http://www.tylervigen.com/spurious-correlations
Number of people who drowned by falling into a pool
correlates with
Films Nicolas Cage appeared in
[Figure: two time series, 1999 to 2009. Axes: swimming pool drownings (80 to 140) and Nicolas Cage films (0 to 6). Source: tylervigen.com]
Spurious regression examples (3)
http://www.tylervigen.com/spurious-correlations
Per capita cheese consumption
correlates with
Number of people who died by becoming tangled in their bedsheets
[Figure: two time series, 2000 to 2009. Axes: bedsheet tanglings (200 to 800 deaths) and cheese consumed (28.5 lbs to 33 lbs). Source: tylervigen.com]
Spurious regression examples (4)
http://www.tylervigen.com/spurious-correlations
Age of Miss America
correlates with
Murders by steam, hot vapours and hot objects
[Figure: two time series, 1999 to 2009. Axes: murders by steam (2 to 8) and age of Miss America (18.75 to 25 yrs). Source: tylervigen.com]
Spurious regression examples (5)
http://www.tylervigen.com/spurious-correlations
Letters in Winning Word of Scripps National Spelling Bee
correlates with
Number of people killed by venomous spiders
[Figure: two time series, 1999 to 2009. Axes: number of people killed by venomous spiders (0 to 15 deaths) and Spelling Bee winning word (5 to 15 letters). Source: tylervigen.com]
Bradford-Hill criteria
• The Bradford Hill criteria,
otherwise known as Hill's criteria
for causation, are a group of
minimal conditions necessary to
provide adequate evidence of a
causal relationship between an
incidence and a possible
consequence, established by the
English epidemiologist Sir Austin
Bradford Hill (1897–1991) in 1965
• Strength
• Consistency
• Specificity
• Temporality
• Biological gradient
• Plausibility
• Coherence
• Experiment
• Analogy
https://en.wikipedia.org/wiki/Bradford_Hill_criteria
GRAPHICAL DATA REPRESENTATION
Graphical Data Representation
• Stem-and-leaf displays
• Dotplots
• Histograms
Stem-and-Leaf Example
Observed values: 9, 11, 10, 15, 22, 9, 15, 16, 24
• Select one or more leading digits for the stem
values. The trailing digits become the leaves
• List stem values in a vertical column
• Record the leaf for every observation
• Indicate the units for the stem and leaf on the
display
• Displays with between 5 and 20 stems are recommended
0 | 9 9
1 | 0 1 5 5 6
2 | 2 4
Stem: tens digit; Leaf: units digit
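The construction steps for this small example can be sketched in Python as follows. This is a minimal sketch for non-negative integers, assuming the stem is the tens digit and the leaf the units digit; `stem_and_leaf` and `leaf_unit` are names chosen here for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=10):
    """Group non-negative integers by stem (value // leaf_unit) with sorted leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // leaf_unit].append(v % leaf_unit)
    # Keep empty stems as well, so that gaps in the data stay visible.
    stems = range(min(groups), max(groups) + 1)
    return {stem: groups.get(stem, []) for stem in stems}


observed = [9, 11, 10, 15, 22, 9, 15, 16, 24]
display = stem_and_leaf(observed)

for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting the leaves within each stem is a design choice of this sketch; a display drawn by hand usually records leaves in the order the observations are read.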
Stem-and-Leaf Example
• 40 golf courses were sampled for their lengths. The range was
from 6433 to 7280
• 7051 6470 6526 6527 6583 6694 7209 6614 6790 6770 6700
6770 6713 6870 6873 6850 6900 6927 6464 6904 7005 6433
7280 6890 7131 6605 7169 7168 7105 7113 7165 7011 6506
7040 6798 7050 6745 7022 6435 6936
Stem-and-Leaf Example
64 | 35 64 33 70
65 | 26 27 06 83
66 | 05 94 14
67 | 90 70 00 98 70 45 13
68 | 90 70 73 50
69 | 00 27 36 04
70 | 51 05 11 40 50 22
71 | 31 69 68 05 13 65
72 | 80 09
Stem: thousands and hundreds digits; Leaf: tens and units digits
Stem-and-Leaf Displays
• Identify typical value
• Extent of spread about a value
• Presence of gaps
• Extent of symmetry
• Number and location of peaks
• Presence of outlying values
Dotplots
Observed values: 9, 10, 15, 22, 9, 15, 16, 24, 11
• Represent data with dots above horizontal measurement scale
• Attractive for small data sets and few distinct data values
[Figure: dotplot of the observed values above a horizontal scale from 5 to 25]
Histograms: Discrete Data
• Determine the frequency and relative
frequency for each value of x.
• Mark possible x values on a horizontal
scale. Above each value, draw a
rectangle whose height is the (relative)
frequency of that value.
Histograms: Discrete Variables
• Frequency of a value
– The number of times the value occurs in the data set
• Relative frequency
– The proportion of times the value occurs
• Sum of relative frequencies of all values in sample is 1
$$\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$$
Choosing a suitable bin size
https://statistics.laerd.com/statistical-guides/understanding-histograms.php
https://www.ctspedia.org/wiki/pub/CTSpedia/BasicHistogramExamples/pic3.png
Histograms
• Histograms are sensitive to the number of intervals
– Try different values
– Sturges’ rule: $k = 1 + 3.322 \log_{10} n$
– Scott’s rule: $k = 1.667\, n^{0.333}$
– Rule of thumb: $\text{number of classes} \approx \sqrt{\text{number of observations}}$
[Figure: histogram with frequency on the vertical axis and X on the horizontal axis, with the interval width marked]
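As a rough illustration, two of the rules listed above (Sturges' rule and the square-root rule of thumb) can be evaluated for a given sample size. This is a sketch, assuming the logarithm in Sturges' rule is base 10 and using n = 40, the size of the golf course sample from the stem-and-leaf example; the function names are made up for this example.

```python
import math


def sturges_classes(n):
    """Sturges' rule: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def sqrt_classes(n):
    """Rule of thumb: number of classes is roughly sqrt(number of observations)."""
    return math.sqrt(n)


n = 40  # size of the golf course sample
k_sturges = sturges_classes(n)
k_sqrt = sqrt_classes(n)

print(f"Sturges: {k_sturges:.2f} -> use about {round(k_sturges)} classes")
print(f"Square root: {k_sqrt:.2f} -> use about {round(k_sqrt)} classes")
```

The rules return non-integer values, so the result is rounded and then adjusted by eye, in line with the advice to try different values.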
Ex. Students were asked how many train trips
they did last week. x is the variable representing
the number of trips and the results are below.
x #people
0 12
1 42
2 57
3 24
4 9
5 4
6 2
Ex. Students were asked how many train trips
they did last week. x is the variable representing
the number of trips and the results are below.
Frequency Distribution
x #people Rel. Freq
0 12 0.08
1 42 0.28
2 57 0.38
3 24 0.16
4 9 0.06
5 4 0.03
6 2 0.01
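The relative frequencies in this example follow directly from the counts. A minimal sketch in Python, using the counts from the slide (the variable names are chosen here for illustration):

```python
# Number of students (value) reporting each number of train trips (key).
trip_counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

n = sum(trip_counts.values())  # number of observations in the data set
relative_frequency = {x: count / n for x, count in trip_counts.items()}

for x, rf in relative_frequency.items():
    print(f"{x}  {trip_counts[x]:>3}  {rf:.2f}")

# The relative frequencies of all values sum to 1 (up to rounding).
print(sum(relative_frequency.values()))
```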
Histograms
Trip results:
x Rel. Freq.
0 0.08
1 0.28
2 0.38
3 0.16
4 0.06
5 0.03
6 0.01
[Figure: histogram of relative frequency (0 to 0.4) against number of trips (0 to 6)]
Histograms with Continuous Data: Equal
Class Widths
• Determine the class size
• Determine the frequency and relative
frequency for each class.
• Mark the class boundaries on a
horizontal measurement axis. Above
each class interval, draw a rectangle
whose height is the relative frequency.
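The steps above can be sketched in Python for the golf course lengths from the stem-and-leaf example. The class boundaries (starting at 6400, width 100, nine classes, left-closed intervals) are assumptions made for this sketch, and `class_frequencies` is a hypothetical helper name.

```python
def class_frequencies(values, start, width, n_classes):
    """Count values in equal-width classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored here
            counts[index] += 1
    n = len(values)
    return [
        (start + i * width, start + (i + 1) * width, count, count / n)
        for i, count in enumerate(counts)
    ]


golf_lengths = [
    7051, 6470, 6526, 6527, 6583, 6694, 7209, 6614, 6790, 6770, 6700,
    6770, 6713, 6870, 6873, 6850, 6900, 6927, 6464, 6904, 7005, 6433,
    7280, 6890, 7131, 6605, 7169, 7168, 7105, 7113, 7165, 7011, 6506,
    7040, 6798, 7050, 6745, 7022, 6435, 6936,
]

classes = class_frequencies(golf_lengths, start=6400, width=100, n_classes=9)
for lower, upper, count, rel in classes:
    print(f"[{lower}, {upper})  {count:>2}  {rel:.3f}")
```

With these boundaries each class corresponds to one stem of the stem-and-leaf display, so the class counts can be checked against the number of leaves per stem.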
Guidelines: Number of classes
– More classes in larger data sets
– Between 5 and 20
– Rule of thumb: $\text{number of classes} \approx \sqrt{\text{number of observations}}$
– Classes are usually of equal length
Histogram Shapes
[Figure: example histogram shapes: symmetric, unimodal, bimodal, positively skewed, negatively skewed]
Histograms with Qualitative Data
• Histograms can be used even with qualitative data
• Example: rating of K-12 education in California (survey data)
Boxplots
• Graphical representation of dispersion, skewness, outliers and other
prominent features in data using quartiles
• Construction
• Order the n observations in a data set from smallest to largest
• Separate the smallest half and the largest half (the median $\tilde{x}$ is included in both
halves if n is odd)
• Find the lower fourth (median of the smallest half of the data)
• Find the upper fourth (median of the largest half of the data)
• Find the fourth spread fs (a measure of the spread, resistant to outliers):
fs = upper fourth – lower fourth
[Figure: boxplot labelled with min, lower fourth, median, upper fourth, max and the fourth spread fs]
Boxplot Example
Median = 90
upper fourth = 96.5
lower fourth = 72.5
[Figure: boxplot on a scale from 30 to 130 with the median, lower fourth and upper fourth marked]
Finding Outliers with Boxplots
Any observation farther than 1.5fs from
the closest fourth is an outlier.
An outlier is extreme if it is more than 3fs
from the nearest fourth, and it is mild
otherwise.
Boxplots and Outliers
[Figure: boxplot annotated with the median, lower fourth, upper fourth and fourth spread fs; mild outliers lie at least 1.5fs, and extreme outliers at least 3fs, beyond the nearest fourth]
Boxplot Example
n = 19 observations: 40, 52, 55, 60, 70, 75, 85, 85, 90, 90, 92, 94, 94, 95, 98, 100, 115, 125, 125
Median = 90 (the 10th of the 19 sorted values)
Outliers Example
n = 19 observations: 40, 52, 55, 60, 70, 75, 85, 85, 90, 90, 92, 94, 94, 95, 98, 100, 115, 125, 125
Median of smaller half: 72.5
Median of higher half: 96.5
fs = 96.5 – 72.5 = 24
1.5fs = 36
3.0fs = 72
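The fourths, the fourth spread and the outlier check for these 19 observations can be reproduced with a short Python sketch. It follows the definitions given on the boxplot slides (median included in both halves when n is odd; "farther than" read as a strict inequality); `fourths` and `classify_outliers` are illustrative names, not part of any library.

```python
from statistics import median


def fourths(values):
    """Return (lower fourth, upper fourth); the median is in both halves if n is odd."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half, counting the median when n is odd
    return median(data[:half]), median(data[n - half:])


def classify_outliers(values):
    """Return (fs, mild outliers, extreme outliers) using the 1.5fs and 3fs limits."""
    lower, upper = fourths(values)
    fs = upper - lower
    mild, extreme = [], []
    for v in values:
        distance = max(lower - v, v - upper, 0)  # distance beyond the nearest fourth
        if distance > 3 * fs:
            extreme.append(v)
        elif distance > 1.5 * fs:
            mild.append(v)
    return fs, mild, extreme


observations = [40, 52, 55, 60, 70, 75, 85, 85, 90, 90,
                92, 94, 94, 95, 98, 100, 115, 125, 125]

lower_fourth, upper_fourth = fourths(observations)
fs, mild, extreme = classify_outliers(observations)
print(lower_fourth, upper_fourth, fs, mild, extreme)
```

For this data set the smallest value (40) and the largest (125) both lie within 1.5fs = 36 of the nearest fourth, so both lists of outliers come back empty.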
Types of Boxplots
• Whiskers represent either
– the minimum and maximum of all of the data, or
– the lowest datum still within 1.5fs of the lower quartile, and the highest
datum still within 1.5fs of the upper quartile (Tukey boxplot)
[Figure: boxplot labelled with lower fourth, median, upper fourth, and the non-outlier minimum and maximum]
NUMERICAL REPRESENTATION OF DATA
Measures of location
Measures of variability and dispersion
Measures of Location
• Mean
• Median
• Mode
• Quartiles, percentiles
Summary Measures
• Location: mean, median, mode, quartile, geometric mean
• Variation: variance and standard deviation, coefficient of variation, range
Measures of Central Tendency
• Average
• Median
• Mode
• Geometric mean
Measures of Variation
• Variance (population variance, sample variance)
• Standard deviation (population standard deviation, sample standard deviation)
• Coefficient of variation
• Range
The Mean
The average (mean) of the numbers $x_1, x_2, \ldots, x_n$ is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Mean Value and Frequency Distributions
x #people
0 12
1 42
2 57
3 24
4 9
5 4
6 2
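For data given as a frequency distribution, the mean can be computed by weighting each value with its count. A minimal sketch for the table above (variable names are illustrative):

```python
# Number of students (value) reporting each number of train trips (key).
trip_counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

n = sum(trip_counts.values())
# Each value x occurs `count` times, so it contributes x * count to the total.
total_trips = sum(x * count for x, count in trip_counts.items())
mean_trips = total_trips / n

print(n, total_trips, round(mean_trips, 3))
```

This is the same as writing out all 150 individual answers and averaging them; the frequency table only shortens the sum.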
Median
The median is the middle value in a set of
data that is arranged in ascending order. For
an even number of data points the median
is the average of the middle two.
Example
Observed values (n = 21): 9.6, 16.1, 24.9, 20.4, 12.7, 21.2, 30.2, 25.8, 18.5, 10.3, 25.3, 14.0, 27.1, 45.0, 23.3, 24.2, 14.6, 8.9, 32.4, 11.8, 28.5
Sorted: 8.9, 9.6, 10.3, 11.8, 12.7, 14.0, 14.6, 16.1, 18.5, 20.4, 21.2, 23.3, 24.2, 24.9, 25.3, 25.8, 27.1, 28.5, 30.2, 32.4, 45.0
• Sample median = 21.2 (the 11th of the 21 sorted values)
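The sort-then-pick-the-middle procedure can be sketched in Python. The helper `sample_median` is written out here for illustration; the standard library function `statistics.median` is used only as a cross-check.

```python
from statistics import median


def sample_median(values):
    """Middle value of the sorted data; average of the middle two if n is even."""
    data = sorted(values)
    n = len(data)
    middle = n // 2
    if n % 2 == 1:
        return data[middle]
    return (data[middle - 1] + data[middle]) / 2


observed = [9.6, 16.1, 24.9, 20.4, 12.7, 21.2, 30.2, 25.8, 18.5, 10.3, 25.3,
            14.0, 27.1, 45.0, 23.3, 24.2, 14.6, 8.9, 32.4, 11.8, 28.5]

result = sample_median(observed)
print(result, median(observed))
```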
Mean vs. median: does it really matter?
Credit Suisse Global Wealth Report 2015
https://www.credit-suisse.com/media/mediarelease-assets/pdf/2015/10/gwr-2015-global-press-release.pdf
Which is larger?
• Can we say something about which would be larger? Median or
mean?
– Without any other information!
• Why?
• Mean will in general be larger, because it can be inflated by
outliers.
• However, this is not always true, see e.g. the discussion in
https://ww2.amstat.org/publications/jse/v13n2/vonhippel.html
Other Measures of Location
• Median divides data set into two parts of equal
size
• Quartiles divide the data set into 4 equal parts
• Percentiles divide the data set into even finer
parts, e.g. 99%
Outliers
• Outliers: observations with extreme values
• Mean: sensitive to outliers
• Median: not
• Trimmed mean
• 10% trimmed: mean after eliminating the smallest 10% and largest
10% of values
Example: Impact of Outliers
Original data: 612, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 19299/20 = 965.0
• Median = 1009.5
Smallest value replaced by 0: 0, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 18687/20 = 934.35
• Median = 1009.5
Example: Impact of Outliers
Original data: 612, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 19299/20 = 965.0
• Median = 1009.5
Smallest value replaced by 0 and largest by 7201: 0, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 7201
• Mean = 24687/20 = 1234.4
• Median = 1009.5
Example: Trimmed Mean
Original data: 612, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201
• Mean = 19299/20 = 965.0
• Median = 1009.5
With outliers: 0, 623, 666, 744, 883, 898, 964, 970, 983, 1003, 1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 7201
• 10% Trimmed Mean = 15666/16 = 979.12 (the two smallest and the two largest values are dropped)
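The outlier examples can be re-computed with a few lines of Python. This is a sketch: `trimmed_mean` is a helper written for this illustration (it drops the same number of values at each end, rounding the number down), and the two lists are the data sets shown on the slides.

```python
from statistics import mean, median


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values at each end."""
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values dropped at each end
    return mean(data[k:len(data) - k])


original = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
            1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]
# Same data with the smallest value replaced by 0 and the largest by 7201.
with_outliers = [0] + original[1:-1] + [7201]

for name, data in [("original", original), ("with outliers", with_outliers)]:
    print(name, mean(data), median(data), trimmed_mean(data, 0.10))
```

The mean moves from about 965 to about 1234 when the two outliers are introduced, while the median and the 10% trimmed mean stay the same for both data sets.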
Measures of Variability
• Range
• Variance (population, sample)
• Standard deviation (population, sample)
• Coefficient of variation
• Boxplots (graphical representation)
Measures of Variability
• Range
– Let xmin be the minimum value in the sample, and xmax the
maximum. The range is defined as:
range = xmax - xmin
• Deviation from the mean
– Let xi be a value in the sample. Its deviation is defined as:
deviation = mean - xi
[Figure: the values 1 to 7 on a number line, with one deviation from the mean marked]
Sample Variance
• Variance is a measure of the spread of the data
• The sample variance of the sample x1, x2, …, xn of n values of X is
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
where $\bar{x}$ is the sample mean
• $s^2$ has n – 1 degrees of freedom*
*This makes the sample variance an unbiased estimator of the population variance. For more information, see the appendix.
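The definition of the sample variance (sum of squared deviations from the sample mean, divided by n – 1) can be compared with R's built-in `var()`. This sketch reuses the 20 values from the outlier example; the names `s2_manual` and `s2_builtin` are illustrative.

```r
x <- c(612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
       1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201)
n <- length(x)

# Sample variance from the definition: squared deviations divided by n - 1
s2_manual <- sum((x - mean(x))^2) / (n - 1)

# Base R's var() uses the same n - 1 divisor
s2_builtin <- var(x)

all.equal(s2_manual, s2_builtin)   # TRUE
sqrt(s2_manual)                    # sample standard deviation, same as sd(x)
```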
Standard Deviation
Standard deviation is a measure of the spread of the data using the same units as the data.
The sample standard deviation is the square root of the sample variance:
$s = \sqrt{s^2}$
Example
• Mean =
• Variance =
[Figure: bar chart of number of holders (0 to 60) by number of credit cards (0 to 6); overlaid label: Number of trips]
x (number of credit cards)   # (number of holders)
0    8
1   15
2   26
3   53
4   25
5   16
6    7
Example
• Mean = 2.993
• Variance =
[Figure: bar chart of number of holders (19 to 23) by number of cards (0 to 6); overlaid label: Number of trips]
x (number of cards)   # (number of holders)
0   21
1   22
2   22
3   21
4   21
5   22
6   21
Example
First data set (peaked distribution):
• Mean = 2.987
• Variance = 2.094
• Standard deviation = 1.447
[Figure: bar chart of number of holders (0 to 60) by number of credit cards (0 to 6); overlaid label: Number of trips]
Second data set (nearly uniform distribution):
• Mean = 2.993
• Variance = 4.00
• Standard deviation = 2.00
[Figure: bar chart of number of holders (19 to 23) by number of cards (0 to 6); overlaid label: Number of trips]
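Summary statistics for data given as a frequency table can be obtained by expanding the table into raw observations. The sketch below uses the two tables as transcribed from the example slides; the helper `summarise_freq()` and the names `peaked` and `flat` are hypothetical, introduced only for illustration.

```r
cards  <- 0:6
peaked <- c(8, 15, 26, 53, 25, 16, 7)     # holders per value, first table
flat   <- c(21, 22, 22, 21, 21, 22, 21)   # holders per value, second table

# Expand a frequency table into raw observations, then use mean(), var(), sd()
summarise_freq <- function(values, counts) {
  obs <- rep(values, times = counts)
  c(n = length(obs), mean = mean(obs), variance = var(obs), sd = sd(obs))
}

round(summarise_freq(cards, peaked), 3)
round(summarise_freq(cards, flat), 3)
```

Both means agree with the slides (2.987 and 2.993). The variances computed from the tables as transcribed come out near 2.08 and 4.01, close to but not identical with the 2.094 and 4.00 quoted on the slide, so the transcribed counts may differ slightly from the data behind the original figures. Either way, the two data sets share almost the same mean while the flatter one has about twice the variance.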
Comparing Standard Deviations
Data A: Mean = 15.5, s =
Data B: Mean = 15.5, s =
Data C: Mean = 15.5, s =
[Figure: three dot plots (Data A, Data B, Data C), each on an axis from 11 to 21]
Properties of Standard Deviations
Add 1 to all values. What happens to the mean? To the standard deviation?
[Figure: values on a number line from 1 to 7, before and after the shift]
Multiply all values by 2. What happens to the mean? To the standard deviation?
[Figure: values on a number line from 1 to 7, and on an axis extended to 9 after scaling]
Properties of $s^2$
Let x1, x2, …, xn be any sample and c be any nonzero constant.
1. If $y_1 = x_1 + c, \dots, y_n = x_n + c$, then $s_y^2 = s_x^2$
2. If $y_1 = c x_1, \dots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$, so that $s_y = |c| \, s_x$
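Both properties can be verified numerically. The sketch below uses the values 1 to 7 as an illustrative sample, with a shift of 1 and a scale factor of 2 as in the preceding questions; `c_shift` and `c_scale` are illustrative names.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)   # illustrative sample
c_shift <- 1
c_scale <- 2

y_shift <- x + c_shift
y_scale <- c_scale * x

# Shifting moves the mean by c but leaves the variance unchanged
c(mean(x), mean(y_shift))                        # 4 5
all.equal(var(x), var(y_shift))                  # TRUE

# Scaling multiplies the variance by c^2 and the standard deviation by |c|
all.equal(var(y_scale), c_scale^2 * var(x))      # TRUE
all.equal(sd(y_scale), abs(c_scale) * sd(x))     # TRUE
```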
Population Variance
• Finite population of size N
• Variance $\sigma^2$:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• μ: population mean
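Base R has no dedicated function for the population variance, since `var()` always divides by n – 1. A minimal sketch, treating the values 1 to 7 as a complete (hypothetical) finite population:

```r
# Treat these N values as an entire finite population (illustrative assumption)
pop <- c(1, 2, 3, 4, 5, 6, 7)
N   <- length(pop)
mu  <- mean(pop)

# Population variance: average squared deviation from mu (divisor N)
sigma2 <- sum((pop - mu)^2) / N              # 4

# var() divides by N - 1, so it has to be rescaled to match
all.equal(sigma2, var(pop) * (N - 1) / N)    # TRUE
```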
Coefficient of Variation, CV
Measures variability (the standard deviation) relative to the mean:
$CV = \sigma / \mu$
When only a sample of data from a population is available, the population CV can be estimated using the ratio of the sample standard deviation $s$ to the sample mean $\bar{x}$:
$\widehat{CV} = s / \bar{x}$
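The sample coefficient of variation is the sample standard deviation divided by the sample mean. The sketch below reuses the 20 values from the outlier example; the factor of 1000 stands in for an arbitrary change of units.

```r
x <- c(612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
       1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201)

# Sample CV: standard deviation relative to the mean
cv <- sd(x) / mean(x)
cv

# The CV is unit-free: rescaling the data leaves it unchanged
all.equal(cv, sd(1000 * x) / mean(1000 * x))   # TRUE
```

Because it is a ratio, the CV is only meaningful for data measured on a scale with a true zero and a mean that is not close to zero.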
You may also hear of “moments”
• 1st moment – sample mean
• 2nd (central) moment – variance
• 3rd (standardized) moment – skewness
• 4th (standardized) moment – kurtosis
• …
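Base R provides no skewness or kurtosis function, but both follow directly from the central moments. The sketch below uses the divisor n for every moment, which is one common convention (other definitions apply small-sample corrections), and reuses the 20 values from the outlier example.

```r
x <- c(612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
       1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201)
m <- mean(x)

# Central moments with divisor n
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

skewness <- m3 / m2^1.5   # standardized 3rd moment; negative here (longer lower tail)
kurtosis <- m4 / m2^2     # standardized 4th moment; equals 3 for a normal distribution

c(skewness = skewness, kurtosis = kurtosis)
```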
DOS AND DON’TS OF VISUALIZATION
[Figure: scatterplot matrix of RMSN, v, v_front, D_front and density_all, with pairwise correlations. Source: Papathanasopoulou and Antoniou (2016)]
Corr       v        v_front   D_front   density_all
RMSN       −0.599   −0.591    −0.412    0.141
v                   0.898     0.659     −0.341
v_front                       0.689     −0.336
D_front                                 −0.0127
Source: Tyrinopoulos and Antoniou (2014)
Parallel coordinates
Visualizing Clusters with Parallel Coordinates
Source: http://vis.pku.edu.cn/wiki/project/hdvis
Heatmaps – reordering the rows
Hierarchical clustering dendrogram
Heatmap vs. Parallel Coordinates (for the Same Data)
Source: Gehlenborg, Nils, and Bang Wong. "Points of view: Heat maps." Nature Methods 9.3 (2012): 213-213.
Treemap
Source: http://www.nytimes.com/packages/html/newsgraphics/2011/0119-budget/
Bad visualization examples (“chartjunk”)
What could we have done instead?
[Figure: a dot plot and a bar chart of "Degree of doing it" for the categories "about to do it", "almost doing it", "frenching (which will lead to doing it)" and "totally doing it"]
And how did we do it?
library(ggplot2)
# Values and category labels ("\n" wraps the axis labels onto several lines)
xx <- c(5, 18, 20, 57)
x1 <- c("almost\ndoing it", "frenching\n(which will\nlead to\ndoing it)",
        "totally\ndoing it", "about to\ndo it")
# Dot plot (qplot() is deprecated as of ggplot2 3.4.0 but still available)
qplot(x = x1, y = xx, xlab = "", ylab = "Degree of doing it")
ggsave("dotexample1.pdf", width = 6, height = 4)
# Same data with bars added
qplot(x = x1, y = xx, xlab = "", ylab = "Degree of doing it") +
  geom_bar(stat = "identity")
ggsave("barexample1.pdf", width = 6, height = 4)
What is the (main) problem with this figure?
Source: http://andrewgelman.com/2015/06/22/hey-whats-up-with-that-x-axis/
Fixed, shows a totally different trend
Source: http://andrewgelman.com/2015/06/22/hey-whats-up-with-that-x-axis/
Source: http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00040Z
Which one is better? (and what does better mean?)
Source: https://speakingpowerpoint.files.wordpress.com/2011/05/chartjunk-stockings1.jpg
What about these?
Source: http://hci.usask.ca/uploads/173-pap0297-bateman.pdf
And these?
Source: http://hci.usask.ca/uploads/173-pap0297-bateman.pdf
Is there something there?
Source: http://junkcharts.typepad.com
Source: http://www.eea.europa.eu/data-and-maps/daviz/learn-more/chart-dos-and-donts
Why not use pies?
Source: http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 23)
Why not use pies?
Source: http://www.eea.europa.eu/data-and-maps/daviz/learn-more/chart-dos-and-donts
Sort your data
Source: http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 27)
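The "sort your data" advice can be sketched with base R graphics: sort the values first, then plot. The category names and counts below are hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical category totals, for illustration only
trips <- c(Bus = 18, Car = 57, Bike = 5, Walk = 20)

# Sort from largest to smallest so the bars can be compared at a glance
trips_sorted <- sort(trips, decreasing = TRUE)
names(trips_sorted)   # "Car" "Walk" "Bus" "Bike"

barplot(trips_sorted, ylab = "Number of trips", las = 1)
```

Sorting by value is a sensible default for unordered categories. Categories with a natural order (age groups, months) are usually better left in that order.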
Avoid unnecessary graphic features
Source: http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 29)
Tufte’s principles
• Keep a high data-ink ratio
• Remove chart junk
• Give graphical elements multiple functions
• Keep in mind the data density
• The term to search for is Information Visualization
Good Practices
• Label clearly
– Title: general subject
– Label all variables
– Specify units
• Identify the source of the data
• Date the data
Some of the things to keep in mind
• Figures may (will) be printed in black and white and/or photocopied
• More people are colorblind than we often think!
• Projectors are not as bright as your screen
• …
• Less is more!
References (figures)
• http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf
• http://www.eea.europa.eu/data-and-maps/daviz/learn-more/chart-dos-and-donts
APPENDIX
Motivation for n – 1 in $s^2$
• To explain the rationale for the divisor n – 1 in $s^2$, note first that whereas $s^2$ measures sample variability, there is a measure of variability in the population called the population variance.
• We will use $\sigma^2$ (the square of the lowercase Greek letter sigma) to denote the population variance and $\sigma$ to denote the population standard deviation (the square root of $\sigma^2$).
Motivation for $s^2$
• When the population is finite and consists of N values,
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
which is the average of all squared deviations from the population mean (for the population, the divisor is N and not N – 1).
• Just as $\bar{x}$ will be used to make inferences about the population mean µ, we should define the sample variance so that it can be used to make inferences about $\sigma^2$. Now note that $\sigma^2$ involves squared deviations about the population mean µ.
Motivation for $s^2$
• If we actually knew the value of µ, then we could define the sample variance as the average squared deviation of the sample $x_i$'s about µ.
• However, the value of µ is almost never known, so the sum of squared deviations about $\bar{x}$ must be used.
• But the $x_i$'s tend to be closer to their average $\bar{x}$ than to the population average µ, so to compensate for this the divisor n – 1 is used rather than n.
Motivation for $s^2$
• In other words, if we used a divisor n in the sample variance, then the resulting quantity would tend to underestimate $\sigma^2$ (produce estimated values that are too small on the average), whereas dividing by the slightly smaller n – 1 corrects this underestimation.
• It is customary to refer to $s^2$ as being based on n – 1 degrees of freedom (df). This terminology reflects the fact that although $s^2$ is based on the n quantities $x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}$, these sum to 0, so specifying the values of any n – 1 of the quantities determines the remaining value.
Motivation for $s^2$
• For example, if n = 4 and three of the deviations $x_i - \bar{x}$ are given, then the fourth follows automatically from the sum-to-zero constraint, so only three of the four values of $x_i - \bar{x}$ are freely determined (3 df).
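The underestimation caused by the divisor n can be seen in a small simulation. This is a sketch under stated assumptions, not part of the original slides: samples of size 5 are drawn from a normal population with variance 4, and the seed, sample size, mean and number of replicates are all arbitrary choices.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n      <- 5      # a small sample size makes the bias easy to see
sigma2 <- 4      # true variance of the simulated population (assumed)
reps   <- 10000

# For each replicate, draw a sample and compute both variance estimates
est <- replicate(reps, {
  x  <- rnorm(n, mean = 10, sd = sqrt(sigma2))
  ss <- sum((x - mean(x))^2)
  c(divisor_n = ss / n, divisor_n_minus_1 = ss / (n - 1))
})

# Average of each estimator over all replicates
rowMeans(est)
```

On average the divisor-n estimate lands near sigma2 * (n - 1) / n = 3.2, while the divisor-(n – 1) estimate lands near the true value 4. The gap shrinks as n grows.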

  • 13. Big Data – the three (four, five, …) V’s • Volume: – Increasingly massive datasets hard to manage – Large Hadron Collider experiment, 150 million sensors delivering data 40 million times per second. • Variety: – Data complexity is growing – More types of data captured than ever before, quantification of self etc. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 12
  • 14. Big Data – the three (four, five, …) V’s • Velocity: – Some data is arriving so rapidly it must be either processed instantly or lost – Whole subfield of ‘streaming data’ • Veracity? • Value? • … 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 13
  • 15. Impact of Big Data Big Data promises to revolutionize numerous areas: • Big science: – Personalized genomics – Meteorology • Entertainment: – Netflix recommender system, $1,000,000 challenge to improve system – Hit show ‘House of Cards’ designed based on analysis • …
  • 17. Machine Learning • Big Data sets are too large for a human to analyze • They require computers that can learn the structure and patterns in the data to extract meaningful insights and applications • Machine learning and Big Data are inextricably linked • ML is hard to define: it contains elements of Artificial Intelligence, Statistics, Computer Science, Control Theory, Engineering
  • 18. So what? • How can data help plan, manage and operate transportation systems? 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 17
  • 19. Skills needed in data science [National Institute of Standards (NIST)] Source: NIST Big Data. "Draft NIST Big Data Interoperability Framework, Volume 1", 2014. http://docplayer.net/7239072-Draft-nist-big-data-interoperability-framework-volume-1-definitions.html. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 18
  • 20. Data Science Source: https://en.wikipedia.org/wiki/Data_science 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 19
  • 21. Data Science 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 20
  • 22. CRISP-DM Process Model for Data Mining Source:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.198.5133&rep=rep1&type=pdf 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 21
  • 23. Source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.198.5133&rep=rep1&type=pdf CRISP-DM Tasks and their Outputs 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 22
  • 24. The Human Centered KDD Process and the SEMMA Methodological Steps Source: Mariscal, Gonzalo, Oscar Marban, and Covadonga Fernandez. "A survey of data mining and knowledge discovery process models and methodologies."The Knowledge Engineering Review 25.02 (2010): 137-166 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 23
  • 25. The SEMMA Model Development Process Source: http://www.sas.com/content/dam/SAS/en_gb/doc/other1/events/sasforum/slides/manchester-day2/I.%20Brown%20Data%20Exploration%20and%20Visualisation%20in%20SAS%20EM_IB.pdf
  • 26. Guide to Analytic Selection (Booz Allen & Hamilton) Source: http://www.boozallen.com/insights/2015/12/data-science-field-guide-second-edition. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 25
  • 27. Degree of Intelligence in Data Analytics Source: Adapted from: Davenport, Thomas H., and Jeanne G. Harris. Competing on analytics: The new science of winning. Harvard Business Press, 2007 Analysis 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 26
  • 28. Data preparation • Our data has to follow our assumptions for x and y • All sorts of little tasks – Parse datasets – Convert value types (e.g. numeric to nominal) – Eliminate errors, (useless) outliers – Obtain intermediate values (e.g. $x_{n+1} = f(x_1, x_2)$) • Descriptive statistics • This is where we spend MOST of the time! Some people say 90%...
  • 29. Data analysts - the bad news 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 28
  • 30. Data analysts - the “good” news 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 29
  • 31. PROBABILITY VS. STATISTICS 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 30
  • 32. Probability • Probability is a numerical measure of the likelihood that an event will occur • Probability values are always assigned on a scale from 0 to 1 • A probability near 0 indicates an event is very unlikely to occur • A probability near 1 indicates an event is almost certain to occur [Figure: probability scale from 0 to 1 with increasing likelihood of occurrence; at 0.5 the occurrence of the event is just as likely as it is unlikely]
  • 33. Humans are very bad at understanding probabilities [Figure: chart with labelled events: Great Depression (two labels); Black Monday, 1987.10.19, -22.61%; Financial Crisis, 2008.10.15, -7.87%; Financial Crisis, 2008.10.13 and 2008.10.28, +11.09% and +10.88%]
  • 34. We Fear Spectacular, Unlikely Events • “After 9/11, 1.4 million people changed their holiday travel plans to avoid flying. The vast majority chose to drive instead. • But driving is far more dangerous than flying, and the decision to switch caused roughly 1,000 additional auto fatalities, according to two separate analyses comparing traffic patterns in late 2001 to those the year before. • In other words, 1,000 people who chose to drive wouldn't have died had they flown instead.” https://www.psychologytoday.com/articles/200801/10-ways-we-get-the-odds-wrong 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 33
  • 35. We roll 6 (fair) dice Which of the following sequences is more likely to occur? • 1-1-1-1-1-1 • 1-2-3-4-5-6 • 4-3-6-5-4-2 • 2-3-6-4-5-1 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 34
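Each of the four sequences on the slide is one specific ordered outcome of six independent rolls of a fair die, so a short calculation can compare them directly. The sketch below is illustrative only; the function name `sequence_probability` is an assumption, not part of the lecture material.

```python
from fractions import Fraction

def sequence_probability(sequence, sides=6):
    """Probability of one specific ordered sequence of fair die rolls."""
    # Rolls are independent and every face has probability 1/sides.
    return Fraction(1, sides) ** len(sequence)

sequences = [
    (1, 1, 1, 1, 1, 1),
    (1, 2, 3, 4, 5, 6),
    (4, 3, 6, 5, 4, 2),
    (2, 3, 6, 4, 5, 1),
]
probabilities = [sequence_probability(s) for s in sequences]
for s, p in zip(sequences, probabilities):
    print(s, p)
```

All four sequences come out with the same probability, (1/6)^6, however "patterned" they look.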
  • 36. Gambler’s fallacy (Monte Carlo, 1913) • The gambler's fallacy, also known as the Monte Carlo fallacy or the fallacy of the maturity of chances, is the mistaken belief that, if something happens more frequently than normal during some period, it will happen less frequently in the future • In a game of roulette at the Monte Carlo Casino on August 18, 1913, when the ball fell in black 26 times in a row (an extremely uncommon occurrence, although no more or less common than any of the other 67,108,863 sequences of 26 red or black), gamblers lost millions of francs betting against black, reasoning incorrectly that the streak was causing an "imbalance" in the randomness of the wheel, and that it had to be followed by a long streak of red. https://en.wikipedia.org/wiki/Gambler%27s_fallacy 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 35
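The count of sequences quoted above can be checked with one line of arithmetic. This sketch counts only red/black outcomes, as the quoted passage does, and ignores the green zero; the variable names are illustrative.

```python
# Number of distinct red/black sequences over 26 spins
# (green zero ignored, following the quoted passage).
spins = 26
total_sequences = 2 ** spins

# "Any of the other ... sequences" means every sequence except the observed one.
other_sequences = total_sequences - 1

# Spins are independent, so a streak of black does not change
# the chance of black on the next spin.
print(total_sequences, other_sequences)
```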
  • 37. Statistics • The analysis of data for the purpose of reaching a decision or gaining insight into the behavior of many phenomena • Examples – Weather – Pollution/contamination – Soil condition – Traffic – Marketing – Design of facilities 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 36
  • 38. Probability and Statistics Probability is deductive Probability: Given the information in the box, what is in your hand? Statistics: Given the information in your hand, what is the box? Statistics is inductive 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 37
  • 39. Inferential Statistics Population Sample Probability Inferential Statistics 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 38
  • 40. Populations and Samples A population is a well-defined collection of objects or units of interest A subset of the population is a sample Sample 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 39
  • 41. Reasons for Drawing a Sample • Less time consuming • … • … Sample 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 40
  • 42. Reasons for Drawing a Sample • Less time consuming • Less costly to administer • Less cumbersome and more practical Sample Sample 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 41
  • 43. Types of Data A variable is discrete if its set of possible values constitute a finite set or an infinite sequence A variable is continuous if its set of possible values consists of an entire interval on a number line 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 42
  • 44. Statistics Descriptive Statistics – summary and description of collected data Inferential Statistics – generalizing from a sample to a population 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 43
  • 45. Presenting Data • For data to be useful, it must be summarized – Tables – Graphs and charts • Visualization 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 44
  • 46. Visualization • Floating car data in Stockholm • Public transportation passenger movements in London • Profiles of public transport users 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 45
  • 47. Visualization: Floating car data (FCD) in Stockholm 18/10/2022 46
  • 48. Visualization: Public Transport Use in London http://jaygordon.net 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 47
  • 49. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 48
  • 50. Visualisation – increased needs • Traditionally it was “easy” to look at the model inputs and outputs • Interpretation and analysis • To understand Big data we need a lot of work and the development of new strategies 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 49
  • 51. MBTA visualisation 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 50
  • 52. Simple multivariate visualisations 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 51
  • 53. Interesting multivariate visualisations 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 52
  • 54. Virtual/augmented reality CAVE (no need for glasses) • LRZ Virtual Reality and Visualisation Centre (V2C) • LRZ Holobench More accessible technologies • Oculus Rift, etc. • Upcoming versions will not require powerful computer 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 53
  • 55. CORRELATION VS. CAUSATION 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 54
  • 56. Correlation • The relationship between things that happen or change together (Merriam-Webster Dictionary) 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 55
  • 57. Correlation • The relationship between things that happen or change together (Merriam-Webster Dictionary) The Redskins moved to Washington, DC in 1937. Since then, there have been 19 presidential elections. In 17 of those, the following applied: "If the Redskins win their last home game before the election, the party that won the previous election wins the next election and that if the Redskins lose, the challenging party's candidate wins.“ Source: https://en.wikipedia.org/wiki/Redskins_Rule 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 56
  • 58. Correlation vs. Causality • Correlation – The relationship between things that happen or change together (Merriam-Webster Dictionary) • Causality – The relationship between something that happens or exists and the thing that causes it 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 57
  • 59. Spurious regression examples (1) http://www.tylervigen.com/spurious-correlations US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation [Figure: "Hanging suicides" and "US spending on science" plotted by year, 1999 to 2009; tylervigen.com]
  • 60. Spurious regression examples (2) http://www.tylervigen.com/spurious-correlations Number of people who drowned by falling into a pool correlates with Films Nicolas Cage appeared in [Figure: "Swimming pool drownings" and "Nicolas Cage" films plotted by year, 1999 to 2009; tylervigen.com]
  • 61. Spurious regression examples (3) http://www.tylervigen.com/spurious-correlations Per capita cheese consumption correlates with Number of people who died by becoming tangled in their bedsheets [Figure: "Bedsheet tanglings" and "Cheese consumed" plotted by year, 2000 to 2009; tylervigen.com]
  • 62. Spurious regression examples (4) http://www.tylervigen.com/spurious-correlations Age of Miss America correlates with Murders by steam, hot vapours and hot objects [Figure: "Murders by steam" and "Age of Miss America" plotted by year, 1999 to 2009; tylervigen.com]
  • 63. Spurious regression examples (5) http://www.tylervigen.com/spurious-correlations Letters in Winning Word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders [Figure: "Number of people killed by venomous spiders" and "Spelling Bee winning word" plotted by year, 1999 to 2009; tylervigen.com]
  • 64. Bradford-Hill criteria • The Bradford Hill criteria, otherwise known as Hill's criteria for causation, are a group of minimal conditions necessary to provide adequate evidence of a causal relationship between an incidence and a possible consequence, established by the English epidemiologist Sir Austin Bradford Hill (1897–1991) in 1965 • Strength • Consistency • Specificity • Temporality • Biological gradient • Plausibility • Coherence • Experiment • Analogy https://en.wikipedia.org/wiki/Bradford_Hill_criteria 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 63
  • 65. GRAPHICAL DATA REPRESENTATION 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 64
  • 66. Graphical Data Representation • Stem-and-leaf displays • Dotplots • Histograms 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 65
  • 67. Stem-and-Leaf Example Observed values: 9, 11, 10, 15, 22, 9, 15, 16, 24
  • 68. Stem-and-Leaf Example • Select one or more leading digits for the stem values. The trailing digits become the leaves • List stem values in a vertical column • Record the leaf for every observation • Indicate the units for the stem and leaf on the display • Displays with between 5 and 20 stems are recommended
  • 69. Stem-and-Leaf Example Observed values: 9, 11, 10, 15, 22, 9, 15, 16, 24 Stem: tens digit Leaf: units digit
      0 | 9 9
      1 | 0 1 5 5 6
      2 | 2 4
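The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the stem is the tens digit and the leaf the units digit as on the slide; the function name `stem_and_leaf` and the `leaf_unit` parameter are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Return display lines; stem = value // leaf_unit, leaf = value % leaf_unit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // leaf_unit].append(v % leaf_unit)
    lines = []
    # Walk every stem in range so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

observed = [9, 11, 10, 15, 22, 9, 15, 16, 24]
display = stem_and_leaf(observed)
print("\n".join(display))
```

Leaves are printed in ascending order within each stem, which is a common convention but not required by the steps on the slide.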
  • 70. Stem-and-Leaf Example • 40 golf courses were sampled for their lengths. The range was from 6433 to 7280 • 7051 6470 6526 6527 6583 6694 7209 6614 6790 6770 6700 6770 6713 6870 6873 6850 6900 6927 6464 6904 7005 6433 7280 6890 7131 6605 7169 7168 7105 7113 7165 7011 6506 7040 6798 7050 6745 7022 6435 6936 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 69
  • 71. Stem-and-Leaf Example 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 70
  • 72. Stem-and-Leaf Example
      64 | 35 64 33 70
      65 | 26 27 06 83
      66 | 05 94 14
      67 | 90 70 00 98 70 45 13
      68 | 90 70 73 50
      69 | 00 27 36 04
      70 | 51 05 11 40 50 22
      71 | 31 69 68 05 13 65
      72 | 80 09
  • 73. • Identify typical value • Extent of spread about a value • Presence of gaps • Extent of symmetry • Number and location of peaks • Presence of outlying values Stem-and-Leaf Displays 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 72
  • 74. Dotplots Observed values: 9, 10, 15, 22, 9, 15, 16, 24, 11 • Represent data with dots above horizontal measurement scale • Attractive for small data sets and few distinct data values
  • 75. Dotplots Observed values: 9, 10, 15, 22, 9, 15, 16, 24, 11 [Figure: dotplot on a horizontal scale from 5 to 25] • Represent data with dots above horizontal measurement scale • Attractive for small data sets and few distinct data values
  • 76. Histograms: Discrete Data • Determine the frequency and relative frequency for each value of x. • Mark possible x values on a horizontal scale. Above each value, draw a rectangle whose height is the (relative) frequency of that value. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 75
  • 77. Histograms: Discrete Variables • Frequency of a value – The number of times the value occurs in the data set • Relative frequency – The proportion of times the value occurs • Sum of relative frequencies of all values in sample is 1 $\text{relative frequency of a value} = \dfrac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$
  • 78. Choosing a suitable bin size https://statistics.laerd.com/statistical-guides/understanding-histograms.php 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 77
  • 80. Histograms • Histograms are sensitive to number of intervals – Try different values – Sturges’ rule – Scott’s rule – Rule of thumb • Formulas shown, in order: $k = 1 + 3.322 \log n$; $k = 1.667\, n^{0.333}$; $\text{number of classes} \approx \sqrt{\text{number of observations}}$ [Figure: histogram sketch with axes X (interval width) and frequency]
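To get a feel for how such rules behave, the sketch below evaluates Sturges' rule and the rule of thumb for a given sample size. Two assumptions are made here: the logarithm in Sturges' rule is base 10 (which is what the factor 3.322 implies), and the rule of thumb is read as the usual square-root rule. Rounding up to a whole number of classes is also a choice made for this example, and the function names are illustrative.

```python
import math

def sturges_classes(n):
    """Sturges' rule: k = 1 + 3.322 * log10(n), rounded up (rounding is an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def sqrt_classes(n):
    """Rule of thumb: number of classes is about sqrt(n), rounded up (assumption)."""
    return math.ceil(math.sqrt(n))

n = 40  # e.g. the 40 golf course lengths from the stem-and-leaf example
k_sturges = sturges_classes(n)
k_sqrt = sqrt_classes(n)
print(k_sturges, k_sqrt)
```

Since histograms are sensitive to the number of intervals, these values are best treated as starting points to vary, not as fixed answers.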
  • 81. Ex. Students were asked how many train trips they did last week. x is the variable representing the number of trips and the results are below.
      x   #people
      0   12
      1   42
      2   57
      3   24
      4   9
      5   4
      6   2
  • 82. Frequency Distribution Ex. Students were asked how many train trips they did last week. x is the variable representing the number of trips and the results are below.
      x   #people   Rel. Freq
      0   12        0.08
      1   42        0.28
      2   57        0.38
      3   24        0.16
      4   9         0.06
      5   4         0.03
      6   2         0.01
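The relative frequencies in this table follow directly from the counts. A minimal sketch, with illustrative variable names, reproduces them:

```python
# Number of students reporting each number of train trips (from the slide).
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

n = sum(counts.values())  # total number of observations

# Relative frequency = times the value occurs / number of observations.
rel_freq = {x: c / n for x, c in counts.items()}

for x, r in rel_freq.items():
    print(x, round(r, 2))
```

The relative frequencies sum to 1, as stated on the earlier slide.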
  • 83. Histograms Trip results: [Figure: histogram of relative frequency (0 to 0.4) against number of trips (0 to 6)]
      x   Rel. Freq.
      0   0.08
      1   0.28
      2   0.38
      3   0.16
      4   0.06
      5   0.03
      6   0.01
  • 84. Histograms with Continuous Data: Equal Class Widths • Determine the class size • Determine the frequency and relative frequency for each class. • Mark the class boundaries on a horizontal measurement axis. Above each class interval, draw a rectangle whose height is the relative frequency. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 83
  • 85. Guidelines: Number of classes – More classes in larger data sets – Between 5 and 20 – Rule of thumb: $\text{number of classes} \approx \sqrt{\text{number of observations}}$ – Classes of usually equal length
  • 86. Histogram Shapes symmetric unimodal bimodal positively skewed negatively skewed 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 85
  • 87. Histograms with Qualitative Data • Histograms can be used even with qualitative data • Example: rating of K-12 education in California (survey data) 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 86
  • 88. Boxplots • Graphical representation of dispersion, skewness, outliers and other prominent features in data using quartiles • Construction • Order the n observations in a data set from smallest to largest • Separate the smallest half and the largest half (the median $\tilde{x}$ is included in both halves if n is odd) • Find the lower fourth (median of the smallest half of the data) • Find the upper fourth (median of the largest half of the data) • Find the fourth spread fs, a measure of the spread that is resistant to outliers: fs = upper fourth – lower fourth [Figure: boxplot annotated with min, lower fourth, median, upper fourth, max and fs]
  • 89. Boxplot Example Median = 90 upper fourth = 96.5 lower fourth = 72.5 [Figure: boxplot on a scale from 30 to 130, marking the lower fourth, median and upper fourth]
  • 90. Finding Outliers with Boxplots Any observation farther than 1.5fs from the closest fourth is an outlier. An outlier is extreme if it is more than 3fs from the nearest fourth, and it is mild otherwise. 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 89
  • 91. Boxplots and Outliers median extreme outliers upper fourth lower fourth mild outliers fs ≥1.5fs ≥ 3fs 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 90
  • 92. Boxplot Example Data (n = 19, median marked): 40 52 55 60 70 75 85 85 90 90 92 94 94 95 98 100 115 125 125
  • 93. Outliers Example Data (n = 19, median marked): 40 52 55 60 70 75 85 85 90 90 92 94 94 95 98 100 115 125 125 Median of smaller half: 72.5 Median of higher half: 96.5 fs = 1.5fs = 3.0fs =
  • 94. Outliers Example Data (n = 19, median marked): 40 52 55 60 70 75 85 85 90 90 92 94 94 95 98 100 115 125 125 Median of smaller half: 72.5 Median of higher half: 96.5 fs = 96.5 – 72.5 = 24 1.5fs = 36 3.0fs = 72
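The fourths and the outlier rule from the preceding slides can be sketched as follows. This is an illustration under the slide's definitions (the median belongs to both halves when n is odd; mild outliers lie more than 1.5fs and extreme outliers more than 3fs from the closest fourth). The function names are hypothetical.

```python
from statistics import median

def fourths(values):
    """Lower and upper fourths; the median is in both halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half, including the median if n is odd
    return median(data[:half]), median(data[n - half:])

def classify_outliers(values):
    """Return the fourth spread plus lists of mild and extreme outliers."""
    lower, upper = fourths(values)
    fs = upper - lower
    mild, extreme = [], []
    for v in values:
        distance = max(lower - v, v - upper)  # distance beyond the closest fourth
        if distance > 3 * fs:
            extreme.append(v)
        elif distance > 1.5 * fs:
            mild.append(v)
    return fs, mild, extreme

data = [40, 52, 55, 60, 70, 75, 85, 85, 90, 90,
        92, 94, 94, 95, 98, 100, 115, 125, 125]
lower, upper = fourths(data)
fs, mild, extreme = classify_outliers(data)
print(lower, upper, fs, mild, extreme)
```

For this data set the smallest value (40) is 32.5 below the lower fourth and the largest (125) is 28.5 above the upper fourth, both less than 1.5fs = 36, so no observation is flagged.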
  • 95. Types of Boxplots upper fourth lower fourth median Min (non outlier) Max (non outlier) • Whiskers representing • the minimum and maximum of all of the data, or • the lowest datum still within 1.5fs of the lower quartile, and the highest datum still within 1.5fs of the upper quartile (Tukey boxplot) 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 94
  • 96. NUMERICAL REPRESENTATION OF DATA Measures of location Measures of variability and dispersion 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 95
  • 97. Measures of Location Location Mean Median Mode Quartiles, Percentiles 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 96
  • 98. Summary Measures Location Mean Median Mode Quartile Geometric Mean Summary Measures Variation Variance and standard Deviation Coefficient of Variation Range 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 97
  • 99. Measures of Central Tendency Central Tendency Average Median Mode Geometric Mean 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 98
  • 100. Measures of Variation Variation Variance Standard deviation Coefficient of Variation Population Variance Sample Variance Population Standard Deviation Sample Standard Deviation Range 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 99
  • 101. The Mean The average (mean) of the numbers $x_1, x_2, \ldots, x_n$: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
  • 102. Mean Value and Frequency Distributions
      x   #people
      0   12
      1   42
      2   57
      3   24
      4   9
      5   4
      6   2
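When data come as a frequency distribution, the mean can be computed by weighting each value by its count instead of listing every observation. A minimal sketch using the trip counts from the table (variable names are illustrative):

```python
# Number of students reporting each number of train trips (from the slide).
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

n = sum(counts.values())                             # 150 observations
total_trips = sum(x * c for x, c in counts.items())  # sum of all observed values
mean_trips = total_trips / n

print(n, total_trips, round(mean_trips, 3))
```

This is the same mean as summing all 150 individual answers and dividing by 150.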
  • 103. Median The median is the middle value in a set of data that is arranged in ascending order. For an even number of data points, the median is the average of the middle two.
  • 104. Example Data: 9.6 16.1 24.9 20.4 12.7 21.2 30.2 25.8 18.5 10.3 25.3 14.0 27.1 45.0 23.3 24.2 14.6 8.9 32.4 11.8 28.5 Sorted: 8.9 9.6 10.3 11.8 12.7 14.0 14.6 16.1 18.5 20.4 21.2 23.3 24.2 24.9 25.3 25.8 27.1 28.5 30.2 32.4 45.0 • Sample median = 21.2
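The sort-then-pick-the-middle procedure for this example can be sketched with the standard library. The data are the 21 values from the slide; the variable names are illustrative.

```python
from statistics import median

data = [9.6, 16.1, 24.9, 20.4, 12.7, 21.2, 30.2, 25.8, 18.5, 10.3, 25.3,
        14.0, 27.1, 45.0, 23.3, 24.2, 14.6, 8.9, 32.4, 11.8, 28.5]

ordered = sorted(data)
# n = 21 is odd, so the median is the single middle (11th) value.
middle = ordered[len(ordered) // 2]

print(middle, median(data))
```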
  • 105. Mean vs. median: does it really matter? Credit Suisse Global Wealth Report 2015 https://www.credit-suisse.com/media/mediarelease-assets/pdf/2015/10/gwr-2015-global-press-release.pdf 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 104
  • 106. Mean vs. median: does it really matter? Credit Suisse Global Wealth Report 2015 https://www.credit-suisse.com/media/mediarelease-assets/pdf/2015/10/gwr-2015-global-press-release.pdf 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 105
  • 107. Which is larger? • Can we say something about which would be larger? Median or mean? – Without any other information! • Why? 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 106
  • 108. Which is larger? • Can we say something about which would be larger? Median or mean? – Without any other information! • Why? • The mean will in general be larger, because it can be inflated by large outliers (as in right-skewed data such as wealth). • However, this is not always true, see e.g. the discussion in https://ww2.amstat.org/publications/jse/v13n2/vonhippel.html
  • 109. Other Measures of Location • Median divides data set into two parts of equal size • Quartiles divide the data set into 4 equal parts • Percentiles divide the data set into even finer parts, e.g. 99% 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 108
  • 110. Outliers • Outliers: observations with extreme values • Mean: sensitive to outliers • Median: not • Trimmed mean • 10% trimmed: mean after eliminating the smallest 10% and largest 10% of values 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 109
  • 111. Example: Impact of Outliers
      Original data: 612 623 666 744 883 898 964 970 983 1003 1016 1022 1029 1058 1085 1088 1122 1135 1197 1201 • Mean = 19299/20 = 965.0 • Median = 1009.5
      Smallest value replaced by 0: 0 623 666 744 883 898 964 970 983 1003 1016 1022 1029 1058 1085 1088 1122 1135 1197 1201 • Mean = 18687/20 = 934.35 • Median = 1009.5
  • 112. Example: Impact of Outliers
      Original data: 612 623 666 744 883 898 964 970 983 1003 1016 1022 1029 1058 1085 1088 1122 1135 1197 1201 • Mean = 19299/20 = 965.0 • Median = 1009.5
      Smallest value replaced by 0 and largest by 7201: 0 623 666 744 883 898 964 970 983 1003 1016 1022 1029 1058 1085 1088 1122 1135 1197 7201 • Mean = 24687/20 = 1234.4 • Median = 1009.5
  • 113. Example: Trimmed Mean
      Original data: 612 623 666 744 883 898 964 970 983 1003 1016 1022 1029 1058 1085 1088 1122 1135 1197 1201 • Mean = 19299/20 = 965.0 • Median = 1009.5
      With outliers: 0 623 666 744 883 898 964 970 983 1003 1016 1022 1029 1058 1085 1088 1122 1135 1197 7201
      10% Trimmed Mean (the 2 smallest and 2 largest of the 20 values removed) = 15666/16 = 979.12
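The three location measures in these examples can be compared side by side. In this sketch `trimmed_mean` is a hypothetical helper written for the example (not a standard-library function); it drops the stated proportion of values from each end before averaging.

```python
from statistics import mean, median

def trimmed_mean(values, proportion=0.10):
    """Mean after removing the smallest and largest `proportion` of the values."""
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values dropped from each end
    return mean(data[k:len(data) - k])

original = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
            1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]
# Same data with the smallest value replaced by 0 and the largest by 7201.
with_outliers = [0] + original[1:-1] + [7201]

for name, data in [("original", original), ("with outliers", with_outliers)]:
    print(name, mean(data), median(data), trimmed_mean(data))
```

The mean moves noticeably once the two extreme values are introduced, while the median and the 10% trimmed mean are identical for both data sets.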
  • 114. Measures of Variability Variability Variance Standard Deviation Coefficient of Variation Population Sample Range Boxplots (graphical representation) 18/10/2022 Prof. Dr. Constantinos Antoniou | Applied Statistics in Transport 113
  • 115. Measures of Variability • Range – Let xmin be the minimum value in the sample and xmax the maximum. The range is defined as: range = xmax - xmin • Deviation from the mean – Let xi be a value in the sample. Its deviation is defined as: deviation = mean - xi [Figure: number line from 1 to 7 illustrating a deviation]
  • 116. Sample Variance • Variance is a measure of the spread of the data • The sample variance of the sample $x_1, x_2, \ldots, x_n$ of n values of X is $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ • $s^2$ has n – 1 degrees of freedom* *This makes this sample variance an unbiased estimator for the population variance. For more information, see appendix.
  • 117. Standard Deviation Standard deviation is a measure of the spread of the data using the same units as the data. The sample standard deviation is the square root of the sample variance: $s = \sqrt{s^2}$
  • 118. Example
      • Mean =
      • Variance =
    Number of trips
      x   #
      0   8
      1   15
      2   26
      3   53
      4   25
      5   16
      6   7
    [Figure: bar chart, x-axis "Number of credit cards" (0 to 6), y-axis "Number of holders" (0 to 60)]
  • 119. Example
      • Mean = 2.993
      • Variance =
    Number of trips
      x   #
      0   21
      1   22
      2   22
      3   21
      4   21
      5   22
      6   21
    [Figure: bar chart, x-axis "Number of cards" (0 to 6), y-axis "Number of holders" (19 to 23)]
  • 120. Example
    Data of slide 118 [bar chart: "Number of credit cards" vs. "Number of holders", y-axis 0 to 60]
      • Mean = 2.987
      • Variance = 2.094
      • Standard deviation = 1.447
    Data of slide 119 [bar chart: "Number of cards" vs. "Number of holders", y-axis 19 to 23]
      • Mean = 2.993
      • Variance = 4.00
      • Standard deviation = 2.00
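When data come as a frequency table, each value is weighted by its count. The sketch below uses the nearly uniform table from slide 119 and assumes the n – 1 divisor of the sample variance; the variable names are illustrative. It gives a mean of about 2.993, a variance of about 4.0 and a standard deviation of about 2.0, in line with the figures quoted for that data set.

```python
from math import sqrt

# Frequency table from slide 119: value x -> number of observations.
counts = {0: 21, 1: 22, 2: 22, 3: 21, 4: 21, 5: 22, 6: 21}

n = sum(counts.values())                                   # total observations
mean = sum(x * f for x, f in counts.items()) / n           # weighted mean
variance = sum(f * (x - mean) ** 2 for x, f in counts.items()) / (n - 1)
std_dev = sqrt(variance)

print(f"n = {n}, mean = {mean:.3f}, variance = {variance:.3f}, sd = {std_dev:.3f}")
```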
  • 121. Comparing Standard Deviations
      • Data A: Mean = 15.5, s =
      • Data B: Mean = 15.5, s =
      • Data C: Mean = 15.5, s =
    [Figure: three dot plots (Data A, Data B, Data C), each on an axis from 11 to 21]
  • 122. Properties of Standard Deviations
    Add 1 to all values. What happens to the mean? To the standard deviation?
    [Figure: dot plot with points labelled 1 to 7]
  • 123. Properties of Standard Deviations
    Add 1 to all values. What happens to the mean? To the standard deviation?
    [Figure: dot plot with points labelled 1 to 7, before and after the shift]
  • 124. Properties of Standard Deviations
    Multiply all values by 2. What happens to the mean? To the standard deviation?
    [Figure: dot plot with points labelled 1 to 7]
  • 125. Properties of Standard Deviations
    Multiply all values by 2. What happens to the mean? To the standard deviation?
    [Figure: dot plot with points labelled 1 to 7, before and after the scaling]
  • 126. Properties of $s^2$
    Let x_1, x_2, …, x_n be any sample and c be any nonzero constant.
      1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$
      2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$, $s_y = |c|\, s_x$
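Both properties can be checked numerically. This is a small sketch with illustrative data (the values 1 to 7) and an arbitrary constant c = 2; it relies only on the standard-library `statistics` module.

```python
from statistics import mean, stdev, variance

x = [1, 4, 3, 2, 7, 6, 5]   # illustrative sample
c = 2                        # arbitrary nonzero constant

shifted = [xi + 1 for xi in x]   # add 1 to all values
scaled = [c * xi for xi in x]    # multiply all values by c

# Shifting moves the mean but leaves the spread unchanged.
print(mean(x), mean(shifted), variance(x), variance(shifted))

# Scaling multiplies the mean by c, the variance by c^2 and the
# standard deviation by |c|.
print(mean(scaled), variance(scaled), stdev(scaled))
```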
  • 127. Population Variance
      • Finite population of size N
      • Variance $\sigma^2$:
        $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
      • μ: population mean
  • 128. Coefficient of Variation, CV
    Measures variability (the standard deviation) relative to the mean: $CV = \sigma / \mu$
    When only a sample of data from a population is available, the population CV can be estimated using the ratio of the sample standard deviation to the sample mean: $\widehat{CV} = s / \bar{x}$
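As a rough illustration, the sample CV can be computed as below. The helper name `coefficient_of_variation` is hypothetical, and the data are the 20 values from the earlier outlier example. Because numerator and denominator scale together, the CV does not depend on the unit of measurement.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return stdev(values) / mean(values)

data = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
        1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]

cv = coefficient_of_variation(data)
cv_rescaled = coefficient_of_variation([x / 60 for x in data])  # change of unit

print(f"CV = {cv:.4f}, CV after rescaling = {cv_rescaled:.4f}")
```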
  • 129. You may also hear of “moments”
      • 1st moment – sample mean
      • 2nd (central) moment – variance
      • 3rd (standardized) moment – skewness
      • 4th (standardized) moment – kurtosis
      • …
  • 130. DOS AND DON’TS OF VISUALIZATION
  • 131. [Figure: scatterplot matrix of RMSN, v, v_front, D_front and density_all with pairwise correlations. Source: Papathanasopoulou and Antoniou (2016)]
    Corr      v        v_front  D_front  density_all
    RMSN      −0.599   −0.591   −0.412   0.141
    v                  0.898    0.659    −0.341
    v_front                     0.689    −0.336
    D_front                              −0.0127
  • 132. Tyrinopoulos and Antoniou (2014)
  • 133. Parallel coordinates
  • 134. Visualizing Clusters with Parallel Coordinates. Source: http://vis.pku.edu.cn/wiki/project/hdvis
  • 136. Heatmaps – reordering the rows
  • 137. Hierarchical clustering dendrogram
  • 138. Heatmap vs. Parallel Coordinates (for the Same Data). Source: Gehlenborg, Nils, and Bang Wong. "Points of view: Heat maps." Nature Methods 9.3 (2012): 213-213.
  • 140. Bad visualization examples (“chartjunk”)
  • 141. What could we have done instead?
    [Figure: a dot plot and a bar chart of "Degree of doing it" for the categories "about to do it", "almost doing it", "frenching (which will lead to doing it)" and "totally doing it"]
  • 142. And how did we do it?
    library(ggplot2)
    # Values and category labels ("\n" breaks long labels across lines)
    xx = c(5, 18, 20, 57)
    x1 = c("almost\ndoing it", "frenching\n(which will\nlead to\ndoing it)", "totally\ndoing it", "about to\ndo it")
    # Dot plot
    qplot(x = x1, y = xx, xlab = "", ylab = "Degree of doing it")
    ggsave("dotexample1.pdf", width = 6, height = 4)
    # Bar chart of the same data
    qplot(x = x1, y = xx, xlab = "", ylab = "Degree of doing it") + geom_bar(stat = "identity")
    ggsave("barexample1.pdf", width = 6, height = 4)
  • 143. What is the (main) problem with this figure? http://andrewgelman.com/2015/06/22/hey-whats-up-with-that-x-axis/
  • 144. Fixed, shows a totally different trend. http://andrewgelman.com/2015/06/22/hey-whats-up-with-that-x-axis/
  • 147. Which one is better? (and what does better mean?) https://speakingpowerpoint.files.wordpress.com/2011/05/chartjunk-stockings1.jpg
  • 148. What about these? http://hci.usask.ca/uploads/173-pap0297-bateman.pdf
  • 149. And these? http://hci.usask.ca/uploads/173-pap0297-bateman.pdf
  • 150. Is there something there? http://junkcharts.typepad.com
  • 151. Is there something there? http://junkcharts.typepad.com
  • 153. Why not use pies? http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 23)
  • 154. Why not use pies? http://www.eea.europa.eu/data-and-maps/daviz/learn-more/chart-dos-and-donts
  • 155. Sort your data. http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 27)
  • 156. Avoid unnecessary graphic features. http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf (p. 29)
  • 157. Tufte’s principles
      • Keep a high data-ink ratio
      • Remove chart junk
      • Give graphical elements multiple functions
      • Keep in mind the data density
      • The term to search for is Information Visualization
  • 159. Good Practices
      • Clearly labeled
        – Title: general subject
        – Label all variables
        – Units should be specified
      • Identify the source of the data
      • Date the data
  • 160. Some of the things to keep in mind
      • Figures may (will) be printed in black and white and/or photocopied
      • More people are colorblind than we often think!
      • Projectors are not as bright as your screen
      • …
      • Less is more!
  • 161. References (figures)
      • http://www.unece.org/fileadmin/DAM/stats/documents/writing/MDM_Part2_English.pdf
      • http://www.eea.europa.eu/data-and-maps/daviz/learn-more/chart-dos-and-donts
  • 162. APPENDIX
  • 163. Motivation for n – 1 in $s^2$
      • To explain the rationale for the divisor n – 1 in $s^2$, note first that whereas $s^2$ measures sample variability, there is a measure of variability in the population called the population variance.
      • We will use $\sigma^2$ (the square of the lowercase Greek letter sigma) to denote the population variance and $\sigma$ to denote the population standard deviation (the square root of $\sigma^2$).
  • 164. Motivation for $s^2$
      • When the population is finite and consists of N values,
        $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,$$
        which is the average of all squared deviations from the population mean (for the population, the divisor is N and not N – 1).
      • Just as $\bar{x}$ will be used to make inferences about the population mean μ, we should define the sample variance so that it can be used to make inferences about $\sigma^2$. Now note that $\sigma^2$ involves squared deviations about the population mean μ.
  • 165. Motivation for $s^2$
      • If we actually knew the value of μ, then we could define the sample variance as the average squared deviation of the sample $x_i$'s about μ.
      • However, the value of μ is almost never known, so the sum of squared deviations about $\bar{x}$ must be used.
      • But the $x_i$'s tend to be closer to their average $\bar{x}$ than to the population average μ, so to compensate for this the divisor n – 1 is used rather than n.
  • 166. Motivation for $s^2$
      • In other words, if we used a divisor n in the sample variance, then the resulting quantity would tend to underestimate $\sigma^2$ (produce estimated values that are too small on the average), whereas dividing by the slightly smaller n – 1 corrects this underestimation.
      • It is customary to refer to $s^2$ as being based on n – 1 degrees of freedom (df). This terminology reflects the fact that although $s^2$ is based on the n quantities $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$, these sum to 0, so specifying the values of any n – 1 of the quantities determines the remaining value.
  • 167. Motivation for $s^2$
      • For example, if n = 4 and three of the deviations $x_i - \bar{x}$ are specified, then the fourth is automatically determined (the four deviations must sum to 0), so only three of the four values of $x_i - \bar{x}$ are freely determined (3 df).
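The underestimation argument can be made concrete with a small exhaustive check. The sketch below is an illustration under stated assumptions, not part of the lecture: it takes a hypothetical population of the values 1 to 7, enumerates every ordered sample of size n = 3 drawn with replacement (independent draws), and averages the two candidate estimators over all samples. Under these assumptions the n – 1 divisor recovers the population variance on average, while the divisor n falls short by the factor (n – 1)/n.

```python
from itertools import product

population = [1, 2, 3, 4, 5, 6, 7]   # hypothetical finite population
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # divisor N

def sum_sq_dev(sample):
    """Sum of squared deviations about the sample mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

n = 3
samples = list(product(population, repeat=n))   # all samples, with replacement

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(f"population variance:        {sigma2:.4f}")
print(f"average with divisor n:     {avg_div_n:.4f}")
print(f"average with divisor n - 1: {avg_div_n_minus_1:.4f}")

# Degrees of freedom: the n deviations about the sample mean always sum to 0.
sample = samples[10]
x_bar = sum(sample) / n
print(sum(x - x_bar for x in sample))
```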