The goal of this presentation is to share exemplars of important broadband Internet access performance phenomena. In particular, we highlight the critical role of stationarity.
Networks that are non-stationary are useless for most applications. We show real-world examples of both stationarity and non-stationarity, and discuss the implications for broadband stakeholders.
These phenomena are only visible when using state-of-the-art high-fidelity metrics and measures that capture instantaneous flow.
3. The purpose of this presentation
• Our goal is to share exemplars of important broadband Internet access
performance phenomena.
• These are for learning and training purposes only. They are not meant to
be exhaustive or even representative of Internet performance issues.
• In particular, we highlight the critical role of stationarity. Networks
that are non-stationary are useless for most applications.
• We show real-world examples of both stationarity and non-
stationarity, and discuss the implications for broadband stakeholders.
• These phenomena are only visible when using state-of-the-art
high-fidelity metrics and measures that capture instantaneous flow.
4. Stationarity: The most important
networking term you’ve never heard of
• All network and distributed application control protocols depend on
the statistical stability of the network, i.e. stationarity.
• ‘Stationarity’ is a standard term in statistics, see its definition here.
• When the past and future are strongly related, then the protocols can
successfully predict the future from the past and act ‘sensibly’.
• When the past and the future become unrelated, the protocols make
too many ‘bad guesses’.
• This non-stationarity causes congestion control systems and codecs
to act ‘stupidly’ and break down.
• There is no clever protocol or application ‘fix’ for non-stationarity.
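The dependence of adaptive protocols on stationarity can be illustrated with a small sketch. It is not taken from the presentation: it assumes an exponentially weighted moving average (EWMA) as a stand-in for the kind of estimator a protocol might use to predict delay from past samples, and every name and delay value in it is hypothetical.

```python
from statistics import fmean
import random

def ewma_prediction_errors(delays, alpha=0.125):
    """Predict each delay from an EWMA of earlier samples; return absolute errors."""
    estimate = delays[0]
    errors = []
    for observed in delays[1:]:
        errors.append(abs(observed - estimate))
        # Fold the new observation into the running estimate.
        estimate = (1 - alpha) * estimate + alpha * observed
    return errors

rng = random.Random(1)
# Hypothetical one-way delays in ms: a stable path with small jitter.
stationary = [20 + rng.uniform(-1, 1) for _ in range(200)]
# Same jitter, but the delay level flips between 20 ms and 120 ms every 25 samples.
non_stationary = [d + (100 if (i // 25) % 2 else 0)
                  for i, d in enumerate(stationary)]

err_stationary = fmean(ewma_prediction_errors(stationary))
err_non_stationary = fmean(ewma_prediction_errors(non_stationary))
print(f"mean prediction error, stationary:     {err_stationary:.2f} ms")
print(f"mean prediction error, non-stationary: {err_non_stationary:.2f} ms")
```

In the stationary case the past is a good guide to the next sample; after each level change the estimator spends many samples making 'bad guesses', which is the behaviour the bullets above describe.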
5. Initial example: Satellite in Asia
Stationary
Good for web browsing
Non-stationary
Poor for web browsing
Same satellite, same location, similar time, different services
6. Where do these examples of
Internet non-stationarity come from?
The world’s only network performance science company
www.pnsol.com
7. An important caveat
• Typical ‘best effort’ broadband Internet access services lack intentional
semantics for performance; that is to say, their performance is emergent.
• That means they legitimately can do anything! (And they often will!)
• Hence these phenomena are not (necessarily) ‘faults’, but are simply
“how the Internet works” (or doesn’t, as the case may be).
• These types of phenomena are widespread, and are the result of
systemic issues with network architecture, protocols and operation.
• As a consequence, no blame or negative publicity should be
attached to the specific countries, operators or bearer technologies
concerned.
8. Key terminology used: G, S and V
The components of packet delay
• G: Geographic delay
• S: Size of packet delay
• V: Variable delay due to load
[Figure: one-way delay against packet size]
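One way to separate the three components can be sketched as follows. The method is an assumption, not something stated on the slide: the minimum delay seen at each packet size is treated as the unloaded delay, a straight line is fitted to those minima (intercept taken as G, slope as S per byte), and whatever remains for each packet is attributed to V. The function name and the sample data are hypothetical.

```python
def decompose_gsv(samples):
    """samples: list of (packet_size_bytes, one_way_delay_ms).
    Returns (g_ms, s_ms_per_byte, v_ms_per_sample).
    Needs at least two distinct packet sizes."""
    # Lowest delay observed at each packet size (assumed to be load-free).
    minima = {}
    for size, delay in samples:
        minima[size] = min(delay, minima.get(size, delay))
    sizes = sorted(minima)
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(minima[x] for x in sizes) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (minima[x] - mean_y) for x in sizes)
    s = sxy / sxx              # size-dependent delay, ms per byte
    g = mean_y - s * mean_x    # size-independent delay, ms
    # V is whatever each packet experienced beyond G + S * size.
    v = [delay - (g + s * size) for size, delay in samples]
    return g, s, v

# Made-up measurements: two test packets at each of four sizes.
samples = [(100, 11.0), (100, 13.5), (500, 15.0), (500, 15.2),
           (1000, 20.0), (1000, 27.0), (1400, 24.0), (1400, 24.4)]
g, s, v = decompose_gsv(samples)
print(f"G = {g:.2f} ms, S = {s:.4f} ms/byte, largest V = {max(v):.2f} ms")
```

Fitting to the per-size minima rather than to all samples keeps load-related delay out of the estimates of G and S.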
11. Help me to understand…
…what it shows? …what it means? …what to do about it?
There are sudden big ‘jumps’
in the delay where the
upstream takes a ‘service
holiday’ and stops processing
packets for ~10 seconds. As
a result a queue builds up:
the diagonal slope is the
queue emptying. This ‘jump’
is non-stationarity.
We don’t know the reason,
as it could be the customer
premises equipment (CPE) or
the line. Our suspicion is that it is
the former, with the CPE
becoming preoccupied with
some task other than packet
processing. But it could be
DSLAM cross-talk between
copper lines affecting the
transmission protocol.
The service doesn’t have
intentional semantics, so this
is not strictly a ‘fault’. The
industry needs to gain the
operational capability to see
this happening. That means
adopting high-fidelity
metrics, and working
together to mature and scale
these state-of-the-art
measurement systems.
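The shape described on this slide (a sudden jump followed by a diagonal slope) can be reproduced with a toy model. This is a sketch under stated assumptions: a single first-in, first-out server with a fixed service time that simply stops for about 10 seconds. Only the stall length comes from the slide; the send interval, service time and function name are hypothetical.

```python
def simulate_service_holiday(duration_ms=30_000, send_interval_ms=100,
                             service_ms=20, stall_start_ms=5_000,
                             stall_length_ms=10_000):
    """Return a list of (send_time_ms, delay_ms), one entry per test packet."""
    stall_end_ms = stall_start_ms + stall_length_ms
    server_free_at = 0
    trace = []
    for sent in range(0, duration_ms, send_interval_ms):
        start = max(sent, server_free_at)
        # Simplification: service that would begin during the stall waits for its end.
        if stall_start_ms <= start < stall_end_ms:
            start = stall_end_ms
        server_free_at = start + service_ms
        trace.append((sent, server_free_at - sent))
    return trace

trace = simulate_service_holiday()
for sent, delay in trace[48:54] + trace[173:177]:
    print(f"sent at {sent / 1000:5.1f} s  delay {delay:6d} ms")
```

The first packet caught by the stall sees the full jump, and each later packet sees slightly less delay as the backlog is served faster than new packets arrive, which gives the diagonal.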
13. Help me to understand…
…what it shows? …what it means? …what to do about it?
Short bursts of ‘weirdness’ in
the upstream. These sudden
transitions are examples
of non-stationarity. This
general kind of “weird stuff
happens!” pattern shows up
in ~1% of the experiment
runs, across all technologies
we have tested.
A ‘speed test’ wouldn’t be
affected by this, but
interactive applications
would be affected.
This kind of anomaly is the
result of unpredictable
performance of TCP/IP and
current flow control
protocols and scheduling
algorithms.
This kind of data would be
lost in the usual averaged
performance metrics.
Operators and regulators
need to not only increase the
fidelity of their metrics, but
also need to be able to
isolate which direction the
issue is happening in.
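The point about averaged metrics can be shown with simple arithmetic. The numbers below are invented for illustration, and the 100 ms bound is a hypothetical application requirement rather than a figure from the presentation.

```python
import statistics

# Hypothetical upstream delays (ms): a quiet path with one brief burst.
delays = [20.0] * 495 + [400.0] * 10 + [20.0] * 495

mean_delay = statistics.fmean(delays)
median_delay = statistics.median(delays)
worst_delay = max(delays)
# Share of packets that miss a hypothetical 100 ms interactivity bound.
late_fraction = sum(d > 100.0 for d in delays) / len(delays)

print(f"mean {mean_delay:.1f} ms, median {median_delay:.1f} ms")
print(f"worst {worst_delay:.1f} ms, late fraction {late_fraction:.1%}")
```

The mean and median barely move, while the per-packet view shows a run of consecutive packets arriving far too late for an interactive application.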
15. Help me to understand…
…what it shows? …what it means? …what to do about it?
We see a burst of activity and
then a 100-second outage.
Obviously the outage affects
any application’s ability to
function. The transitions
from “working” to “not
working” are non-
stationarity that will cause
adaptive protocols to
malfunction.
When you buy “best effort”
this is what you get – no
lower bound on quality
whatsoever. You would have
no real grounds for
complaint that this service is
not what you paid for.
The industry needs to move
from a network-centric
viewpoint to a user-centric
one. What matters is
fitness-for-purpose, and the
metrics and measures need
to reflect the user
perspective.
17. Help me to understand…
…what it shows? …what it means? …what to do about it?
Here we see extreme levels
of downstream non-
stationarity. There is
no obvious structure. The
upstream and downstream
are very different (not
shown).
This is an ISP service whose
performance has essentially
collapsed. It is not usable for
most interactive
applications. However, a
speed test might well return
a perfectly acceptable result
(certainly in one direction,
and possibly both)!
The regulatory system needs
to differentiate between a
service that is available in the
network’s terms (packets
are flowing) and the user’s
terms (usable for desired
applications).
19. Help me to understand…
…what it shows? …what it means? …what to do about it?
This is a ‘classic’ pattern of
the network being
overdriven, and a queue
suddenly forming. The
‘upslope’ is very steep, and
the dissipation of the queue
in the ‘downslope’ is much
slower. The sudden
variability is non-
stationarity.
This is what is colloquially
known as ‘bufferbloat’. It is
the result of poor choices of
scheduling and resource
control in networks. This
problem is extremely
widespread.
The solution to this problem
is to schedule traffic better
and avoid over-saturation of
resources. Unfortunately, the
chosen means of scheduling
by the broadband industry
(Active Queue Management)
merely shifts the symptoms
around, rather than truly
addressing its root
engineering causes.
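The steep upslope and slow downslope can be sketched with a fluid model of a single buffer. This is an illustration under assumptions, not the measured system: the buffer is unbounded, the link capacity is fixed, and the load profile and function name are hypothetical.

```python
def queue_delay_trace(offered_mbps, capacity_mbps=10.0, step_s=0.1):
    """Fluid model of one FIFO buffer with no size limit.
    offered_mbps: offered load in each time step.
    Returns the queueing delay in seconds at the end of each step."""
    backlog_mbit = 0.0
    delays = []
    for load in offered_mbps:
        # Backlog grows when load exceeds capacity and shrinks otherwise.
        backlog_mbit = max(0.0, backlog_mbit + (load - capacity_mbps) * step_s)
        delays.append(backlog_mbit / capacity_mbps)
    return delays

# Hypothetical load: light use, a 2 s burst at twice capacity, then 90% of capacity.
load = [5.0] * 20 + [20.0] * 20 + [9.0] * 250
trace = queue_delay_trace(load)
peak = max(trace)
rise_steps = trace.index(peak) - 19
drain_steps = next(i for i, d in enumerate(trace) if i > 39 and d == 0.0) - 39
print(f"peak delay {peak:.1f} s, built in {rise_steps} steps, drained in {drain_steps} steps")
```

The queue fills at the rate by which the load exceeds capacity, but drains only at the spare capacity left over afterwards, so a short overdrive leaves a long tail of delay.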
21. Help me to understand…
…what it shows? …what it means? …what to do about it?
This is a more extreme
version of the previous
example, with large queues
forming and slowly draining,
creating very high non-
stationarity. Some of the
gaps indicate many of the
test packets are being lost.
(The losses are being
recorded, just not shown.)
When we compose different
access media (in this case
two different wireless
systems), their performance
interacts in ways that may
not be desirable or under the
control of the ISP.
The reality is that methods
like tethering are frequently
used by end users. Both the
network architects and
regulatory system need to
take account of these modes
of use in their design and
operation.
23. Help me to understand…
…what it shows? …what it means? …what to do about it?
Every 60 seconds there is a
sudden non-stationary
burst in the upstream of
around 0.5 seconds, with a
corresponding burst in the
downstream (not shown) of
around 0.2 seconds.
This would appear to be
some load-related issue,
since it is bidirectional and
shows up as V. Some
application is presumably
waking up every minute and
suddenly applying a heavy
brief load that over-saturates
the link resources.
There is a lack of
performance isolation of
applications on both the
customer network and the
ISP service. The
measurement system needs
to be able to differentiate
between these causes with
probes at the hand-off point.
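A regular pattern like the 60-second one on this slide can be picked out of a delay trace mechanically. The sketch below is one simple approach, assumed for illustration: mark where the delay first crosses a threshold, then take the most common gap between those burst starts. The threshold, sampling interval, data and function name are all hypothetical.

```python
def dominant_burst_period_ms(delays_ms, sample_interval_ms, threshold_ms):
    """Return the most common gap (ms) between burst starts, or None if
    fewer than two bursts are found."""
    # A burst starts where the delay rises above the threshold.
    starts = [i for i, d in enumerate(delays_ms)
              if d > threshold_ms and (i == 0 or delays_ms[i - 1] <= threshold_ms)]
    gaps = [(b - a) * sample_interval_ms for a, b in zip(starts, starts[1:])]
    if not gaps:
        return None
    return max(set(gaps), key=gaps.count)

# Hypothetical trace: one sample every 100 ms for 300 s, with a
# 0.5 s burst of extra delay once a minute.
delays = [20] * 3000
for start in range(100, 3000, 600):
    for offset in range(5):
        delays[start + offset] = 120

period = dominant_burst_period_ms(delays, sample_interval_ms=100, threshold_ms=50)
print(f"dominant burst period: {period / 1000:.0f} s")
```

Running the same check on probes placed either side of the hand-off point would indicate which side of the boundary the periodic load originates on.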
25. Help me to understand…
…what it shows? …what it means? …what to do about it?
Every 10 seconds there is a
short non-stationary burst
of delay, in both directions,
seen as vertical ‘stacks’.
This is a normal part of the
operation of Wi-Fi networks,
which typically scan every 10
seconds. When the radio is
busy doing one thing
(scanning), it cannot be
doing another thing
(transmitting).
Plug directly into your router
over a fixed Ethernet cable if
it’s a problem. For regulators,
separating out the
contribution of the home
network from the ISP
network requires suitable
boundary probe measures.
27. Help me to understand…
…what it shows? …what it means? …what to do about it?
In this case the network is
highly stationary. However,
it also has a high base delay
of over 25ms.
The high base delay means
that a significant proportion
of the ‘quality budget’ for
things like long-distance VoIP
is already used up; your
Skype calls to New Zealand
from US/Europe may not
work. Stationarity is
necessary, but not sufficient,
for applications to perform
to the standard required.
We can only begin to
optimise networks once we
have a baseline of
stationarity. Otherwise, we
have no stable properties
from which to determine
cause and effect. ISP
services need to define
their stable ‘quality floor’,
which is a proxy for their
fitness-for-purpose.
29. Help me to understand…
…what it shows? …what it means? …what to do about it?
In this chart, we have plotted
the delay against the packet
size, rather than against
time. We can see a sudden
jump at around 1300 bytes.
This is another form of non-
stationarity in the
distribution of performance.
This is an example of packet
fragmentation happening. It
would affect the
performance of many
interactive and real-time
applications. Generic speed
tests are hopelessly poor at
detecting these phenomena.
Performance is sensitive to
packet size, but not all
performance tests take
account of this. We have
seen major operators use a
single packet size for all their
tests. The industry needs to
adopt rigorous scientific
management of
performance.
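A step of this kind only shows up when test packets span a range of sizes. The sketch below is one assumed way to locate it: take the minimum delay at each packet size and report the largest jump between neighbouring sizes. The data is invented, with a step placed above 1300 bytes to mirror the slide, and the function name is hypothetical.

```python
def find_size_step(samples):
    """samples: list of (packet_size_bytes, delay_ms).
    Returns (size_before, size_after, jump_ms) for the largest increase in
    minimum delay between adjacent packet sizes."""
    minima = {}
    for size, delay in samples:
        minima[size] = min(delay, minima.get(size, delay))
    sizes = sorted(minima)
    before, after = max(zip(sizes, sizes[1:]),
                        key=lambda pair: minima[pair[1]] - minima[pair[0]])
    return before, after, minima[after] - minima[before]

# Made-up measurements: delay grows gently with size, plus an extra
# 4 ms for packets larger than 1300 bytes.
samples = [(size, 10 + 0.002 * size + (4 if size > 1300 else 0) + extra)
           for size in range(100, 1501, 100)
           for extra in (0.0, 0.7)]

before, after, jump = find_size_step(samples)
print(f"largest step: between {before} and {after} bytes, {jump:.1f} ms")
```

A test that uses a single packet size would sample only one side of the step and could not see it at all.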
31. Help me to understand…
…what it shows? …what it means? …what to do about it?
The variable portion of delay
(V) in the upstream direction
is slowly declining over the
period of the experiment.
This initially looks like some
kind of non-stationarity.
This one is us playing a trick
on you! The network is in fact
stationary. What this data
actually shows is ‘clock drift’
in the measurement system,
due to using low-cost
apparatus with less stable
internal clocks. It is a sign
that our data is ‘real’ and
‘honest’. We can correct for
this kind of issue, but have
chosen not to here.
First you need to know
whether what you are
looking at is ‘real’ or an
artefact of the measurement
process. To capture high-
quality data you must
identify and quantify the
‘junk and infidelity’ of your
metrics, measures and
performance models.
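The slide says clock drift can be corrected but does not say how. The sketch below shows one possible approach, which is an assumption rather than the authors' method: fit a straight line through the minimum delay in each window of samples, treat its slope as the drift, and subtract it. The window size, data and function name are hypothetical.

```python
def remove_linear_drift(times_s, delays_ms, window=50):
    """Estimate drift as the least-squares slope through per-window minimum
    delays, then subtract it. Returns (slope_ms_per_s, corrected_delays_ms)."""
    xs, ys = [], []
    for i in range(0, len(delays_ms), window):
        chunk = delays_ms[i:i + window]
        j = i + chunk.index(min(chunk))   # position of the window minimum
        xs.append(times_s[j])
        ys.append(delays_ms[j])
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    corrected = [d - slope * (t - times_s[0]) for t, d in zip(times_s, delays_ms)]
    return slope, corrected

# Made-up trace: a stationary 30 ms path with small repeating jitter,
# measured with a clock that drifts by 5 microseconds per second.
times = list(range(1000))
delays = [30.0 - 0.005 * t + 0.3 * (t % 7) for t in times]

slope, corrected = remove_linear_drift(times, delays)
print(f"estimated drift: {slope * 1000:.1f} microseconds per second")
print(f"corrected delay range: {min(corrected):.2f} to {max(corrected):.2f} ms")
```

Using window minima rather than all samples keeps load-related delay from being mistaken for drift.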
34. Help me to understand…
…what it shows? …what it means? …what to do about it?
The time taken to serialise
and deserialise packets over
network links (S – shown in
green) is stable. However,
the base delay of the
network is constantly rising
each day, before resetting
once it reaches some
maximum. This is non-
stationarity of G (shown in
blue).
The VDSL system is self-
optimising in some way that
means ‘G’ is not stationary.
However, small cells assume
stationary ‘G’ in order for
their timing systems to work.
Nobody can blame the
operator or regulator for this,
since stationarity of G was
never a policy or engineering
requirement. However, the
assumption in 5G business
plans is that cheap and
ubiquitous backhaul will be
available. This highlights the
need for forward planning
in core infrastructure.
36. Help me to understand…
…what it shows? …what it means? …what to do about it?
This data shows how neither
the geographic delay (here,
in green), nor the size-
related delay (in blue), have
stable properties. The gap is
where no measurements
were recorded; it is not an
outage.
This non-stationarity is like
a building with wet rot in the
basement, and dry rot in the
windows. It means adaptive
and learning protocols
cannot operate well over the
long run.
The regulatory system
needs a performance
management upgrade to
ensure that national
infrastructure is fit-for-
purpose over the long run.
38. Help me to understand…
…what it shows? …what it means? …what to do about it?
This shows the geographic
delay (in green) moving
within tight bounds, whilst
the size-related delay (in
blue) is consistent. The scale
makes it look like there is
higher variability than there
really is.
This is essentially a good
service with relatively
stationary properties. What
it does illustrate is how
the Internet has ‘weather’,
with frequent variability.
Vendors, operators and
regulators need to get a grip
on their ‘meteorology’ and
‘climatology’. There is a need
for ‘geoengineering’ of these
systems to deliver the
‘weather’ properties we
desire. This requires ISPs to
define their intentional
semantics and manage to
that requirement.
39. GPON access to AWS in another country
Notice load-balancing effects
40. Help me to understand…
…what it shows? …what it means? …what to do about it?
With G (in green) we see
banded striations, where
there are two distinct levels
of delay at the same time.
There are also outliers of
both G and S (in blue). This is
non-stationarity since there
is high variation in the delay
of G and S.
The G effect is the result of
load balancing, with packets
taking different routes. This
is fine if you take one route
today vs another tomorrow.
However, if you hash IP/port,
there can be massive
inconsistency, with video
going one way, audio
another, and being out of
sync. Hence this can be a
significant QoE impact.
Is this behaviour within
specification or not? Given
there typically is no
performance specification, it
must all be acceptable. The
industry needs to consider
the stationarity of G, S and V
both independently and
collectively. Mere round-trip
times, jitter and average
loss rates are not enough!
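The IP/port hashing mentioned on this slide can be sketched as follows. Real load balancers use their own hash functions; SHA-256 is used here only as a convenient, deterministic stand-in, and the addresses, ports and function name are hypothetical.

```python
import hashlib

def pick_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Map a flow's 5-tuple to one of n_paths, as a per-flow load balancer might."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # The same 5-tuple always yields the same path index.
    return int.from_bytes(digest[:4], "big") % n_paths

# Two flows of one hypothetical call: same hosts, different ports.
audio_path = pick_path("192.0.2.10", "198.51.100.7", 50000, 5004, "udp", n_paths=2)
video_path = pick_path("192.0.2.10", "198.51.100.7", 50002, 5006, "udp", n_paths=2)
print("audio path:", audio_path, "video path:", video_path)
```

Each flow stays on one path, so packets within a flow keep their order, but nothing ties the audio and video flows of the same session to the same path or the same delay.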
42. Even the core Internet
links aren’t stable!
[Figure: axes are geographic delay (G) (ms) and size delay (S) (ms); labels: Stationary, Non-stationary]
43. Help me to understand…
…what it shows? …what it means? …what to do about it?
This is an analysis of
longitudinal data of a portion
of the core Internet
backbone over a 10Gbit/sec
path. It is a cluster diagram of
G vs S. The S is small, and
tightly managed. However,
we see some “normal
outliers” (big abnormal ones
have been excluded).
We would expect S to be
constant with respect to G,
but it isn’t. The widespread
assumption that core
networks are dead stable is
false: there is non-
stationarity.
We need to move to digital
supply chain management
to be able to manage these
performance phenomena
over multiple technical and
management boundaries.
45. About these high-fidelity
measurements of network non-stationarity
• They are selected from a variety of projects, done both for private
clients and as publicly funded research.
• The measurements are done in the upstream and downstream,
capturing the loss and delay of individual test packets, and their
resulting probability distribution.
• There has been no attempt to replicate these specific
phenomena, analyse their temporal frequency or spatial
distribution, or isolate their root causes.
• We have not addressed packet loss in these teaching examples,
but it is fully incorporated into the measurement and modelling
methodology.
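Keeping loss and delay together in one distribution can be sketched like this. It assumes the common approach of an empirical cumulative distribution in which lost packets never arrive, so the curve levels off below 1; the data, thresholds and function name are hypothetical.

```python
def delivery_cdf(delays_ms, thresholds_ms):
    """delays_ms: per-packet one-way delay in ms, or None if the packet was lost.
    Returns, for each threshold t, the fraction of all sent packets delivered
    within t. The curve tops out at 1 - loss rate rather than at 1."""
    sent = len(delays_ms)
    return [sum(1 for d in delays_ms if d is not None and d <= t) / sent
            for t in thresholds_ms]

# Made-up results for ten test packets in one direction; two were lost.
observed = [10, 12, None, 15, 40, 11, None, 13, 14, 90]
cdf = delivery_cdf(observed, thresholds_ms=[12, 20, 100])
print(cdf)  # [0.3, 0.6, 0.8]
```

Reading the result: 30% of packets arrived within 12 ms, 60% within 20 ms, and the curve never exceeds 0.8 because 20% of packets were lost.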
46. How are these extremely precise
network quality measurements made?
Packet flow
“wind tunnel”
Test traffic with
special statistical properties
Packet flow
“functional MRI scan”
High-resolution space
and time observations
Quality attenuation
science
New ∆Q mathematics and
methods for data analysis
47. These are not ‘speed tests’:
We are measuring quality, not quantity
• A ‘speed test’ is like asking if your
electricity supply can power an
overnight storage heater.
• A ‘stationarity test’ is like asking
if the power supply is of sufficient
stability to drive a motor at a
constant speed.
• The latter contains far more
information than the former.
• For more on the inherent
limitations of broadband speed
tests see here.
48. Additional reading about the
measurement techniques and ∆Q
• For the core science and mathematics of ∆Q see
qualityattenuation.science or the PhD of Dr Dave Reeve.
• How to X-ray a telecoms network shows our measurement
method and tools.
• Fundamentals of network performance engineering for G, S & V.
• What is ‘stationarity’, and why does it matter?
• Examples of using high-fidelity ∆Q metrics at CERN (video at 40
million frames/sec) and Kent Public Service Network
• The properties and mathematics of data transport quality
• Network performance optimisation using high-fidelity measures
49. To learn more…
Engineered experiences for broadband
www.justright.network
Bespoke measurement and modelling
www.pnsol.com
Educational services and consultancy
www.martingeddes.com