Using SD-WAN over the internet requires an understanding of how latency varies on different paths. This presentation compares latency across various global city pairs and contrasts it with AWS backbone performance over the same paths.
3. Methodology: Tools used
Cedexis
• Measures Internet health
• Collects 14 billion RUM data points per day
• Data collection methodology
Catchpoint
• A leading monitoring system
• Data collection methodology
• We set up tests from last mile agents to Speedtest last mile servers
Speedtest
• Ookla/Speedtest is a leading last mile performance tool
• Used Catchpoint to test against Speedtest servers
4. Methodology: Definition of “Response”
Response = Send + Wait
• Excludes all one-time setup/negotiations like DNS and Connect
• Response is a better measure of real Internet response than Ping
[Diagram: client, DNS server, and web server timeline showing the request phases 1. DNS Lookup, 2. Connect, 3. TLS, 4. Send, 5. Wait, 6. Load, with Response spanning Send and Wait]
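By way of illustration (not the vendors' measurement code), a minimal sketch with made-up per-phase timings shows how Response counts only Send and Wait:

```python
# Hypothetical per-phase timings (ms) for one request; values are illustrative only
phases = {
    "dns_lookup": 28.0,  # one-time setup, excluded from Response
    "connect":    35.0,  # one-time setup, excluded from Response
    "tls":        41.0,  # one-time setup, excluded from Response
    "send":        2.0,  # included in Response
    "wait":      119.0,  # included in Response
    "load":      210.0,  # content download, not part of Response
}

# Response = Send + Wait
response_ms = phases["send"] + phases["wait"]
print(f"Response = {response_ms} ms")  # Response = 121.0 ms
```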
6. Methodology
Calculate CORE by subtracting the last mile
[Diagram: path between endpoints X and Y across the first mile, middle mile (core), and last mile]

Core Response = Response Long Haul (Cedexis) - Response Last Mile (Catchpoint/Speedtest)
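A minimal sketch of this subtraction, using the Bangalore/Atria to SJC AWS values from the results table below; medians and variances are each subtracted, as the table suggests (the study's exact statistical treatment may differ):

```python
import math

# Long haul (Cedexis), Bangalore -> SJC AWS, from the results table
long_haul_median_ms = 224
long_haul_variance_ms2 = 125 ** 2            # SD 125 ms -> variance 15625 ms²

# Last mile (Catchpoint/Speedtest), Bangalore/Atria, from the results table
last_mile_median_ms = 3
last_mile_variance_ms2 = 5.88 ** 2           # SD 5.88 ms -> variance ~34.57 ms²

# Core (middle mile) = long haul minus last mile
core_median_ms = long_haul_median_ms - last_mile_median_ms            # 221 ms
core_variance_ms2 = long_haul_variance_ms2 - last_mile_variance_ms2   # ~15590 ms²
core_sd_ms = math.sqrt(core_variance_ms2)                             # ~125 ms

print(core_median_ms, round(core_variance_ms2), round(core_sd_ms))    # 221 15590 125
```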
7. Results
Long Haul = Cedexis; Last Mile = Catchpoint/Speedtest; Core = Long Haul minus Last Mile

End User Location | End User ISP | Origin Server | Long Haul Median (ms) | Long Haul SD (ms) | Long Haul Variance (ms²) | Last Mile Median (ms) | Last Mile SD (ms) | Last Mile Variance (ms²) | Core Median (ms) | Core SD (ms) | Diff Variances (ms²)
Bangalore | Atria | SJC AWS | 224 | 125 | 15625 | 3 | 5.88 | 34.57 | 221 | 125 | 15590
Bangalore | Atria | London AWS | 148 | 92 | 8464 | 3 | 5.88 | 34.57 | 145 | 92 | 8429
Bangalore | Atria | Tokyo AWS | 119 | 99 | 9801 | 3 | 5.88 | 34.57 | 116 | 99 | 9766
Bangalore | Atria | Sydney AWS | 295 | 106 | 11236 | 3 | 5.88 | 34.57 | 292 | 106 | 11201
DC | Cox | SJC AWS | 101 | 119 | 14161 | 16 | 6.08 | 36.97 | 85 | 119 | 14124
DC | Cox | Tokyo AWS | 186 | 112 | 12544 | 16 | 6.08 | 36.97 | 170 | 112 | 12507
DC | Cox | Sydney AWS | 264 | 163 | 26569 | 16 | 6.08 | 36.97 | 248 | 163 | 26532
Tokyo | Ucom | SJC AWS | 105 | 102 | 10404 | 4 | 7.46 | 55.65 | 101 | 102 | 10348
Tokyo | Ucom | London AWS | 205 | 98 | 9604 | 4 | 7.46 | 55.65 | 201 | 98 | 9548
Tokyo | Ucom | Sydney AWS | 191 | 68 | 4624 | 4 | 7.46 | 55.65 | 187 | 68 | 4568
London | BT | SJC AWS | 174 | 123 | 15129 | 10 | 12 | 144.00 | 164 | 122 | 14985
London | BT | Tokyo AWS | 271 | 160 | 25600 | 10 | 12 | 144.00 | 261 | 160 | 25456
London | BT | Sydney AWS | 348 | 182 | 33124 | 10 | 12 | 144.00 | 338 | 182 | 32980
Melbourne | Singtel/Optus | SJC AWS | 198 | 92 | 8464 | 6 | 5.2 | 27.04 | 192 | 92 | 8437
Melbourne | Singtel/Optus | London AWS | 337 | 191 | 36481 | 6 | 5.2 | 27.04 | 331 | 191 | 36454
SF | Comcast | London AWS | 166 | 134 | 17956 | 17 | 9.08 | 82.45 | 149 | 134 | 17874
SF | Comcast | Virginia AWS | 89 | 103 | 10609 | 17 | 9.08 | 82.45 | 72 | 103 | 10527
8. Why the Problem in the Middle Mile?
• First mile: $$$ 20X* investment, paid for by customers
• Middle mile: $ 6X* investment, least cost peering and routing
• Last mile: $$$ 50X* investment, paid for by customers
* Source: Akamai
[Diagram: first mile, middle mile, and last mile; middle mile carriers shown include Telia Carrier, GTT, and Level 3]
9. Example: Core & Last Mile Traceroutes
[Diagram: example traceroutes across the first mile, middle mile, and last mile]
12. Middle Mile (Core): Part 1 vs Part 2
City | Backbone | Origin Server | Part 1 Median (ms) | Part 2 Median (ms) | Part 1 SD (ms) | Part 2 SD (ms)
Bangalore | Tata | SJC AWS | 221 | 224 | 125 | 153
Bangalore | Tata | London AWS | 145 | 153 | 92 | 50
Bangalore | Tata | Tokyo AWS | 116 | 114 | 99 | 66
Bangalore | Tata | Sydney AWS | 292 | 300 | 106 | 154
DC | Comcast | SJC AWS | 85 | 74 | 119 | 35
DC | Comcast | Tokyo AWS | 170 | 191 | 112 | 118
DC | Comcast | Sydney AWS | 248 | 250 | 163 | 346
Tokyo | NTT | SJC AWS | 101 | 109 | 102 | 70
Tokyo | NTT | London AWS | 201 | 230 | 98 | 81
Tokyo | NTT | Sydney AWS | 187 | 110 | 68 | 84
London | BT | SJC AWS | 164 | 181 | 122 | 228
London | BT | Tokyo AWS | 261 | 271 | 160 | 273
London | BT | Sydney AWS | 338 | 336 | 182 | 346
Melbourne | Telstra | SJC AWS | 192 | 168 | 92 | 9
Melbourne | Telstra | London AWS | 331 | 307 | 191 | 1725
SF | Level3 | London AWS | 149 | 145 | 134 | 68
SF | Level3 | Virginia AWS | 72 | 76 | 103 | 71
Results: This one is interesting; Telstra is doing something special with SJC AWS.
15. Part 2 (Backbone - AWS) vs Part 3 (AWS - AWS)
Backbone Agent | AWS Agent | Origin Server | Part 2 Median (ms) | Part 3 Median (ms) | Part 2 SD (ms) | Part 3 SD (ms)
Tata | Mumbai AWS | SJC AWS | 224 | 240 | 152.86 | 6.18
Tata | Mumbai AWS | London AWS | 153 | 113 | 50.13 | 3.63
Tata | Mumbai AWS | Tokyo AWS | 114 | 121 | 65.78 | 4.41
Tata | Mumbai AWS | Sydney AWS | 300 | 228 | 154.17 | 9.43
Comcast | DC AWS | SJC AWS | 74 | 61 | 34.54 | 9.91
Comcast | DC AWS | Tokyo AWS | 191 | 172 | 118.08 | 7.63
Comcast | DC AWS | Sydney AWS | 250 | 205 | 345.50 | 5.22
NTT | Tokyo AWS | SJC AWS | 109 | 113 | 69.81 | 6.70
NTT | Tokyo AWS | London AWS | 230 | 247 | 81.13 | 6.19
NTT | Tokyo AWS | Sydney AWS | 110 | 104 | 83.60 | 11.78
BT | London AWS | SJC AWS | 181 | 137 | 227.80 | 10.51
BT | London AWS | Tokyo AWS | 271 | 247 | 272.98 | 122.24
BT | London AWS | Sydney AWS | 336 | 281 | 345.50 | 16.87
Telstra | Sydney AWS | SJC AWS | 168 | 147 | 9.36 | 10.59
Telstra | Sydney AWS | London AWS | 307 | 280 | 1725.37 | 16.77
Level3 | SJC AWS | London AWS | 145 | 140 | 68.11 | 52.87
Level3 | SJC AWS | Virginia AWS | 76 | 63 | 71.17 | 91.82
Additional comparison: DC Azure agent to SJC AWS: median 73 ms, SD 9.47 ms
Results
• Poor stability on some backbone-to-AWS paths
• AWS-to-AWS on-net is significantly more consistent/reliable
• AWS-to-AWS on-net has better/lower latency
16. Conclusions
• Internet variability is most dramatic in the core
• On high latency paths, to provide the stable performance needed by latency sensitive applications, use a private network such as:
  • SD-CORE accessed via internet VPN to local POPs, or
  • MPLS, which requires fiber or wire connections
• SD-WAN using two internet connections can mitigate much of the public internet variability, depending on overall path
17. Conclusions
• Understand that latency varies on the internet
• Look at your application requirements for latency and packet loss
• If a median latency of 70 ms with an SD of 34 ms is adequate for your application performance, you don't need MPLS; two internet circuits will do
• If the median latency for a path is 307 ms with an SD of 1,725 ms, you might not want to depend purely on the internet
• There are differences between internet backbones
This project began as a result of the widespread marketing of SD-WAN as a replacement for MPLS. I have seen plenty of SD-WAN implementations work very well using internet connectivity only. But I also don't want to rely on hype when making recommendations to enterprise clients who ask for my advice. While working with a client with offices in India and performing analysis with Netflow and ThousandEyes, I found statistical evidence of this variability. That led to the further study I will summarize today.
RUM data points are “Real Use Monitoring” data points.
Cedexis Radar collects data from more than 50,000 networks daily, with feeds from 130 service providers.
Real user monitoring (RUM) means fully understanding how internet performance impacts customer satisfaction and engagement.
Cedexis Radar gathers RUM data from each step between the client and any of the clouds, data centers, and CDNs hosting your applications
to build a holistic picture of internet health. Every request creates more data, continuously updating this unique real-time virtual map of the web.
A small piece of nonblocking, RUM-specific JavaScript is inserted into the designated web pages. When an end user visits a RUM-enabled page, the RUM JavaScript collects performance data and beacons it back to Akamai via a 1x1 pixel. The data is then processed and stored for visualization within the portal.
Catchpoint has a global network of 700+ monitoring nodes spanning backbone, broadband, cloud, enterprise, last mile, and wireless networks. It can comprehensively detect issues across third party services, CDNs, DNS, APIs, cloud providers, networks, systems, and more.
Ookla/Speedtest uses up to four HTTP threads during the download and upload portions of the test. ... After the pre-test, if the connection speed is at least 4 megabits per second, then Speedtest.net will use four threads. Otherwise, it will default to two threads.
So how did we use these tools?
When measuring Response, we are measuring the send and wait time. This is better than ping, which may be given low priority by network devices and therefore is not an accurate measure of real response time.
We exclude DNS and connect because their times can vary for reasons unrelated to our measurement goals and would therefore skew our results.
So how did we calculate the Internet Core performance?
The CORE, or middle mile, was calculated by measuring the Long Haul response (using Cedexis) and subtracting the Last Mile response (using Catchpoint and Speedtest), the tools described earlier.
The results are quite striking. For example:
Bangalore to San Jose: 221 ms with an SD of 125 ms, so latency could be as high as 346 ms.
London to Sydney: 338 ms with an SD of 182 ms, so latency could be as high as 520 ms.
Even a relatively short path like SF to Virginia: 72 ms with an SD of 103 ms, or as high as 175 ms!
For latency sensitive applications like voice or video, would you feel comfortable depending 100% on the internet for these long paths?
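A quick check of the arithmetic behind the "as high as" figures, treating median plus one standard deviation as a rough upper estimate (a simplification that assumes nothing about the shape of the latency distribution):

```python
# Core median and SD (ms) for the paths cited above
paths = {
    "Bangalore -> SJC AWS": (221, 125),
    "London -> Sydney AWS": (338, 182),
    "SF -> Virginia AWS":   (72, 103),
}

for name, (median_ms, sd_ms) in paths.items():
    # Median plus one SD as a rough "could be as high as" figure
    print(f"{name}: ~{median_ms + sd_ms} ms")   # 346, 520, 175
```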
Definitions
Variance (symbolized by S²) is a measure of how spread out a data set is. It is calculated as the average squared deviation of each number from the mean of the data set.
Standard deviation (symbolized by S) is the square root of the variance.
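As a concrete illustration, a short sketch using Python's statistics module on a made-up latency sample:

```python
import statistics

samples_ms = [210, 224, 198, 260, 415, 231, 205]   # illustrative latency samples, not measured data

mean_ms = statistics.fmean(samples_ms)
variance_ms2 = statistics.pvariance(samples_ms)     # S²: average squared deviation from the mean
sd_ms = statistics.pstdev(samples_ms)               # S: square root of the variance

print(f"mean={mean_ms:.1f} ms  variance={variance_ms2:.1f} ms²  SD={sd_ms:.1f} ms")
```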
Why is the problem in the middle mile?
There is far more investment in the first and last mile.
Least cost peering and routing is going to take the most cost effective path, unless you are willing to pay.
While you surely cannot read this, the point is this:
On this 216 ms path from India to San Jose:
The last mile traceroute has 5 hops, zero packet loss, and two ASNs.
The core traceroute has 18 hops across 6 ASNs and 3 countries.
Think of the potential BGP changes across this path.
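To reproduce hop and ASN counts like these, here is a minimal sketch; it assumes a Linux traceroute that supports -A for AS path lookups and annotates hops with [ASnnnn], and the target hostname is a placeholder:

```python
import re
import subprocess

def hop_and_asn_summary(target: str):
    """Run traceroute with AS lookups and summarize hop count and distinct ASNs."""
    out = subprocess.run(
        ["traceroute", "-A", target],            # -A: AS path lookups (Linux traceroute)
        capture_output=True, text=True, check=True,
    ).stdout
    hop_lines = [line for line in out.splitlines() if re.match(r"\s*\d+\s", line)]
    asns = set(re.findall(r"\[AS(\d+)", out))    # hops are annotated like [AS6453]
    return len(hop_lines), sorted(asns)

if __name__ == "__main__":
    hops, asns = hop_and_asn_summary("example.com")   # placeholder target
    print(f"{hops} hops across {len(asns)} ASNs: {asns}")
```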
So let’s look at the Core measurements
Part 2 = Internet Core
Part 3 = AWS backbone
DC Azure to SJC AWS is slower than DC AWS to SJC AWS