A DRAM-friendly priority queue Internet packet scheduler implementation and its effects on TCP
1. Copyright (c) 2020, Katsushi Kobayashi. All rights reserved.
Katsushi Kobayashi (ikob@acm.org)
3. Packet buffer on router
• Has a significant impact on applications' QoE.
• The preferred buffer size depends on the application:
  • Throughput-centric flows: buffers as large as the BDP.
    • Legacy FTP, video-streaming buffering at start.
  • Latency-sensitive flows: as small as possible.
    • VoIP, interactive Web, gaming.
• An ordinary FIFO cannot satisfy all of them:
  • If large, the built-up queue hurts latency-sensitive applications, aka bufferbloat.
  • If small, it suppresses rapid cwnd growth and reduces throughput.
4. AQMs against bufferbloat; Int-/Diff-serv among ISPs
• CoDel and PIE strike a compromise between throughput-centric and latency-sensitive applications while accepting TCP slow-start.
  • Transient queue build-up still blocks latency-sensitive flows.
  • A limit of the "one-size-fits-all" approach.
• Deploying Int-/Diff-serv demands an economic infrastructure in addition to updated network facilities [RFC5290].
• DetNet focuses on closed, controlled networks, NOT on the Internet.
• The best-effort service model should be retained.
5. Latency Awareness on a Future Internet
• Satisfy various buffer-latency requirements within the best-effort service.
• Architecture: end hosts and the network work together:
  • Applications indicate a latency limit in the IP header, e.g., the ToS or DSCP field.
  • Routers schedule packets in an Earliest Deadline First (EDF) manner.
• Challenge: no resource management in the best-effort service.
  • Under congestion of a priority queue, packets whose deadlines have elapsed block the entire queue and cause unbounded delays.
[Figure: applications such as VoIP, interactive Web, and OS update declare different latency limits t1, t2, t3.]
6. EDF with reneging (EDFR) scheduler
• Operation (a minimal sketch follows this list):
  • Dequeue the packet with the earliest deadline.
  • Forward it if its deadline has NOT elapsed.
  • Otherwise, discard it.
• No queue build-up, even under congestion.
• Loss properties similar to a size-limited FIFO:
  1. The overall loss rate vs. traffic intensity is similar to that of a limited FIFO whose size corresponds to the mean deadline [Kruk].
  2. Loss-rate distributions are almost flat, except for very short deadlines.
  ➡ No significant impact expected on TCP flows.
• Confirmed by ns-2 simulations.
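A minimal software sketch of the dequeue rule above, assuming a heap-ordered queue purely for illustration (the hardware design on later slides deliberately avoids such random-access structures):

```python
import heapq, itertools, time

class EDFRQueue:
    """Earliest-Deadline-First queue with reneging (illustrative sketch only)."""

    def __init__(self):
        self._heap = []                   # entries: (absolute deadline, seq, packet)
        self._seq = itertools.count()     # tie-breaker so packets are never compared

    def enqueue(self, packet, deadline_s):
        # The priority is the remaining time to the per-packet deadline.
        heapq.heappush(self._heap,
                       (time.monotonic() + deadline_s, next(self._seq), packet))

    def dequeue(self):
        """Dequeue the earliest-deadline packet; renege (drop) it if expired."""
        now = time.monotonic()
        while self._heap:
            deadline, _, packet = heapq.heappop(self._heap)
            if deadline >= now:           # deadline not elapsed -> forward
                return packet
            # deadline elapsed -> discard and examine the next packet
        return None                       # queue empty
```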
Kobayashi, K. "LAWIN: A Latency-AWare InterNet architecture for latency support on best-effort networks." HPSR 2015.
Kruk, Ł., et al. "Heavy traffic analysis for EDF queues with reneging." The Annals of Applied Probability 21.2 (2011).
[Figure: fraction of reneged packets vs. deadline for EDF with reneging in an M/M/1 system at ρ = 0.98, with deadlines drawn from U(1,B) and U(5,B), mean D = 200; the system total matches the M/M/1/N blocking rate (1-ρ)ρ^N / (1-ρ^(N+1)).]
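The theory curve referenced in the figure is the standard M/M/1/N blocking probability; a small helper to evaluate it (the parameter values below are illustrative, matching the ρ = 0.98 and mean deadline of 200 noted in the figure):

```python
def mm1n_blocking(rho: float, n: int) -> float:
    """Blocking probability of an M/M/1/N queue: (1 - rho) * rho**n / (1 - rho**(n + 1))."""
    return (1.0 - rho) * rho ** n / (1.0 - rho ** (n + 1))

# Illustrative check: traffic intensity 0.98 and a FIFO of N = 200 packets,
# i.e., a buffer whose size corresponds to the mean deadline.
print(mm1n_blocking(0.98, 200))   # ~3.6e-4
```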
7. Objectives
• Present the feasibility of a latency-aware Internet:
  • Hardware-based EDFR packet scheduler implementations able to support 100 Gbps or more.
  • Investigate how TCP behavior changes with EDFR, using real end systems.
9. EDFR implementation
• DRAM is the only practical choice for the packet buffer:
  • 1.25 GB for a 100 ms buffer on a 100 Gbps link (arithmetic restated below).
  • Bandwidth: 460 GB/s for HBM2, 20 GB/s for DDR4.
  • Random-access latency: about 100 ns.
• EDFR needs a priority queue that treats the remaining time to the deadline as the priority.
• Many efficient priority-queue packet schedulers exist: heap (O(log n)), calendar queue (O(1)), ....
  • They are incompatible with DRAM due to their random-access nature: only about 5 ns are available per 64 bytes on a 100 Gbps link.
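Restating the two numbers above from the link rate alone:

\[
100\,\mathrm{Gbit/s} \times 100\,\mathrm{ms} = 10\,\mathrm{Gbit} = 1.25\,\mathrm{GB},
\qquad
\frac{64\,\mathrm{B} \times 8\,\mathrm{bit/B}}{100\,\mathrm{Gbit/s}} \approx 5.1\,\mathrm{ns}.
\]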
10. Priority queue: multiple ring buffers + priority encoder
• A naive but practical compromise implementation (a software analogue is sketched after this list).
• The number of deadline classes is small, < 256:
  • 8 bits for ToS, 6 bits for DSCP.
  • Able to represent up to 256 ms with 1 ms granularity.
• Ring-buffer FIFO:
  • Brings out DRAM bandwidth thanks to its sequential-access pattern.
  • Compatible with variable packet sizes.
• Priority encoder:
  • Treats the per-packet deadline as the priority.
  • Under EDFR, dropped packets consume memory bandwidth in addition to forwarded ones.
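A rough software analogue of this structure, assuming one FIFO ring per deadline class with a linear scan standing in for the hardware priority encoder (the class count comes from the slide; everything else is illustrative, not the FPGA design itself):

```python
from collections import deque

NUM_CLASSES = 256          # deadline classes, e.g., 1 ms granularity up to 256 ms

class MultiRingScheduler:
    """One FIFO per deadline class; dequeue from the most urgent non-empty class."""

    def __init__(self):
        self.rings = [deque() for _ in range(NUM_CLASSES)]

    def enqueue(self, packet, deadline_class: int):
        # Packets of the same class share one ring buffer (sequential DRAM access).
        self.rings[deadline_class].append(packet)

    def dequeue(self):
        # Priority encoder: pick the lowest-numbered (most urgent) non-empty class.
        for ring in self.rings:
            if ring:
                return ring.popleft()
        return None        # all rings empty
```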
[Figure: per-class packet buffers (Class 0 ... Class n), each a ring buffer with head/tail and wr_ptr/rd_ptr; received packets are enqueued at the tail, and dequeued packets are either sent or dropped.]
11. Skip-FIFO
• Two FIFOs, to reduce the bandwidth wasted on dropped packets (a behavioral sketch follows this list):
  1. Packet data: a ring buffer with wr_ptr / rd_ptr.
  2. Timestamp + pointer FIFO, handled at each skip interval:
    • Input: (wr_ptr, T_now, deadline); one entry is enqueued per skip_interval.
    • Output: (elapsed_ptr, T_deadline); dequeued if T_deadline < T_now.
    • If elapsed_ptr > rd_ptr, then rd_ptr = elapsed_ptr.
• Skip-FIFO + priority encoder:
  • An approximate implementation of EDFR whose accuracy depends on the skip interval.
  • 12 μs with a 4K-word FIFO for 200 ms of buffer capacity, far below the 10-15 ms update intervals of modern AQMs.
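A behavioral sketch of the Skip-FIFO bookkeeping, assuming one instance per deadline class; DRAM reads/writes, wrap-around pointer comparison, and other hardware details are deliberately elided:

```python
import time
from collections import deque

class SkipFIFO:
    """Sketch: expired packets are skipped by advancing rd_ptr from a small
    timestamp FIFO instead of reading them out of DRAM."""

    def __init__(self, size_bytes: int, skip_interval_s: float):
        self.size = size_bytes
        self.wr_ptr = 0                 # where the next packet is written
        self.rd_ptr = 0                 # where the next packet is read
        self.skip_interval = skip_interval_s
        self.last_stamp = 0.0
        self.ts_fifo = deque()          # entries: (wr_ptr, absolute deadline)

    def enqueue(self, pkt_len: int, deadline_s: float):
        now = time.monotonic()
        # Record one (pointer, deadline) entry per skip interval, not per packet.
        if now - self.last_stamp >= self.skip_interval:
            self.ts_fifo.append((self.wr_ptr, now + deadline_s))
            self.last_stamp = now
        self.wr_ptr = (self.wr_ptr + pkt_len) % self.size   # packet data goes to DRAM here

    def reclaim_expired(self):
        """Advance rd_ptr past regions whose deadline has already elapsed."""
        now = time.monotonic()
        while self.ts_fifo and self.ts_fifo[0][1] < now:
            elapsed_ptr, _ = self.ts_fifo.popleft()
            if elapsed_ptr > self.rd_ptr:        # wrap-around handling omitted
                self.rd_ptr = elapsed_ptr        # skip expired packets without reading them
```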
[Figure: Skip-FIFO structure: per-class ring buffers (Class 0 ... Class n) plus a small timestamp FIFO; enqueue(wr_ptr, T_now + deadline) at each skip_interval, dequeue(elapsed_ptr, T_deadline), and if elapsed_ptr > rd_ptr then rd_ptr = elapsed_ptr.]
12. DRAM-based EDFR on FPGAs
• Implementations:
  • For TCP behavior with real ends:
    • Kintex-7 (28 nm), NetFPGA-CML, 512 MB DDR3, 4 x GbE ports.
  • For throughput:
    • Xilinx Virtex UltraScale+ (16 nm):
      • Alveo U280-ES with 8 GB HBM DRAM.
      • AWS F1 with 64 GB DDR4 DRAM.
• Consumes only about 20% more LUTs than an ordinary ring-buffer FIFO.
FIFO controllers' resource utilization with 64-bit data width and 512-byte burst size:

  Scheduler               BRAM (SRAM)   LUT    FF
  Skip-FIFO                    10       1746    655
  FIFO                          2       1428    437
  Virtual FIFO (Xilinx)         4       1169   1938
13. Skip-FIFO throughputs with SRAM, DDR4, and HBM
• Constant regardless of packet size (including metadata).
• Increases with larger transactions (AXI-MM burst length):
  • HBM: 39 Gbps at 4 KB bursts, 1.8 Gbps at 64 B.
  • DDR4: 60 Gbps at 4 KB, 2.7 Gbps at 64 B.
• For the entire system HBM >> DDR4, although a single HBM2 channel < DDR4:
  • HBM: 1.2 Tbps at 4 KB (76% of the theoretical maximum).
  • DDR4: 240 Gbps at 4 KB.
[Figure: Skip-FIFO bandwidth throughputs. (a) Single-channel throughputs; the edged bar areas exclude metadata overhead. (b) Entire-system throughputs aggregating the available memory channels.]
15. TCP behaviors with EDFR on real end systems
• Emulation system:
  • Network switch: NetFPGA-CML as a 4-port switch with the EDFR scheduler supporting 3 delay classes.
  • Hosts: Ubuntu 18.04.
  • Link delay: Linux NetEm.
  • Traffic generator: Flowgrind.
• 3 evaluation scenarios:
  1. Confirm deadline-aware scheduling on EDFR.
  2. Loss and throughput with Web-like traffic.
  3. Throughput of competing flows requesting different deadlines.
• Follows the TCP evaluation suite [draft-irtf-iccrg-tcpeval-01].
[Figure: the 6×6 dumbbell emulation topology with per-link delays between 0 ms and 75 ms; all links have a capacity of 1 Gbps.]
16. Scenario 1: per-packet deadline support on EDFR
• Generate two 3x3 groups of long-lived TCP CUBIC flows.
• Each 3x3 flow group requests a different deadline, e.g., 30-100 ms.
• In the FIFO cases, buffer capacities are 100 ms.
• The figures show CDFs of queueing delay aggregated per deadline.
• Deadline support on EDFR is confirmed.
• Note: case (c), 100-100 ms, is closer to (f), a shared FIFO, than to (e), dedicated FIFOs.
[Figure: a 6×6 dumbbell topology that comprises two 3×3 flow groups, i.e., (T1...T6) and (R1...R6); all links have a capacity of 1 Gbps.]
17. Scenario 2: loss and throughput under moderate load
• Generate two 3x3 flow groups.
• 3GPP HTTP model traffic instead of a real traffic trace.
• Consumes 80-90% of the bottleneck bandwidth.
• Throughput: no significant differences were found for any deadline combination.
• Loss: flows with longer deadlines saw slightly higher loss than shorter-deadline flows.
  • This disagrees with the ns-2 simulations and the expected EDFR behavior, but is not significant.
[Figure: the same 6×6 dumbbell topology as in Scenario 1 (two 3×3 flow groups, all links 1 Gbps).]
18. Scenario 3: Flow Completion Time (FCT) with competing flows
• Generate two flows requesting different deadlines:
  • Long-lived traffic.
  • The 2nd flow starts after 100 s.
  • Together they consume 100% of the bottleneck bandwidth.
• Upper bars: FCTs of 1.5 GB transfers for the two different-deadline flows.
  • FCTs are almost equal in all deadline combinations.
• Lower plots: all FCTs grow steadily, as with FIFO.
[Figure: a dumbbell with two pairs of nodes; all links have a capacity of 1 Gbps. Flow Completion Time (FCT) of 1.5 GB data competing with a flow of a different deadline, and FCT plots of two TCP CUBIC flows competing with (a) a 60-80 ms deadline pair on EDFR and (b) on FIFO.]
19. TCP behaviors summary
• Existing TCP stacks will work well if ordinary FIFO schedulers are replaced with EDFR.
• Most of the properties observed for CUBIC were fully retained with Reno as well.
21. Latency-aware Internet deployment
• Applications:
  • Can adopt latency support just by calling the existing socket API to set the IP ToS field (a sketch follows this list).
• Routers:
  • Expected to need smaller packet buffers as a result of explicit per-flow deadline declarations.
  • No economic infrastructure is required, since the service remains best effort.
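For example, on Linux an application could mark its latency limit through the standard socket API; the mapping of a millisecond deadline onto the ToS byte below is only an assumed illustration, not a standardized encoding:

```python
import socket

DEADLINE_MS = 30                      # latency limit this flow asks the network to honor
TOS_VALUE = DEADLINE_MS & 0xFF        # assumed encoding: 1 ms granularity packed into the ToS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)   # existing API, no kernel change
sock.connect(("example.com", 443))    # hypothetical peer, for illustration only
```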
23. Related work
• CoDel and PIE reduce packet buffer latency:
  • The target delay is fixed.
  • They allow transient queue build-up.
• The Least Slack Time First (LSTF) scheduler also accounts only for buffered delay, as our architecture does.
  • However, LSTF considers the cumulative buffered delay, unlike our per-hop basis.
24. Conclusion and future work
• Feasibility of a latency-aware Internet:
  • A DRAM-friendly EDFR packet scheduler able to support 1 Tbps or more.
  • TCP behavior is almost unchanged under EDFR, as confirmed with real end systems.
• Work together with emerging UDP-based transports:
  • HTTP priority can be mapped to a deadline.