Back to Rings but not Tokens: Physical and Logical Designs for Distributed Filesystems intended for Bulk Transfer over E2E Emulated Cut-Through Circuits
Previous research has solidified the core idea of circuit emulation. The chief premise is that bulk transfers -- Big Data shards, for a practical example -- achieve the best end-to-end (e2e) throughput when all switches along the path operate in cut-through mode, as opposed to the store-and-forward mode they fall back to under any form of contention. Most networking technology deployed today, including network virtualization (SDN, NFV, etc.), runs under contention. This paper examines physical and logical network designs that provide a degree of control over contention, which in turn creates room for efficient scheduling of e2e circuits. The unit of this new mode of networking is the ring, but -- rather than a comeback of Token Ring -- this paper discusses partially overlapping asynchronous rings.
1. Back to Rings but not Tokens:
2015/11/17
Marat Zhanikeev
maratishe@gmail.com
IN Technical Committee @ Kumamoto
PDF: bit.do/151117
Physical and Logical Designs for Distributed Filesystems
intended for Bulk Transfer over E2E Emulated Cut-Through Circuits
2. The Big Picture
1. increase capacity and flexibility at the same number of ports
2. do it at a relatively low cost
3. stay at hardware level -- namely the cut-through mode
• ultimate goal: a better storage grid for BigData 14
[Figure: one "super-duper" 32-port software switch (ClickOS, SDN, NFV, ...) vs. eight simple 4-port switches -- the software switch costs more and offers far less capacity]
14 M.Zhanikeev "Streaming Algorithms for Big Data Processing on Multicore" Big Data: Algorithms..., CRC (2015)
M.Zhanikeev -- maratishe@gmail.com -- ...Rings but not Tokens: ...Distributed Filesystems ...over E2E Emulated Cut-Through Circuits -- bit.do/151117 2/22
3. Model of Per-Packet Overhead
[Figure: per-packet processing paths -- C: cut-through; Q: queue (checks, QoS classes, etc.); D: drop]
4. Circuits as a Scheduling Problem
[Figure: traditional circuits vs. scheduling -- each line is an outgoing port; with traditional circuits, overhead equals contention and grows with the number of flows; with scheduling, overhead is decoupled from contention]
5. The Tall Gate Model
• sensing is close to opportunistic/cognitive wireless technology
• the target: make even long-haul e2e circuits possible (DC-to-DC)
[Figure: the Tall Gate model -- sources send bulks through tall gates onto a shared highway toward the destination]
06 M.Zhanikeev "A City Traffic Model for Optical Circuit Switching in Data Centers" IEICE OCS Technical Committee (2014)
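The gate-plus-highway idea can be sketched as a toy admission loop. This is a minimal illustration under simplifying assumptions (one shared highway, greedy FIFO admission), not the cited model; `tall_gate_schedule` is a hypothetical helper that starts each bulk only when the highway is sensed idle, so circuits never contend.

```python
def tall_gate_schedule(bulks):
    """Greedy sensing at the gate: each bulk enters the shared highway
    only when it is sensed idle, so transfers never overlap.
    `bulks` is a list of (arrival_time, duration) pairs."""
    busy_until = 0
    finish_times = []
    for arrival, duration in sorted(bulks):
        start = max(arrival, busy_until)  # wait at the gate while busy
        busy_until = start + duration     # the circuit occupies the highway
        finish_times.append(busy_until)
    return finish_times

# three bulks contending for one highway
print(tall_gate_schedule([(0, 5), (1, 3), (2, 2)]))  # → [5, 8, 10]
```

The point of the sketch: contention is resolved entirely at the gate, so each transfer runs at full cut-through rate once admitted.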
6. Optical Circuits = OCS (no OBS/OPS)
[Figure: two OCS designs -- E-O ingress and O-E egress with either an O-E-O core or an all-optical O-O core; the NOC performs contention resolution (and TE), with management carried over Ethernet; bulks traverse the optical core]
• OBS is considered today's best technology
• however, when contention management is moved to the NOC, OCS becomes feasible
• the Tall Gate model is applicable
7. Circuits: Feature Comparison
• Tall Gate is better than Traditional Scheduler
• already discussed in DC networking 11 12
Model                      Interference   Overhead    Isolation
Do Nothing                 HIGH           ZERO        NO
Network Virtualization     HIGH           HIGH        NO (store-and-forward)
Traditional Scheduler      LOW            HIGH        YES (cut-through)
P2Px1N (1 network)         HIGH           VERY HIGH   YES
P2Px2N (2 networks)        ZERO           VERY HIGH   YES (cut-through)
Tall Gate (sensing)        VERY LOW       HIGH        YES (cut-through)
11 "Cut-Through and Store-and-Forward Ethernet Switching for Low-Latency Environments" Cisco White Paper (2014)
12 G.Wang+5 "c-Through: Part-time Optics in Data Centers" ACM SIGCOMM (2010)
8. Circuits: Performance under Hotspots (= Big Data)
[Figure: six panels of log(duration) vs. ordered flow list, one per flow-size range -- 10M..100M, 100M..500M, 500M..1G, 1G..10G, 10G..50G, 10G..100G -- each comparing Do Nothing, Network Virtualization, Traditional Scheduler, P2Px1N, P2Px2N, and Tall Gate]
• hotspot traffic: few very large flows
• group flows by size range
• compare transfer time across the models
9. Today: The Basic Idea
10. The Basic Idea: Logical Circuits
[Figure: physical vs. topological views of switches with 4 ports, 8 ports, ...; legend: port, switch, node, logical link (domain), physical link; example tuples ⟨2,2,2⟩, ⟨3,1,2⟩, ⟨1,4,4⟩, ⟨3,5,3⟩]
• take standard switches (cut-through OK on most)
• design isolated rings within the available ports
• ⟨a, b, c⟩ means: a ports/nodes on the inner ring, b outer rings, c the largest hub/port
• implementation: VLANs on Ethernet, MEMS in optical, etc.
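The isolated-ring idea can be illustrated with a small sketch, assuming each ring is realized as a VLAN-style mapping that dedicates exactly two ports per switch; `build_ring` is a hypothetical helper for illustration, not part of any cited implementation.

```python
def build_ring(switches):
    """Connect each switch to its two neighbours, forming one isolated
    ring (e.g. one VLAN per ring on Ethernet).  Returns adjacency lists."""
    n = len(switches)
    return {switches[i]: [switches[(i - 1) % n], switches[(i + 1) % n]]
            for i in range(n)}

ring = build_ring(["s0", "s1", "s2", "s3"])
# every switch commits exactly two of its ports to this ring,
# leaving the remaining ports free for other (overlapping) rings
assert all(len(neighbours) == 2 for neighbours in ring.values())
```

Because each ring consumes a fixed two ports per switch, the remaining ports of a 4- or 8-port switch can join other rings, which is what makes the partially overlapping designs below possible.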
11. Deep/Dark Theory
12. Deep/Dark Theory (1)
[Figure: a hierarchy of ring tuples ⟨x1,y1,z1⟩, ⟨x2,y2,z2⟩, ..., ⟨xn,yn,zn⟩ across n levels, with 1..m rings per level (different for each level), k processing nodes, storage nodes, and connections to the top level]
13. Deep/Dark Theory (2)
• can be used to analyze connectivity using adjacency matrices
[Figure: processors and storage-grid nodes N11..Nnq arranged in rows; a row of size a yields an a x a adjacency matrix of the peer mesh, while rows of sizes a and b yield an a x b matrix between levels]
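The adjacency-matrix analysis above can be sketched with plain matrix powers: an entry of A^k counts walks of length k, so a positive entry in A + A^2 + ... + A^h means the pair is connected within h hops. The helpers below are illustrative only.

```python
def matmul(A, B):
    """Plain dense matrix product for small adjacency matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def reachable_within(A, hops):
    """Entries of A + A^2 + ... + A^hops: positive iff a walk of at
    most `hops` hops connects the pair."""
    n = len(A)
    total = [row[:] for row in A]
    power = [row[:] for row in A]
    for _ in range(hops - 1):
        power = matmul(power, A)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total

# 4-node ring 0-1-2-3-0: every pair is reachable within 2 hops
A = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
R = reachable_within(A, 2)
assert all(R[i][j] > 0 for i in range(4) for j in range(4) if i != j)
```

The same machinery applies to the rectangular a x b between-level matrices: multiplying level matrices composes connectivity across levels.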
14. The Much Simpler Practice
15. The Much Simpler Practice (1)
• in concept, close to harness braiding (photo)
• similar to a Google paper on the design of its in-rack and rack-to-rack crosses
• immediately obvious: at least 2 distinct routes between any 2 nodes
[Figure: the ⟨1,4,4⟩ design on 4-port switches]
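The two-distinct-routes property can be checked mechanically: on any ring, the clockwise and counter-clockwise walks between two nodes share no intermediate node. A minimal sketch, where `two_routes` is a hypothetical helper operating on a ring listed in order:

```python
def two_routes(ring, src, dst):
    """Return the clockwise and counter-clockwise routes between two
    nodes of a ring; their intermediate nodes are always disjoint."""
    n = len(ring)
    i, j = ring.index(src), ring.index(dst)
    cw = [ring[k % n] for k in range(i, i + (j - i) % n + 1)]
    ccw = [ring[k % n] for k in range(j, j + (i - j) % n + 1)][::-1]
    return cw, ccw

cw, ccw = two_routes(["n0", "n1", "n2", "n3"], "n0", "n2")
# one route goes via n1, the other via n3 -- node-disjoint
assert set(cw[1:-1]).isdisjoint(ccw[1:-1])
```

Node-disjoint routes matter here because either one can carry a cut-through circuit while the other stays free for scheduling or failover.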
M.Zhanikeev -- maratishe@gmail.com -- ...Rings but not Tokens: ...Distributed Filesystems ...over E2E Emulated Cut-Through Circuits -- bit.do/151117 15/22
15/22
16. The Much Simpler Practice (2)
• with the ⟨3, 5, 3⟩ tuple, much more flexibility
• scientifically interesting: where to connect extra ports (above minimum connectivity) -- 2 or 3 switches away?
◦ ... and how does this affect overall connectivity?
[Figure: the ⟨3,5,3⟩ design on 8-port switches, with the placement of the extra ports left open]
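The 2-vs-3-switches-away question can be explored numerically: add one chord to a plain ring and compare average shortest-path length via BFS. This is a sketch under the assumption that "overall connectivity" is measured as mean hop count; `ring_with_chord` and `avg_path` are illustrative helpers.

```python
from collections import deque

def avg_path(adj):
    """Mean shortest-path length over all ordered node pairs (BFS)."""
    nodes = list(adj)
    total = pairs = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(nodes) - 1
    return total / pairs

def ring_with_chord(n, skip):
    """Ring of n switches plus one extra link wired `skip` switches away
    from switch 0 (skip=1 degenerates to the plain ring)."""
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    adj[0].add(skip % n)
    adj[skip % n].add(0)
    return adj

for skip in (2, 3):
    print(skip, avg_path(ring_with_chord(8, skip)))
```

Running this for larger rings and several chords gives a quick first answer to the slide's question before any exact graph-theoretic treatment.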
17. Switch Robotics?
• some book libraries have already been robotized
• why not return to the telephone switchboard, but in a robotic version?
• goal: dynamic management of connectivity using robotically migrating ports
• mid-term goal: switches with wobbling physical ports
18. That’s all, thank you ...
19. Application: BigData Replay
• Hadoop/MapReduce has failed 02 -- it can (barely) support only 15k concurrent users 01 (among many other problems)
• BigData Replay on massively multicore 14 is a valid alternative
[Figure: Hadoop vs. Replay architectures. Hadoop Space: name server(s)/name node, many storage nodes (shards) holding files A, B, C, ...; a client machine runs your code through the Hadoop client, which deploys many Hadoop/MapReduce jobs (start, use, deploy, find, read/parse), all under the Hadoop space manager. Replay side: storage nodes (shards) with time-aware sub-store(s) and a manager; the client machine runs your sketcher, which schedules multicore replay across many replay nodes]
02 A.Rowstron+4 "Nobody ever got fired for using Hadoop on a cluster" 1st Hot Topics in Cloud Data Processing (2012)
01 K.Shvachko "HDFS Scalability: the Limits to Growth" the Magazine of USENIX, vol.35, no.2 (2012)
14 M.Zhanikeev "Streaming Algorithms for Big Data Processing on Multicore" Big Data: Algorithms..., CRC (2015)
20. BigData Replay on Massively Multicore
[Figure: one Replay Batch -- a manager fills a buffer whose head is "now"; jobs hold their own read positions between the buffer tail and head; a controller reports, manages in realtime, and kills lagging jobs; replay at scale runs many buffers, each with its own group of jobs]
• massively multicore ≠ manycore 08
• 100+ cores on conventional hardware -- standard RAM, shmap, etc.
• target: 100k jobs, using multiple Replay nodes
08 M.Zhanikeev "...Massively Multicore, Heterogeneous Jobs with Hotspots, and Data Streaming" SWoPP (2015)
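The buffer-head/tail mechanics on this slide can be sketched as a single-writer, many-reader buffer where the controller evicts jobs that fall behind the tail. This is a simplified illustration of the idea, not the cited SWoPP implementation; all names below are hypothetical.

```python
class ReplayBuffer:
    """Single writer appends records at the head ('now'); many jobs read
    at their own positions; jobs that fall behind the tail are killed."""
    def __init__(self, size):
        self.size, self.head, self.data = size, 0, []
        self.jobs = {}                  # job id -> read position

    def append(self, record):
        self.data.append(record)
        self.head += 1
        tail = self.head - self.size
        # controller: evict any job whose position fell off the tail
        self.jobs = {j: p for j, p in self.jobs.items() if p >= tail}

    def read(self, job):
        # a new job starts at the current tail of the buffer
        pos = self.jobs.setdefault(job, max(0, self.head - self.size))
        if pos >= self.head:
            return None                 # caught up with the head
        self.jobs[job] = pos + 1
        return self.data[pos]

buf = ReplayBuffer(size=3)
for t in range(5):
    buf.append(t)
buf.read("job-a")   # job-a attaches at the tail (position 2)
```

Killing laggards instead of stalling the writer is what lets one replay node keep realtime pace for very large job counts.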
21. NextGen NOC
• circuits are most valuable when they considerably reduce bulk transfer time
• future NOCs will develop, advertise, and sell their ability to provide circuits
• part of NGN and autonomic network management -- see the recent IETF/MRTG meeting
10 M.Zhanikeev "The Next Generation of Networks is all about Hotspot Distributions and Cut-Through Circuits" IEICE CQ Technical Committee (2015)
22. Hotspot Distribution
• a hotspot distribution consists of normal, popular, and hot/flash sets
• such distributions describe a wide range of natural processes, traffic in particular
[Figure: two panels of log(traffic volume) vs. an ordered list of 50 traffic sources, at Magnitude=2 and Magnitude=10; labels mark the normal set, the hotspots, and a hotspot under a flash event]
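A distribution of this shape can be sketched by mixing a large "normal" set with a small "hot" set scaled by a magnitude factor. The lognormal choice, the parameter values, and the `hotspot_traffic` helper below are assumptions for illustration, not the cited model.

```python
import random

def hotspot_traffic(n_sources=50, n_hot=5, magnitude=10, seed=1):
    """Sample per-source traffic volumes: a bulk of 'normal' sources plus
    a small 'hot' set scaled up by `magnitude` (a flash event would scale
    the hot set further).  Returned in ordered-list form, as on the plot."""
    rng = random.Random(seed)
    normal = [rng.lognormvariate(0, 0.5) for _ in range(n_sources - n_hot)]
    hot = [magnitude * rng.lognormvariate(0, 0.5) for _ in range(n_hot)]
    return sorted(normal + hot)

vols = hotspot_traffic(magnitude=10)
# the hot set dominates the top of the ordered list, giving the
# characteristic step seen in the Magnitude=10 panel
```

Raising `magnitude` from 2 to 10 reproduces the widening gap between the normal set and the hotspots shown in the two panels.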