Silicon Photonics for Extreme Scale Computing – Challenges & Opportunities
Keren Bergman
Department of Electrical Engineering
Lightwave Research Lab, Columbia University
New York, NY, USA
Trends in extreme HPC
• Evolution of the top 10 over the last six years:
– Average total compute power:
• 0.86 PFlops → 21 PFlops
• ~24x increase
– Average nodal compute power:
• 31 GFlops → 600 GFlops
• ~19x increase
– Average number of nodes:
• 28k → 35k
• ~1.3x increase
⇒ Node compute power is the main contributor to performance growth
[Figure: average of the top 10 systems relative to 2010, 2010–2016: total system compute power ~24x, node compute power ~19x, number of nodes ~1.3x]
[top500.org, S. Rumley, et al. Optical Interconnects for Extreme
Scale Computing Systems, Elsevier PARCO 64, 2017]
Interconnect trends
• Top 10 average node-level evolution:
– Average node compute power:
• 31 GFlops → 600 GFlops
• ~19x increase
– Average bandwidth available per node:
• 2.7 GB/s → 7.8 GB/s
• ~3.2x increase
– Average byte-per-flop ratio:
• 0.06 B/Flop → 0.01 B/Flop
• ~6x decrease
• Sunway TaihuLight (#1) shows only 0.004 B/Flop!
⇒ Growing gap in interconnect bandwidth
[Figure: average of the top 10 systems relative to 2010, 2010–2016: node compute power ~19x, node bandwidth ~3.2x, byte-per-flop ratio ~0.17x]
[top500.org, S. Rumley, et al. Optical Interconnects for Extreme
Scale Computing Systems, Elsevier PARCO 64, 2017]
Exascale interconnects – power and cost constraints
• The real Exascale goal: reaching an Exaflop…
• …while satisfying constraints (20 MW, $200M)
• …with reasonably useful applications
• Bandwidth sizing: 1.25 ExaFLOP × 0.01 B/FLOP = 125 Pb/s injection BW; × 4 hops = 500 Pb/s installed BW
• Assume 15% of the $ budget for the interconnect:
– 15% × $200M / 500 Pb/s = 6 ¢/Gb/s
– Bi-directional links must thus be sold for ~10 ¢/Gb/s
– Today: optical ~$10/Gb/s, electrical ~$0.1–1/Gb/s
• Assume 15% of the power budget for the interconnect:
– 15% × 20 MW / 125 Pb/s = 24 mW/Gb/s = 24 pJ/bit = budget for communicating a bit end-to-end
⇒ 6 pJ/bit per hop
⇒ 4 pJ/bit for switching (today ~20 pJ/bit)
⇒ 2 pJ/bit for transmission (today ~10 pJ/bit electrical, ~20 pJ/bit optical)
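A minimal sketch of this budget arithmetic, using only the figures stated on this slide; the 15% interconnect share of both budgets is the slide's own assumption:

```python
# Back-of-the-envelope Exascale interconnect budget, using the figures above.
INJECTION_BW_BPS = 125e15     # 125 Pb/s injection bandwidth (1.25 ExaFLOP at 0.01 B/FLOP)
INSTALLED_BW_BPS = 500e15     # 500 Pb/s installed bandwidth (4 hops on average)
MACHINE_COST_USD = 200e6      # $200M machine budget
MACHINE_POWER_W = 20e6        # 20 MW power budget
INTERCONNECT_SHARE = 0.15     # assumption from the slide: 15% of both budgets

cost_per_gbps = INTERCONNECT_SHARE * MACHINE_COST_USD / (INSTALLED_BW_BPS / 1e9)
energy_pj_per_bit = INTERCONNECT_SHARE * MACHINE_POWER_W / INJECTION_BW_BPS * 1e12

print(f"Allowed cost   : {cost_per_gbps * 100:.0f} cents/Gb/s")       # ~6 cents/Gb/s
print(f"Allowed energy : {energy_pj_per_bit:.0f} pJ/bit end-to-end")  # ~24 pJ/bit
print(f"Per hop (of 4) : {energy_pj_per_bit / 4:.0f} pJ/bit")         # ~6 pJ/bit
```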
The Photonic Opportunity for Data Movement
• Energy-efficient, low-latency, high-bandwidth data interconnectivity is the core challenge to continued scalability across computing platforms
• Energy consumption is completely dominated by the costs of data movement
• Bandwidth taper from chip to system forces extreme locality
[Figure: two panels, "Reduce Energy Consumption" (data-movement energy in pJ/bit at chip edge, PCB, rack, and system scales, electronic vs. photonic) and "Eliminate Bandwidth Taper" (bandwidth density in Gb/s/mm, electronic vs. photonic)]
Default interconnect architecture
[Diagram: N compute nodes in the system, grouped in racks; short-distance node-router (NR) links, short-distance router-router links, and long-distance router-router links; electrical transceivers on the shorter links, optical transceivers currently on 10–20% of router-router links]
Exascale optical interconnect?
• Requirements for that to happen:
– Divide cost by at least 1.5 orders of magnitude
– Improve energy efficiency by at least one order of magnitude
[Diagram: N compute nodes, router cores, and long-distance RR links, with electrical and optical transceivers]
Exascale supercomputing node
• Consider 50k nodes
• Total injection bandwidth ~100 Pb/s
• 4 ZB per year at 1% utilization
• Total cumulated unidirectional bandwidth: ~500 Pb/s
Compute power: 10 to 30 Teraflop (TF) per node
Interconnect bandwidth: 0.01 B/F → 0.8–2.4 Tb/s
Bulk memory bandwidth: 0.1 B/F → 8–24 Tb/s
Near memory bandwidth: 10–30 TF × 0.5 B/F × 8 bit/B = 40–120 Tb/s
Requirements for next-generation interconnects!
Ideal case: back to 0.01 B/F to ensure well-fed nodes
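As a quick check of these per-node figures, a minimal sketch using the node sizes and byte-per-flop ratios stated above:

```python
# Per-node bandwidth requirements for a 10-30 TF node (byte-per-flop ratios from the slide).
def node_bandwidths_tbps(teraflops):
    flops = teraflops * 1e12
    to_tbps = lambda bytes_per_flop: flops * bytes_per_flop * 8 / 1e12  # B/F -> Tb/s
    return {
        "interconnect (0.01 B/F)": to_tbps(0.01),
        "bulk memory (0.1 B/F)": to_tbps(0.1),
        "near memory (0.5 B/F)": to_tbps(0.5),
    }

for tf in (10, 30):
    print(f"{tf} TF node:", node_bandwidths_tbps(tf))
# 10 TF -> 0.8 / 8 / 40 Tb/s;  30 TF -> 2.4 / 24 / 120 Tb/s
```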
Silicon Photonics: all the parts
• Silicon as core material
– High refractive index and high index contrast; sub-micron cross-sections and small bend radii
• Small footprint devices
– 10 μm – 1 mm scale compared to
cm-level scale for telecom
• Low power consumption
– Can reach <1 pJ/bit per link
• Aggressive WDM platform
– Bandwidth densities of 1–2 Tb/s per I/O pin
[Device images: modulation, WDM modulation & demultiplexing, switching, detection]
• Silicon wafer-scale CMOS
– Integration, density scaling
– CMOS fabrication tools
– 2.5D and 3D platforms
S. Rumley et al. "Silicon Photonics for Exascale Systems”, IEEE JLT 33 (4), 2015.
Photonic Computing Architectures: Beyond Wires
• Leverage dense WDM bandwidth density
• Photonic switching
• Distance-independent, cut-through, bufferless
• Bandwidth-energy optimized interconnects
[Diagram: conventional hop-by-hop data movement across on-chip, short-distance PCB, and long-distance PCB segments requires 12 conversions; a fully flattened end-to-end optical link requires no conversion]
Silicon Photonic Link Design
• Co-existence of Electronics and Photonics
• Energy-Bandwidth optimization
[Link diagram: comb laser, vertical grating coupler, coupled waveguides, OOK modulation with thermal tuning, fiber, demultiplexing filter, photodiode, and amplifier/TIA, together with electronic clock generation and distribution, data serialization, drivers, and deserialization]
[1] M. Bahadori et al., "Comprehensive Design Space Exploration of Silicon Photonic Interconnects," IEEE JLT 34 (12), 2016.
Utilization of Optical Power Budget
[Diagram: photonic elements from electronics to electronics; the available laser power (max 5 dBm per λ) is consumed by coupler losses, the modulator array penalty, and the demux array penalty before reaching the receiver sensitivity level; the remaining extra budget determines the maximum link utilization]
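A minimal sketch of this budget bookkeeping; the loss and penalty values below are illustrative placeholders (hypothetical), not measured device numbers:

```python
# Toy optical power budget for one wavelength; all loss and penalty values below
# are illustrative placeholders, not measured device numbers.
laser_power_dbm = 5.0              # maximum available laser power per wavelength
coupler_loss_db = 1.5              # per fiber-chip coupler (hypothetical)
modulator_array_penalty_db = 4.0   # modulator array insertion loss + crosstalk (hypothetical)
demux_array_penalty_db = 3.0       # demultiplexing filter array penalty (hypothetical)
receiver_sensitivity_dbm = -12.0   # receiver sensitivity at the target rate (hypothetical)
num_couplers = 3                   # laser-to-chip, chip-to-fiber, fiber-to-receiver chip

received_dbm = (laser_power_dbm
                - num_couplers * coupler_loss_db
                - modulator_array_penalty_db
                - demux_array_penalty_db)
extra_budget_db = received_dbm - receiver_sensitivity_dbm
print(f"Received power: {received_dbm:.1f} dBm, extra budget: {extra_budget_db:.1f} dB")
# A positive extra budget is what bounds the maximum link utilization.
```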
All-Parameter Optimization: Min Energy Design
[Figures: photonic elements from electronics to electronics; optimal design as a function of physical device dimensions, and breakdown of energy consumption for a given bandwidth, reaching ~1 pJ/bit]
Highly parallel, "SERDES-less" links
• Investigate "many-channel" architectures with low-bitrate wavelengths
– Leads to poor laser utilization
• Only possible with high laser efficiency
– But may allow drastic simplification of drivers and SERDES blocks
[Figure: link energy efficiency (pJ/bit) vs. optical channel (λ) linerate (3–25 Gb/s) for a 400G link, at 10% and 30% laser efficiency]
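To illustrate the trade-off, here is a toy model of energy per bit vs. per-channel linerate; the constants (per-λ optical power, circuit and serialization energies) are hypothetical placeholders, not values from the reference design:

```python
# Toy model: link energy per bit vs. per-channel linerate, for two laser efficiencies.
# All constants are illustrative placeholders, not the reference design's values.
def link_energy_pj_per_bit(linerate_gbps, laser_efficiency,
                           optical_mw_per_lambda=0.5,  # optical power each channel must deliver (toy)
                           circuit_pj_per_bit=0.2,     # modulator driver + receiver circuits (toy)
                           serdes_pj_per_gbps=0.03):   # serialization cost grows with linerate (toy)
    # mW divided by Gb/s is pJ/bit, so no unit conversion factor is needed.
    laser_pj = optical_mw_per_lambda / laser_efficiency / linerate_gbps
    return laser_pj + circuit_pj_per_bit + serdes_pj_per_gbps * linerate_gbps

for eff in (0.10, 0.30):
    for rate_gbps in (3, 5, 10, 15, 20, 25):
        e = link_energy_pj_per_bit(rate_gbps, eff)
        print(f"laser eff {eff:.0%}, {rate_gbps:2d} Gb/s per channel -> {e:.2f} pJ/bit")
# Low per-channel rates only pay off when the laser wall-plug efficiency is high.
```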
Cost per bandwidth – declining but slowly
• Today (2017):
– 100G (EDR) offers the best $/Gb/s figure
– Copper cables have shorter reaches due to the higher bit-rate
– Optics: not even half an order of magnitude price drop over 4 years
– But the electrical-optical gap is shrinking
[Besta et al., "Slim Fly: A Cost Effective Low-Diameter Network Topology", SuperComputing 2014]
[Figure: cost ($/Gb/s) vs. cable length (m) for Mellanox EDR 100 Gb/s cables, May 2017]
Packaging and connectors drive costs
[Molex]
[Barwicz/IBM, OFC 2017]
Toward integrated optical transceivers
• Fundamental step to reduce cost and power of optics: co-integration
[Diagram: nodes, router cores, short-distance NR links, and long-distance RR links, with electrical and optical transceivers]
Towards integrated, general-purpose optical transceivers
• Fundamental step to reduce cost and power of optics
[Diagram: nodes, router cores, and short-distance NR links, now served by integrated optical transceivers alongside electrical transceivers]
Cost vs. energy
• Transceiver procurement cost dominates TCO
– Matching Cori's energy/procurement ratio (30% / 70%) would require ~$0.4/Gb/s transceivers
• This is a factor of 10 below the current cost figure
[Figure: split between lifetime energy cost ($100/MWh) and procurement cost for: Desktop PC (5 years, 100 W, $2000); Roadrunner (5 years, 2.35 MW, $100M); Cori (8 years, 3.9 MW, $100M); Titan (6 years, 8.2 MW, $97M); typical 100G transceivers (8 years, 2.5 W, $500); hypothetical $1/Gb/s transceivers (8 years); hypothetical $0.1/Gb/s transceivers (8 years)]
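A minimal sketch of this bookkeeping, using only the device figures listed above; the exact split for full systems also depends on assumptions (utilization, PUE) not shown here:

```python
# Lifetime energy cost vs. procurement cost at $100/MWh (device figures from the slide).
HOURS_PER_YEAR = 24 * 365
ENERGY_PRICE_PER_MWH = 100.0

def lifetime_energy_usd(power_w, years):
    return power_w / 1e6 * HOURS_PER_YEAR * years * ENERGY_PRICE_PER_MWH

# Typical 100G transceiver: 8 years, 2.5 W, $500 purchase price.
energy_usd = lifetime_energy_usd(2.5, 8)                  # ~$18 of electricity
print(f"100G transceiver: ${energy_usd:.0f} energy vs. $500 procurement")

# Procurement price that would give Cori's 30%/70% energy/procurement split:
procurement_usd = energy_usd * (0.70 / 0.30)               # ~$41 per transceiver
print(f"-> ~${procurement_usd / 100:.2f}/Gb/s for a 100 Gb/s transceiver")  # ~0.4 $/Gb/s
```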
The bright side of optical switching
• Optical switches can have astonishing aggregate bandwidths
– No signal introspection is needed
– Each port can receive dense WDM signals (> 1 Tb/s)
• This translates into alluring $/Gb/s figures:
⇒ Optical switching is >10x cheaper than electrical packet routing…
Type                     | # ports   | Bandwidth per port                                   | Total bandwidth | Price | Price per Gb/s
Calient S320 MEMS switch | 320 ports | 400 Gb/s (16 wavelengths at 25G, CDAUI-16 signaling) | 128 Tb/s        | $40k  | 0.3 $/Gb/s
Mellanox SB7790          | 36 ports  | 100 Gb/s (4x EDR InfiniBand)                         | 3.6 Tb/s        | $12k  | 3.3 $/Gb/s
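The last column follows directly from the others; a trivial check:

```python
# Price-per-bandwidth check for the two switches in the table above.
def price_per_gbps(ports, gbps_per_port, price_usd):
    return price_usd / (ports * gbps_per_port)

print(f"Calient S320   : {price_per_gbps(320, 400, 40_000):.2f} $/Gb/s")  # ~0.31
print(f"Mellanox SB7790: {price_per_gbps(36, 100, 12_000):.2f} $/Gb/s")   # ~3.33
```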
Optical switch vs. electrical packet router
• (Random access) buffering plays a crucial role:
– Without buffering, end-to-end scheduling is required
– With buffering, scheduling can be done link-by-link
– Without buffering, no back-to-back transmissions
• Packet routers also allow differentiated QoS, error correction, etc.
⇒ A packet router offers much higher "value" than an optical switch
My personal guess: at least 30x more value (for a 10 ns optical switching time)
⇒ Optical switches are not competing with packet routers for "daily switching"
[Diagram: optical switch (bufferless) vs. electrical packet router (buffered)]
Our FPGA-Controlled Switch Test-Bed
• 16 high-speed DACs enable test and control of the integrated photonic switch circuit
[Photos and plot: switch chip with PIN shifters, thermal tuners, 3 dB couplers, and a grating coupler array, driven by an FPGA with 16 high-speed DACs/ADCs; on-chip loss (dB) by path category (B: bar, C: cross, W: waveguide, X: crossing)]
Transitioning to Novel Modular Architectures…
• Modular architecture and control plane
• Avoids on-chip crossings
• Fully non-blocking
• Path-independent insertion loss
• Low crosstalk
[Dessislava Nikolova*, David M. Calhoun*, Yang Liu, Sébastien Rumley, Ari Novack, Tom Baehr-Jones, Michael Hochberg, Keren Bergman, "Modular architecture for fully non-blocking silicon photonic switch fabric," Nature Microsystems & Nanoengineering 3 (1607), Jan. 2017]
AIM Datacom 2nd Tapeout Run
• More effort poured into monolithically integrated photonic devices
• Novel switch devices and test structures submitted to the AIM platform
#  | Device                                    | Area
1  | 4x4x4 λ space-and-wavelength switch       | 1.9 mm x 2.6 mm
2  | 4x4 Si space switch                       | 1.4 mm x 2.3 mm
3  | 4x4 Si/SiN two-layered space switch       | 1.5 mm x 2.3 mm
4  | 2x2 double-gated/single-gated ring switch | 0.8 mm x 1.4 mm
5  | Crossing and escalator test structure     | 0.6 mm x 1 mm
6  | 1x2x8 λ MUX with rings                    | 1.2 mm x 0.2 mm
7  | 1x2x4 λ MUX with micro-disks              | 0.6 mm x 0.2 mm
8  | 2x2 double-gated MZM switch               | 3 mm x 0.4 mm
What optical networks are good at: bandwidth steering
• Put your fibers where the traffic is (a toy sketch of the idea follows below)
• Example: 3D stencil traffic over a Dragonfly
– 20 groups, 462 nodes per group, 24x24x16 stencil
• Per-workload reconfiguration avoids switching time overhead
• No need for ultra-large radixes: 8x8 is sufficient
[1] K. Wen, et al. “Flexfly: Enabling a Reconfigurable Dragonfly through Silicon Photonics”, SC 2016
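A toy sketch of the bandwidth-steering idea: greedily dedicate the reconfigurable inter-group links to the heaviest entries of a (hypothetical) group-to-group traffic matrix. This only illustrates the concept; it is not the Flexfly algorithm of [1].

```python
# Toy bandwidth-steering sketch: greedily dedicate reconfigurable inter-group links
# to the heaviest entries of a (hypothetical) group-to-group traffic matrix.
# This illustrates the idea only; it is not the Flexfly algorithm of [1].
from itertools import combinations

def steer(traffic, steerable_links_per_group):
    groups = range(len(traffic))
    budget = {g: steerable_links_per_group for g in groups}
    pairs = sorted(combinations(groups, 2),
                   key=lambda p: traffic[p[0]][p[1]] + traffic[p[1]][p[0]],
                   reverse=True)
    steered = []
    for a, b in pairs:
        if budget[a] > 0 and budget[b] > 0:   # both groups still have a free steerable port
            steered.append((a, b))
            budget[a] -= 1
            budget[b] -= 1
    return steered

# Example: 4 groups with stencil-like traffic concentrated on neighbouring groups.
traffic = [[0, 9, 1, 1],
           [9, 0, 8, 1],
           [1, 8, 0, 7],
           [1, 1, 7, 0]]
print(steer(traffic, steerable_links_per_group=1))   # -> [(0, 1), (2, 3)]
```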
Intra-node bandwidth steering
[Diagram: node with GPU1–GPU4, CMP1 and CMP2, NIC1 and NIC2, and memories on a unified interconnect fabric]
• Emerging architectural concept:
unified interconnect fabric
– AMD’s Infinity Fabric
– HPE’s “The Machine”
– Gen-Z
• Highly flexible!
• But involves many hops
– Latency
– Cost/power of routers
– Many chip-to-chip PHYs
Intra-node bandwidth steering
[Diagram: the same node, with links among GPU1–GPU4, CMP1 and CMP2, NIC1 and NIC2, and memories steered by low-radix optical switches]
• Alternative concept: use many low-radix optical switches
– 8x8 realizable with today’s technology
– Tens of switches can be collocated on a single chip
• Somewhat less flexible than the packet-routed counterpart
– Not all-to-all
– Reconfiguration takes microseconds
• But transparent for packets
– Point-to-point latency
– Energy efficient
Conventional architecture
[Diagram: conventional arrangement of GPU1–GPU4, CMP1 and CMP2, NIC1 and NIC2, and memories]
GPU-centric / CMPs as data accelerators
[Diagram: the same node rearranged GPU-centrically, with CMP1 and CMP2 acting as data accelerators]
Conclusions
• Lack of bandwidth is threatening scalability
• For Exascale, need to work on (priority-sorted)
– Costs of the optical part
• Automated packaging and testing
• Increased integration
• Larger market
– Power of electrical part
• Packet routers
– Finely cost/performance-optimized topologies [1]
• Taper optical bandwidth, but not too much
• Get as much as we can from optical cables
– Power of optical part
• Energy-wise optimized designs
• Improved laser efficiency (technology or “tricks”)
• Technological advances
– Costs of electrical part
[1] M.Y. Teh, et al. “Design space exploration of the Dragonfly topology”, Exacomm workshop (best paper), 2017
Conclusions
• All-optical interconnects
• Optical switches and packet routers are not directly comparable
• Bandwidth-steering based architectures should be explored
• Optical switches used in addition to regular routers
[Diagrams: the conventional node architecture and the GPU-centric, bandwidth-steered alternative]
Thank you!