The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible under traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other effects, has resulted in stagnating processor clock frequencies. The current industry trend is to add more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area.
RISC-V and OpenPOWER open-ISA and open-HW - a Swiss army knife for HPC | Ganesan Narayanasamy
To cope with the slowing of Moore’s law and the end of Dennard scaling, the world of High-Performance Computing is rapidly evolving toward high-throughput architectures with specialized hardware for vector and tensor operations, in conjunction with sophisticated power-management subsystems. The RISC-V ISA and open hardware can prove as effective in fostering innovation in the HPC market as they have in the embedded one. In this talk, I will introduce a set of building blocks for future HPC systems that we have been designing at ETH Zurich and the University of Bologna.
Microsoft Project Olympus AI Accelerator Chassis (HGX-1) | inside-BigData.com
In this video from the Open Compute Summit, Siamak Tavallaei from Microsoft presents an overview of the Microsoft Project Olympus AI Accelerator Chassis, also known as the HGX-1.
Watch the presentation video: http://wp.me/p3RLHQ-guX
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Early Benchmarking Results for Neuromorphic Computing | DESMOND YUEN
An update on the Intel Neuromorphic Research Community’s growth and benchmark results, including the addition of new corporate members and numerous new benchmarking updates computed on Intel’s neuromorphic test chip, Loihi.
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,... | IBM Research
IBM and the Netherlands Institute for Radio Astronomy (ASTRON) have unveiled the world’s first water-cooled 64-bit microserver. The prototype, which is roughly the size of a smartphone, is part of the proposed IT roadmap for the Square Kilometre Array (SKA), an international consortium to build the world’s largest and most sensitive radio telescope. Scientists estimate that the processing power required to operate the telescope will be equal to several million of today’s fastest computers.
The team behind the microserver has designed and demonstrated a prototype 64-bit microserver using a PowerPC-based chip from Freescale Semiconductor running Fedora Linux and IBM DB2. At 133 × 55 mm², the microserver contains all of the essential functions of today’s servers, which are 4 to 10 times larger.
Not only is the microserver compact, it is also very energy-efficient. One of its innovations is hot-water cooling, which, in addition to keeping the chip operating temperature below 85°C, also transports electrical power by means of a copper plate. The concept is based on the same technology IBM developed for the SuperMUC supercomputer located outside Munich, Germany. IBM scientists hope to keep each microserver operating at 35–40 watts including the system on a chip (SoC); the current design draws 60 watts.
The next step for the scientists is to combine 128 of the microserver boards, using the newest T4240 chips, into a 2U rack unit with 1536 cores, 3072 threads, and up to 6 terabytes of DRAM. In addition, they will add an Ethernet switch and a power module to the integrated water cooling.
In this deck from the HPC Advisory Council Spain Conference, Dan Olds from OrionX discusses the High Performance Interconnect (HPI) market landscape, plus provides ratings and rankings of HPI choices today.
"The HPI market is the very high-end of the networking equipment market where high bandwidth and low latency are non-negotiable. It started out as a specialist proprietary segment but has blossomed into an indispensable, large, and growing area. Products in this category are used to build extreme-scale computing systems. They are typically not used for traditional telco, enterprise, or service provider networking needs. In this talk, we’ll take a look at the technologies and performance of their high-end technology and the coming battle between onloading vs. offloading interconnect architectures."
Watch the video presentation: http://wp.me/p3RLHQ-fON
Learn more: http://orionx.net/wp-content/uploads/2016/06/HPI-Environment-OrionX-Constellation-DataCenter-20160626.pdf
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this video from the HPC User Forum in Santa Fe, Yoonho Park from IBM presents: IBM Datacentric Servers & OpenPOWER.
"Big data analytics, machine learning and deep learning are among the most rapidly growing workloads in the data center. These workloads have the compute performance requirements of traditional technical computing or high performance computing, coupled with a much larger volume and velocity of data."
Watch the video: http://wp.me/p3RLHQ-gJv
Learn more: https://openpowerfoundation.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the Datacenter | Linaro
Session ID: HKG18-500K1
Session Name: HKG18-500K1 - Keynote: Dileep Bhandarkar - Emerging Computing Trends in the Datacenter
Speaker: Not Available
Track: Keynote
★ Session Summary ★
For decades we were able to take advantage of Moore’s Law to improve single-thread performance and reduce power and cost with each generation of semiconductor technology. While technology has continued to advance since the end of Dennard scaling more than 10 years ago, the advances have slowed down. Server performance increases have instead relied on increasing core counts and power budgets.
At the same time, workloads have changed in the era of cloud computing. Scale-out is becoming more important than scale-up. Domain-specific architectures have started to emerge to improve the energy efficiency of emerging workloads like deep learning.
This talk will provide a historical perspective and discuss emerging trends driving the development of modern server processors.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/hkg18/hkg18-500k1/
Presentation: http://connect.linaro.org.s3.amazonaws.com/hkg18/presentations/hkg18-500k1.pdf
Video: http://connect.linaro.org.s3.amazonaws.com/hkg18/videos/hkg18-500k1.mp4
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2018 (HKG18)
19-23 March 2018
Regal Airport Hotel Hong Kong
---------------------------------------------------
Keyword: Keynote
http://www.linaro.org
http://connect.linaro.org
---------------------------------------------------
Design Considerations, Installation, and Commissioning of the RedRaider Cluster at the Texas Tech University High Performance Computing Center
Outline of this talk
HPCC Staff and Students
Previous Clusters
• History, performance, usage patterns, and experience
Motivation for Upgrades
• Compute capacity goals
• Related considerations
Installation and Benchmarks
Conclusions and Q&A
Delivering Carrier Grade OCP for Virtualized Data Centers | Radisys Corporation
This webinar explores the requirements for carrier grade Open Compute Project (OCP) infrastructure for virtualized telecom data centers delivering SDN and NFV for digital services.
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility | inside-BigData.com
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Securing your Kubernetes cluster: a step-by-step guide to success! | KatiaHIMEUR1
Today, after several years of existence, an extremely active community, and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
DevOps and Testing slides at DASA Connect | Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
JMeter webinar - integration with InfluxDB and Grafana | RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana (a small illustrative snippet follows this list):
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
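As a rough illustration of the plumbing involved (not taken from the webinar), the sketch below pushes one JMeter-style latency sample to InfluxDB 1.x over its HTTP write endpoint. The database name, measurement, and value are hypothetical, and in practice JMeter's Backend Listener does this automatically:

# Minimal sketch: push one latency sample to InfluxDB 1.x over HTTP.
# Assumes a local InfluxDB with a database named "jmeter" (hypothetical names).
import time
import requests

line = "response_time,label=HomePage value=187 {}".format(time.time_ns())
resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "jmeter", "precision": "ns"},
    data=line,
)
resp.raise_for_status()  # InfluxDB answers 204 No Content on success

Grafana would then query the "jmeter" database and plot response_time over time.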
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality | Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... | Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas struggle to keep up with the competition. However, fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and Sales | Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 3 | DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
• UI automation introduction
• UI automation sample
• Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... | DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well; a minimal example appears below.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
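A minimal example of that Python binding, assuming the pypowsybl package is installed (pip install pypowsybl). This is a generic illustration, not the webinar's notebook:

# Minimal sketch: load a bundled test network and run an AC power flow.
import pypowsybl as pp

network = pp.network.create_ieee14()    # bundled IEEE 14-bus test case
results = pp.loadflow.run_ac(network)   # run the AC power flow
print(results[0].status)                # convergence status of the main component
print(network.get_buses().head())       # bus data as a pandas DataFrame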
Generating a custom Ruby SDK for your web service or Rails API using Smithy | g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... | James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Expectations for optical network from the viewpoint of system software research
1. Expectations for optical network from the viewpoint of system software research
Ryousei Takano
National Institute of Advanced Industrial Science and Technology (AIST)
Special session on challenges and opportunities of integrated photonics in future datacenters
ACSI 2015@Tsukuba, 27 Jan. 2015
2. Outline
• Trends in datacenter research and development
• AIST IMPULSE Project
• Workload analysis
• Proposed architecture: dataflow-centric computing
3. Introduction
• Big data is a killer app in datacenters; it requires a clean-slate architecture like "disaggregation" or "datacenter in a box".
• Optical network is key to making them.
• Optical path network (all-optical path end-to-end) in a datacenter
– Pros: huge bandwidth, energy efficiency
– Cons: path switching latency, utilization
• To take advantage of an optical path network, a new datacenter OS is essential.
– Key idea: control/data plane separation
4. Optical Network in DCs
• Similar-concept ("disaggregation" or "datacenter in a box") projects have been launched recently:
– Open Compute Project (Facebook)
– Rack Scale Computing (Intel)
– Extremely Shrinking Computing (IBM)
– The Machine (HP)
– FireBox (UCB)
– CTR Consortium (MIT)
• Optical network, including photonic-electronic convergence and short-reach (<1 km) interconnection, is key to drive innovation in future datacenters.
5. [Figure: Facebook datacenter architecture - Front-End Cluster (Web: 250 racks; Ads: 30 racks; Cache: ~144 TB; Multifeed: 9 racks), Service Cluster (Search, Photos, Msg, Others), Back-End Cluster (UDB, ADS-DB, TaoLeader, other small services). Source: "Flash at Facebook", Flash Summit 2013]
Five Standard Servers:
Type I (Web) - CPU: High (2 x E5-2670); Memory: Low; Disk: Low; Services: Web, Chat
Type III (Database) - CPU: High (2 x E5-2660); Memory: High (144 GB); Disk: High IOPS (3.2 TB Flash); Services: Database
Type IV (Hadoop) - CPU: High (2 x E5-2660); Memory: Medium (64 GB); Disk: High (15 x 4 TB SATA); Services: Hadoop (big data)
Type V (Photos) - CPU: Low; Memory: Low; Disk: High (15 x 4 TB SATA); Services: Photos, Video
Type VI (Feed) - CPU: High (2 x E5-2660); Memory: High (144 GB); Disk: Medium; Services: Multifeed, Search, Ads
6. Open Compute Project
• OCP was founded by Facebook in April 2011 to openly share designs of datacenter products.
• Shift from commodity products to user-driven design to improve the energy efficiency of large-scale datacenters (PUE = total facility energy / IT equipment energy):
– Industry standard: 1.9 PUE
– Open Compute Project: 1.07 PUE
• Specifications: server, storage, rack, network switch, etc.
• Products: Quanta Rackgo X, GIGABYTE DataCenter Solution
[Photo: Open Compute Rack v2]
8. HP "The Machine"
"The Machine could be six times more powerful than an equivalent conventional design, while using just 1.25 percent of the energy and being around 1/100 the size."
http://www.hpl.hp.com/research/systems-research/themachine/
9. Datacenter in a Box
• The Machine is six times faster with 1.25 percent of the energy compared with the K computer.
[Chart: HPC Challenge's RandomAccess benchmark]
10. The Machine: Architecture
[Figure: compute, memory, NV memory, and storage elements connected by a photonic interconnect]
Architecture evolution/revolution: a "Computing Ensemble" - bigger than a server, smaller than a datacenter, with built-in system software
– Disaggregated pools of uncommitted compute, memory, and storage elements
– Optical interconnects enable dynamic, on-demand composition
– Ensemble OS software using virtualization for composition and management
– Management and programming virtual appliances add value for IT and application developers
11. Machine OS
• Linux++: a Linux-based OS for The Machine
– A new concept of memory management
– An emulator to make a conventional computer behave like The Machine
– A developer's preview released in June 2015?
• Carbon
– HP will replace Linux++ with Carbon.
12. UC Berkeley FireBox Overview
[Figure: up to 1000 SoCs plus high-bandwidth memory (100,000 cores total) and up to 1000 non-volatile memory modules (100 PB total), connected by 1 Terabit/sec optical fibers through high-radix switches; the inter-box network provides many short paths through high-radix switches]
A similar concept to The Machine.
14. IMPULSE: Initiative for Most Power-efficient Ultra-Large-Scale data Exploration
[Figure: packaging roadmap from separated packages (2014) to 2.5D stacked packages (2020) to 3D stacked packages (2030), with logic, NVRAM, and I/O connected by an optical network in the future datacenter]
• High-Performance Logic Architecture: 3D build-up integration of the front-end circuits, including high-mobility Ge-on-insulator FinFETs / AIST-original TCAD
• Non-Volatile Memory: voltage-controlled magnetic RAM, mainly for cache and work memories
• Optical Network: silicon photonics cluster switches / optical interconnect technologies
• Future datacenter architecture design / dataflow-centric warehouse-scale computing
15. AIST's IMPULSE Program
IMPULSE (Initiative for Most Power-efficient Ultra-Large-Scale data Exploration) is a STrategic AIST integrated R&D (STAR) program. (*A STAR program is AIST research that is expected to produce a large outcome in the future.)
[Figure: architecture for concentrated data processing for HPC and big data - high-performance server modules combining 3D-installed non-volatile memory and energy-saving logic with optical paths between chips; storage-class memory (non-volatile memory) and HDD storage form an energy-saving large-capacity storage tier, linked by an energy-saving high-speed network; goal: creating a rich and eco-friendly society]
16. Voltage-controlled Nonvolatile Magnetic RAM
[Figure: applications - nonvolatile CPU, nonvolatile cache, nonvolatile display, power-saved storage, NAND Flash; device structure - thin-film ferromagnetics with an insulation layer, keeping memory without power]
• Voltage Controlled Spin RAM: voltage-induced magnetic anisotropy change; less than 1/100 rewriting power
• Voltage Controlled Topological RAM: resistance change by the Ge displacement; loss by entropy < 1/100
17. Low Power High-performance Logic
Front-end 3D integration (wiring layer stacked over front-end layers):
• Dense integration without miniaturization
• Reduction of the wiring length for power saving
• Introduction of Ge and III-V channels by a simple stacking process
• Innovative circuits using the Z direction
Ge Fin CMOS technology (nMOS/pMOS Ge fins on an insulation layer; figure labels, translated from Japanese: source, drain, insulating film):
• Low power and high speed thanks to Ge
• Toward 0.4 V Ge Fin CMOS
19. Optical Network Technology for Future Datacenters
[Figure: a wavelength bank (optical comb) feeds DWDM, multi-level modulation optical interconnects (comb source, modulators, MUX/DEMUX, DSP, Tx/Rx) connecting 2.5D CPU cards (CPU/GPU plus memory cube) to datacenter server racks through silicon photonics cluster switches]
• Large-scale silicon-photonics-based cluster switches
• DWDM, multi-level modulation, highly integrated "elastic" optical interconnects
• Ultra-low energy consumption network by making use of optical switches
– Ultra-compact switches based on silicon photonics
– 3D integration by amorphous silicon
– A new server architecture
Link capacity scaling (No. of λs / order of modulation / bit rate):
1 / 1 / 20 Gbps
4 / 8 / 640 Gbps
32 / 8 / 5.12 Tbps
Current state-of-the-art Tx: 100 Gbps → ~5.12 Tbps
Current electrical switches: ~130 Tbps → ~500 Pbps
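The table's bit rates are consistent with a simple product, rate = 20 Gbps × modulation order × number of wavelengths, reading "order" as a per-lane multiplier (my interpretation; the slide does not spell this out):

# Sanity check of the slide's bit-rate table (interpretation assumed, not stated).
BASE_GBPS = 20
for n_lambda, order in [(1, 1), (4, 8), (32, 8)]:
    print(n_lambda, "wavelengths, order", order, "->", BASE_GBPS * order * n_lambda, "Gbps")
# -> 20 Gbps, 640 Gbps, 5120 Gbps (= 5.12 Tbps), matching the table.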
20. Architecture for Big Data and Extreme-scale Computing
Dataflow-centric warehouse-scale computing:
1 - A single OS controls the entire data center.
2 - Split the datacenter OS into a data plane and a control plane to guarantee real-time data processing.
[Figure: real-time big data flows from input through conversion and analysis to output; the datacenter OS handles optimal arrangement of the data flow, resource management, and monitoring, connecting universal processors/hardware and storage by using the optical network]
21. Performance Estimation
• Estimate the performance of typical HPC and big data workloads on a future datacenter system
• SimGrid simulator (http://simgrid.gforge.inria.fr)
– Simulator of large-scale distributed systems, such as grids, clouds, HPC, and P2P
[Figure: SimGrid overview - user code runs on grid user APIs (MSG: simple application-level simulator; SimDag: framework for DAGs of parallel tasks; SMPI: library to run MPI applications on top of a virtual environment) over the SURF virtual platform simulator and the XBT base toolbox, with tracing support; SimGrid is a generic simulation framework that takes platform topology, application deployment, and applicative workload as input and produces logs, statistics, and visualization]
22. Workload 1: Simple Message Passing
• Iteration of neighbor communication
• Big impact of increasing link bandwidth if an application is network-intensive: the relative execution time drops by up to 1/100.
Simulation parameters: #nodes: 10000; link bandwidth: 0.1, 1, 10 Tbps; link latency: 100 ns; CPU power: 10 TFLOPS; data size: 10^12 to 10^24 B
[Chart: relative execution time vs. compute power (FLO) and data size (bytes)]
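The shape of this result can be reproduced with a back-of-the-envelope step-time model. This is my illustration of the trend, not the SimGrid configuration behind the slide's chart:

# Back-of-the-envelope model of the neighbor-communication workload
# (illustrative only; the slide's numbers come from SimGrid, not this model).
def step_time(flop, data_bytes, flops=10e12, latency=100e-9, bw_bps=1e12):
    # compute + one neighbor exchange per iteration
    return flop / flops + latency + 8 * data_bytes / bw_bps

data = 1e12  # 1 TB exchanged per step (within the slide's 10^12..10^24 B range)
for bw in (0.1e12, 1e12, 10e12):  # 0.1, 1, 10 Tbps links, as on the slide
    t = step_time(flop=1e12, data_bytes=data, bw_bps=bw)
    print(f"{bw/1e12:>4} Tbps -> {t:.3f} s per step")
# When the transfer term dominates, a 100x faster link gives close to a 1/100 step time.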
23. Workload 2: HPC Application
• NAS Parallel Benchmark (256 procs, class C)
– Low latency is more important than huge bandwidth.
– The problem size is too small to utilize huge bandwidth.
[Charts: relative execution time - effect of reducing the link latency (CPU power 1 TFLOPS, link bandwidth 1 Tbps) and effect of increasing the link bandwidth (CPU power 1 TFLOPS, link latency 0.1 µs)]
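This observation follows from a simple transfer-time model, t = L + S/B: for the short messages typical of a class C run on 256 processes, the latency term L dominates, so shrinking L helps far more than growing B. A two-line illustration with assumed numbers (10 µs latency, 64 KB message, 1 Tbps link), not taken from the slide:

# Fraction of transfer time spent in latency for a small message (assumed numbers).
L, S, B = 10e-6, 64 * 1024 * 8, 1e12  # latency [s], message size [bits], bandwidth [bit/s]
print(L / (L + S / B))  # ~0.95: latency dominates, so extra bandwidth barely helps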
24. Workload 3: MapReduce
• KDD Cup 2012, Track 2: predict the click-through rate of ads (using Hadoop and Hivemall)
• Machine learning is CPU-intensive.
• The effect of huge bandwidth is limited, because:
– The concurrency of the model used is not enough.
– Hadoop is optimized to make jobs run faster on current I/O devices.
[Charts: execution time (seconds) vs. disk I/O bandwidth (Mbps) and vs. relative CPU power (base 172 GFLOPS); parameters: CPU power 17.2 TFLOPS, disk bandwidth 200 Mbps, network bandwidth 10 Gbps]
28. In-storage, Network Processing
• Hierarchical and partial reduce processing in each network node avoids network congestion and a serialized reduce.
• Compute modules are attached in storage to maximize the read throughput from storage.
[Figure: mappers feed a hierarchical shuffle-and-reduce tree]
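A minimal sketch of the idea in Python (my illustration, not AIST's implementation): each "network node" combines a bounded number of inputs before forwarding one partial result upward, so no single reducer or link handles all mapper outputs:

# Hierarchical partial reduce: switches combine children's values level by level.
from functools import reduce

def tree_reduce(values, fan_in=4, combine=lambda a, b: a + b):
    level = list(values)
    while len(level) > 1:
        # Each network node reduces up to fan_in inputs to one output.
        level = [reduce(combine, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

print(tree_reduce(range(1, 101)))  # 5050, same result as a flat serialized reduce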
30. In-storage, Network Processing: Hardware Design
[Figure: processing units (PU) with local memory (MEM) distributed on a chip, with direct optical I/O connections to the distributed non-volatile memory modules and communication over DWDM]
31. Direct Memory Copy over DWDM
• Assume a processor-memory embedded package with a WDM interconnect.
• Goal: fully utilize the huge I/O bandwidth realized by DWDM.
• Multiple memory blocks can be sent/received simultaneously using multiple wavelengths.
• A memory-centric network is a similar idea [PACT13].
[Figure: a single-package compute node with processor cores, cache/MMU, and memory banks; memory blocks travel over the WDM interconnect fed from the wavelength bank]
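A toy throughput model of this scheme (my sketch; the per-wavelength rate and bank bandwidth below are assumptions, not figures from the slide):

# Toy model: blocks move in parallel, one per wavelength, so copy time shrinks
# with the wavelength count until per-bank bandwidth becomes the bottleneck.
def copy_time(total_bytes, n_lambda, per_lambda_bps=160e9, bank_bps=2e12):
    effective = min(n_lambda * per_lambda_bps, bank_bps)  # assumed limits
    return 8 * total_bytes / effective

for n in (1, 4, 8, 16, 32):
    print(n, "wavelengths:", round(copy_time(1e9, n) * 1e3, 2), "ms for 1 GB")
# Scaling stops once ~13 wavelengths exceed the assumed 2 Tbps bank limit.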
32. Our Vision of Future Datacenter
[Figure: the same 2014 → 2020 → 2030 packaging roadmap as slide 14, from separated packages through 2.5D to 3D stacked packages of logic, NVRAM, and I/O over an optical network]
Goal: 100x energy efficiency of data processing
33. Dataflow Processing System
• DPF: Data Processing Function
• DPC: Data Processing Component
The datacenter OS plans the data flow of an application as a slice of DPCs, then co-allocates the DPCs and the network paths between them, with resource monitoring across server modules, storage, and the optical network.
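As a purely hypothetical sketch of what co-allocating DPCs and the paths between them could look like as an API (every name below is invented for illustration, none comes from the project):

# Hypothetical sketch: place each data processing component (DPC) on a server
# module and reserve an optical path between consecutive pipeline stages.
from dataclasses import dataclass

@dataclass
class DPC:
    name: str   # e.g. "input", "convert", "analyze", "output"
    node: str   # server module chosen by the scheduler

def plan_dataflow(stages, free_nodes):
    placed = [DPC(s, free_nodes.pop(0)) for s in stages]
    paths = [(a.node, b.node) for a, b in zip(placed, placed[1:])]
    return placed, paths  # the control plane would now reserve these paths

placed, paths = plan_dataflow(
    ["input", "convert", "analyze", "output"],
    ["node01", "node02", "node03", "node04"],
)
print(paths)  # [('node01', 'node02'), ('node02', 'node03'), ('node03', 'node04')]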
34. IMPULSE Datacenter OS
• A single OS for datacenter-wide optimization of energy efficiency and performance
• Separation of data plane and control plane:
– The data plane is an application-specific library OS.
– The control plane manages resources (servers, network, etc.) and can deploy, launch, destroy, and monitor data planes.
35. IMPULSE Datacenter OS (continued)
Data plane:
• Application-specific library OS (e.g., machine learning, data store)
• Mitigates the OS overhead to fully utilize high-performance devices
Control plane:
• Resource management
• Logical and secure resource partitioning for data planes
• Runs on the firmware
[Figure: multiple applications, each with its own data plane, run on CPU/GPU, memory, and I/O partitions managed by the control plane]
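To make the division of labor concrete, here is a hypothetical Python sketch (invented names, not IMPULSE code) of a control plane that partitions resources and launches application-specific data planes:

# Hypothetical sketch: the control plane owns resource accounting; each data
# plane gets a fixed partition and runs its own library OS, never a shared kernel.
class ControlPlane:
    def __init__(self, cpus, mem_gb):
        self.free_cpus, self.free_mem = cpus, mem_gb
        self.data_planes = {}

    def launch(self, app, cpus, mem_gb):
        assert cpus <= self.free_cpus and mem_gb <= self.free_mem
        self.free_cpus -= cpus
        self.free_mem -= mem_gb
        self.data_planes[app] = {"cpus": cpus, "mem_gb": mem_gb}

    def destroy(self, app):
        part = self.data_planes.pop(app)
        self.free_cpus += part["cpus"]
        self.free_mem += part["mem_gb"]

cp = ControlPlane(cpus=64, mem_gb=512)
cp.launch("machine-learning", cpus=32, mem_gb=256)  # app-specific library OS
cp.launch("data-store", cpus=16, mem_gb=128)
print(cp.free_cpus, cp.free_mem)  # 16 128 left for further data planes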
36. Related Work
• Datacenter-wide resource management
– OpenStack, Apache CloudStack, Kubernetes
– Hadoop YARN, Apache Mesos
• Dataflow processing engines
– Google Cloud Dataflow
– Lambda architecture
• Control/data plane separation in OS design
– Arrakis (U. Washington)
– IX (Stanford)
37. Summary
• New visions of future datacenters: "disaggregation" and "datacenter in a box"
• Optical network is key to making them.
• Hardware and software co-design is critical.
• An optical path network encourages control/data separation in a datacenter OS.
– The control plane manages resources and establishes a path between data processing components.
– The data plane fully utilizes the huge bandwidth.
39. References
• Rack scale architecture for Cloud, IDC 2013
– https://intel.activeevents.com/sf13/connect/fileDownload/session/6DE5FDFBF0D0854E73D2A3908D58E1E2/SF13_CLDS001_100.pdf
• Intel rack scale architecture overview, Interop 2013
– http://presentations.interop.com/events/las-vegas/2013/free-sessions---keynote-presentations/download/463
• New technologies that disrupt our complete ecosystem and their limits in the race to Zettascale, HPC 2014
– http://www.hpcc.unical.it/hpc2014/pdfs/demichel.pdf
• "Future server technology" that HP showed at its "Tech Power Club" (in Japanese), ASCII.jp
– http://ascii.jp/elem/000/000/915/915508/
40. ISSCC 2014 Trends
Optical Interconnect:
As the bandwidth demand for traditionally electrical wireline interconnects has accelerated, optics has become an increasingly attractive alternative for interconnects within computing systems. Optical communication offers clear benefits for high-speed and high-density interconnects. Relative to electrical interconnects, optics provides lower channel loss. Circuit design and packaging techniques that have traditionally been used for electrical wireline are being adapted to enable integrated optics with extremely low power, which has resulted in rapid progress in optical ICs for Ethernet, backplane, and chip-to-chip optical communication. ISSCC 2014 features a two-dimensional (12×5) optical array achieving an aggregate data rate of 600 Gb/s [8.2]. Pre-emphasis using group-delay equalization extends the useful data rate of a 25 Gb/s VCSEL to 40 Gb/s [8.9]. Additional examples include low-power linear and non-linear equalizers for electronic dispersion compensation in multi-mode and long-haul cables [8.1, 8.3].
Concluding Remarks:
Continuing to aggressively scale I/O bandwidth is both essential for the industry and extremely challenging. Innovations enabling higher performance and lower power will continue to be made in order to sustain this trend. Advances in circuit architectures, link topologies, and transistor scaling are together changing how I/O will be done over the next decade. The most exciting and significant of these emerging technologies for wireline I/O will be highlighted at ISSCC 2014.
[Figure: per-pin data rate vs. year (2000-2016, 1-50 Gbps) for common I/O standards: HyperTransport, QPI, PCIe, S-ATA, SAS, OIF/CEI, PON, Fibre Channel, DDR, GDDR; DRAM data bandwidth trends]
Non-Volatile Memories (NVMs):
Over the past decade, significant investment has been put into emerging memories to find an alternative to floating-gate-based non-volatile memory. The emerging NVMs, such as phase-change memory (PRAM), ferroelectric RAM (FeRAM), magnetic spin-torque-transfer RAM (STT-RAM), and resistive memory (ReRAM), are showing potential to achieve high cycling capability and lower power per bit in read/write operations. Some commercial applications, such as cellular phones, have recently started to use PRAM, demonstrating that reliability and cost competitiveness in emerging memories is becoming a reality. Fast write speed and low read-access time are the potential benefits of these emerging memories. At ISSCC 2014, a high-density ReRAM with a buried WL access device is introduced to improve the write performance and area. The next figure highlights how MLC NAND Flash write throughput continues to improve. However, while the following figure shows no increase in NAND Flash density over the past year, recent devices are built with finer dimensions and more sophisticated 3-dimensional vertical bit cells.
[Figure captions: per-pin data rate of common I/O standards; High Bandwidth Memory; processor scaling trends, 2x/1.5 yrs (data source: http://cpudb.stanford.edu/)]