Ryousei Takano
National Institute of Advanced Industrial Science and Technology (AIST)
Tomohiro Kudoh
University of Tokyo
"Majorca at MIT" Workshop @ Boston, 28 July 2015
Flow-centric Computing
A Datacenter Architecture in the Post-Moore Era
IMPULSE: Initiative for Most Power-efficient
Ultra-Large-Scale data Exploration
[Figure: IMPULSE roadmap (2014 → 2020 → 2030): from separated packages, through 2.5D stacked packages, to 3D stacked packages integrating logic, NVRAM, and I/O, connected by an optical network in the future data center.]

•  High-Performance Logic Architecture
  –  3D build-up integration of the front-end circuits, including high-mobility Ge-on-insulator FinFETs / AIST-original TCAD
•  Non-Volatile Memory
  –  Voltage-controlled magnetic RAM, mainly for cache and work memories
•  Optical Network
  –  Silicon photonics cluster switch
  –  Optical interconnect technologies
  –  Future data center architecture design / flow-centric computing
Outline
•  Data centers in the post-Moore era
  –  Issues: CMOS scaling, data movement
  –  Approach: integration of heterogeneous task-specific processors into a single system
  –  Expectations for optical networks
•  Proposed architecture: Flow-centric computing
The Post-Moore Era
•  CMOS scaling is ending and "the free lunch is over"
  –  CPU performance, amount of memory
TOP500 Performance Development

[Figure: exponential growth of supercomputing power as recorded by the TOP500 list, 1994–2014, on a logarithmic scale from 100 MFlop/s to 10 EFlop/s, with curves for the sum, #1, and #500 systems. Source: http://top500.org/]

The end of Moore's law? To keep sustainable performance scaling, a clean-slate architecture is required.
The Post-Moore Era
•  The key to "more than Moore" is diversification
  –  Multiple kinds of task-type-specific processing units
    • cf. GPGPU, FPGA, neuromorphic chips, etc.
  –  Integration and utilization of heterogeneous task-specific processing units in a system is the key.

Source: Fig. 3 of "More-than-Moore White Paper," ITRS white paper, 2010
Data Movement Issue
•  Data movement is the bottleneck to performance
  –  Memory wall, interconnect wall, and power wall
  –  E.g., the K computer [Byte/FLOP]:
    • Memory: 0.5
    • Interconnect: 0.15625
•  Current solution: data affinity processing (DAP)
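The K computer ratios above are simple quotients of bandwidth over peak compute. A quick check, assuming the commonly cited per-node figures of 128 GFLOPS peak, 64 GB/s memory bandwidth, and 20 GB/s interconnect injection bandwidth (the bandwidth figures are assumptions consistent with the ratios on the slide, not taken from it):

```python
def byte_per_flop(bandwidth_bytes_per_s: float, flops: float) -> float:
    """Bytes that can be moved per floating-point operation at peak rates."""
    return bandwidth_bytes_per_s / flops

PEAK_FLOPS = 128e9  # K computer: assumed 128 GFLOPS peak per node
MEM_BW = 64e9       # assumed memory bandwidth, 64 GB/s
NET_BW = 20e9       # assumed interconnect injection bandwidth, 20 GB/s

print(byte_per_flop(MEM_BW, PEAK_FLOPS))  # → 0.5
print(byte_per_flop(NET_BW, PEAK_FLOPS))  # → 0.15625
```

The interconnect can thus feed only about a third as many bytes per FLOP as local memory, which is why DAP tries to avoid crossing the network at all.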
Data Affinity Processing (DAP)
•  Avoid moving data; do the processing where the data are.
  –  The cost of moving data has been higher than that of running different algorithms/programs on a processor.
  –  Assumes versatile processors that can perform multiple functions; sometimes a square peg in a round hole.
•  Still, some communication among processors is required if there are multiple processors.
•  For larger data sets, the amount of memory deployable near a processor limits the effectiveness of DAP.
•  DAP is not suitable for processing streaming data.
Revolutionary high-speed communication will change the scenario
If the cost of moving data is lower (data can be moved in a short time with low energy consumption):
  –  Data can be moved to the processing units appropriate for the required processing.
  –  Heterogeneous processing units can be utilized.
  –  The amount of memory can be virtually increased: data in a memory near a processor can be swapped faster than the processor consumes them.
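The "virtually increase the amount of memory" point holds when the next working set can be streamed in over the link no slower than the processor finishes with the current one, so the swap hides behind computation. A minimal feasibility check with hypothetical numbers (the 1 Tb/s link, working-set size, and arithmetic intensity are illustrative assumptions, not figures from the talk):

```python
def swap_is_hidden(working_set_bytes: float, link_bw_Bps: float,
                   flops_per_byte: float, peak_flops: float) -> bool:
    """True if fetching the next working set over the network takes no
    longer than processing the current one, i.e. the swap can be fully
    overlapped with computation (double buffering)."""
    t_swap = working_set_bytes / link_bw_Bps
    t_compute = working_set_bytes * flops_per_byte / peak_flops
    return t_swap <= t_compute

# Hypothetical example: 1 GiB working set, 1 Tb/s (125 GB/s) optical link,
# a kernel doing 2 FLOPs per byte on a 100 GFLOPS processor.
print(swap_is_hidden(2**30, 125e9, 2.0, 100e9))  # → True
```

With a less compute-intensive kernel (e.g. 0.5 FLOPs per byte) the same link can no longer hide the swap, which is exactly where the huge-bandwidth optical network changes the trade-off.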
Our Approach
Combine fine-grained task-specific processors in a pipeline manner, instead of general-purpose CPUs with large memory.

[Figure: (a) a general-purpose CPU transforming input data into output data, versus (b) a chain of heterogeneous task-specific processors passing the data flow from stage to stage.]
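The pipeline in (b) can be sketched as a chain of streaming stages, each standing in for one task-specific processor; data flows through without ever being materialized in one large memory. The stage functions and names are illustrative only, not part of the proposal:

```python
from typing import Callable, Iterable, Iterator

Stage = Callable[[Iterator], Iterator]

def make_stage(fn: Callable) -> Stage:
    """Wrap a per-item function as a streaming stage
    (stand-in for one task-specific processor)."""
    def stage(items: Iterator) -> Iterator:
        for item in items:
            yield fn(item)
    return stage

def pipeline(source: Iterable, *stages: Stage) -> Iterator:
    """Chain stages so each item streams through all of them in turn."""
    flow: Iterator = iter(source)
    for stage in stages:
        flow = stage(flow)
    return flow

# Two illustrative stages standing in for two specialized processors.
decode = make_stage(lambda x: x * 2)
transform = make_stage(lambda x: x + 1)
result = list(pipeline(range(4), decode, transform))
print(result)  # → [1, 3, 5, 7]
```

Because each stage only holds the item in flight, the memory footprint per processor stays small; the cost moves to the links between stages, hence the bandwidth demands discussed next.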
Expectations for Optical Networks
•  Such a pipeline requires huge bandwidth between processors
•  Expectations for the optical network:
  –  Interconnect over DWDM
  –  Direct optical I/O connection to memory

[Figure: two processor-memory pairs linked by an interconnect over DWDM, with direct optical I/O connections to memory.]
Flow-centric Computing
•  Disaggregate server components
•  Reassemble a slice based on a data flow
  –  A slice is a "virtual" server consisting of processors and SCM connected through an optical-path network (an all-optical path end to end)

[Figure: real-time and big-data flows mapped as slices across GPUs, FPGAs, and storage-class memory over the optical network.]
Flow-centric Computing
•  To take advantage of the optical-path network, a totally new data center OS is essential
•  Flow OS is the control plane of a data center
  –  Optimal flow planning and slice provisioning
  –  Resource management / monitoring

[Figure: Flow OS sitting above the optical network, planning real-time and big-data flows onto slices of GPUs, FPGAs, and storage-class memory.]
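As a toy sketch of the slice-provisioning role, one can imagine Flow OS taking a flow described as a sequence of stage kinds and reserving one free device of each kind as a slice. All class, method, and device names here are hypothetical illustrations under assumed semantics, not the actual Flow OS interface:

```python
from dataclasses import dataclass, field

@dataclass
class FlowOS:
    """Toy control plane: maps a data flow onto free disaggregated devices."""
    # Hypothetical inventory of free resources by kind.
    free: dict = field(default_factory=lambda: {
        "GPU": ["gpu0", "gpu1", "gpu2"],
        "FPGA": ["fpga0", "fpga1"],
        "SCM": ["scm0"],
    })

    def provision_slice(self, flow: list) -> list:
        """Reserve one free device per stage of the flow; the returned
        list is the 'slice' to be wired together with optical paths."""
        devices = []
        for kind in flow:
            if not self.free.get(kind):
                raise RuntimeError(f"no free {kind} for this flow")
            devices.append(self.free[kind].pop(0))
        return devices

os_ = FlowOS()
print(os_.provision_slice(["SCM", "GPU", "FPGA", "GPU"]))
# → ['scm0', 'gpu0', 'fpga0', 'gpu1']
```

The real planning problem also has to choose optical paths and balance flows, which is why it is framed as "optimal flow planning" rather than simple first-fit allocation as in this sketch.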
Hardware Image

[Figure: an optical switch connecting data processing components, DPC (CPU), DPC (GPU), DPC (FPGA), and SCM; each DPC packages multiple processing units (PU) with memory modules (MEM), communicating over DWDM.]

•  Processor-memory embedded package with WDM interconnects
•  Direct optical I/O connections to memory modules distributed on a chip
Optical Network in DCs
•  New data center architectures with a similar concept have been proposed: "disaggregation" and "data center in a box".
•  Optical networks are key to driving innovation in future data centers.

[Figure: FireBox overview (UC Berkeley): up to 1000 SoCs with high-bandwidth memory (100,000 cores total) and up to 1000 non-volatile memory modules (100 PB total), connected by 1 Terabit/sec optical fibers; the inter-box network provides many short paths through high-radix switches.]
Summary
•  Optical networks can change the game in the post-Moore era.
•  We have proposed flow-centric computing, a disaggregated data center architecture focusing on data flows.
  –  Data processing with heterogeneous integration of special-purpose processors communicating through a huge-bandwidth optical network