From Classical to
Runtime Aware
Architectures
Madrid, 25 April 2017 (Syec Workshop, 25-26 April)
Prof. Mateo Valero
BSC Director
Postgraduate Courses
Technological Achievements
Transistor (Bell Labs, 1947)
• DEC PDP-1 (1957)
• IBM 7090 (1960)
Integrated circuit (1958)
• IBM System 360 (1965)
• DEC PDP-8 (1965)
Microprocessor (1971)
• Intel 4004
Birth of the Revolution – The Intel 4004
Introduced November 15, 1971
108 KHz, 50 KIPS, 2,300 transistors (10 μm process)
Sunway TaihuLight
• SW26010 processor
(Chinese design, ISA, & fab)
• 1.45 GHz
• Node = 260 Cores (1 socket)
• 4 – core groups
• 32 GB memory
• 40,960 nodes in the system
• 10,649,600 cores total
• 1.31 PB of primary memory (DDR3).
• 125.4 Pflop/s theoretical peak
• 93 Pflop/s HPL, 74% peak
• 15.3 Mwatts water cooled
• 3 of the 6 finalists for
Gordon Bell Award@SC16
Top 500 Supercomputers - November 2016
(Rmax and Rpeak in GFlop/s, Power in kW)
1. Sunway TaihuLight, National Supercomputing Center in Wuxi: Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway. 10,649,600 cores; Rmax 93,014,593.9; Rpeak 125,435,904; 15,371 kW; 6,051.3 MFlops/W
2. Tianhe-2 (MilkyWay-2), National Super Computer Center in Guangzhou: TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P. 3,120,000 cores; Rmax 33,862,700; Rpeak 54,902,400; 17,808 kW; 1,901.5 MFlops/W
3. Titan, DOE/SC/Oak Ridge National Laboratory: Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x. 560,640 cores; Rmax 17,590,000; Rpeak 27,112,550; 8,209 kW; 2,142.8 MFlops/W
4. Sequoia, DOE/NNSA/LLNL: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom. 1,572,864 cores; Rmax 17,173,224; Rpeak 20,132,659.2; 7,890 kW; 2,176.6 MFlops/W
5. Cori, DOE/SC/LBNL/NERSC: Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect. 622,336 cores; Rmax 14,014,700; Rpeak 27,880,653; 3,939 kW; 3,557.9 MFlops/W
6. Oakforest-PACS, Joint Center for Advanced High Performance Computing: PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path. 556,104 cores; Rmax 13,554,600; Rpeak 24,913,459; 2,718.7 kW; 4,985.7 MFlops/W
7. K computer, RIKEN Advanced Institute for Computational Science (AICS): K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect. 705,024 cores; Rmax 10,510,000; Rpeak 11,280,384; 12,659.9 kW; 830.2 MFlops/W
8. Piz Daint, Swiss National Supercomputing Centre (CSCS): Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100. 206,720 cores; Rmax 9,779,000; Rpeak 15,987,968; 1,312 kW; 7,453.5 MFlops/W
9. Mira, DOE/SC/Argonne National Laboratory: BlueGene/Q, Power BQC 16C 1.60GHz, Custom. 786,432 cores; Rmax 8,586,612; Rpeak 10,066,330; 3,945 kW; 2,176.6 MFlops/W
10. Trinity, DOE/NNSA/LANL/SNL: Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect. 301,056 cores; Rmax 8,100,900; Rpeak 11,078,861; 4,232.6 kW; 1,913.9 MFlops/W
Performance Development of HPC
over the Last 23 Years from the Top500
[Chart: Top500 performance 1994-2016 on a log scale from 100 MFlop/s to 1 EFlop/s. N=1 grew from 59.7 GFlop/s to 93 PFlop/s, N=500 from 400 MFlop/s to 286 TFlop/s, and the SUM from 1.17 TFlop/s to 567 PFlop/s.]
Supercomputer Performance Road Map
[Timeline chart, 1987-2007]
Our origins… Plan Nacional de Investigación (Spanish National Research Plan)
High-performance Computing group @ Computer Architecture Department (UPC)
Projects funded under the Plan Nacional (from relevance to excellence):
• High-speed Low-cost Parallel Architecture Design (PA85-0314)
• Parallelism Exploitation in High Speed Architectures (TIC89-299)
• Architectures and Compilers for Supercomputers (TIC92-880)
• High Performance Computing (TIC95-429)
• High Performance Computing II (TIC98-511-C02-01)
• High Performance Computing III (TIC2001-995-C02-01)
• High Performance Computing IV (TIN2004-07739-C02-01)
• High Performance Computing V (TIN2007-60625, 2008-2011)
• High Performance Computing VI (TIN2012-34557, 2012-2015)
Centres: CEPBA, CIRI, BSC
Industrial partners: COMPAQ, INTEL, MICROSOFT, IBM, INTEL (Exascale), NVIDIA, REPSOL, SAMSUNG, IBERDROLA
We have come a long way…
[Timeline of systems at CEPBA/CIRI/BSC, 1985-2010]
IBM PP970 / Myrinet
MareNostrum
42.35, 94.21 Tflop/s
IBM RS-6000 SP & IBM p630
192+144 Gflop/s
SGI Origin 2000
32 Gflop/s
Connection Machine CM-200
0,64 Gflop/s
Convex C3800
Compaq GS-140
12.5 Gflop/s
Compaq GS-160
23.4 Gflop/s
Parsys Multiprocessor Parsytec CCi-8D
4.45 Gflop/s BULL NovaScale 5160
48 Gflop/s
Research prototypes
Transputer cluster
SGI Altix 4700
819.2 Gflops
SL8500
6 Petabytes
Maricel
14.4 Tflops, 20 KW
Barcelona Supercomputing Center
Centro Nacional de Supercomputación
Spanish Government 60%
Catalonian Government 30%
Univ. Politècnica de Catalunya (UPC) 10%
BSC-CNS is
a consortium
that includes
BSC-CNS objectives
Supercomputing services
to Spanish and
EU researchers
R&D in Computer,
Life, Earth and
Engineering Sciences
PhD programme,
technology transfer,
public engagement
Barcelona Supercomputing Center
Centro Nacional de Supercomputación
475 people from
44 countries
*31st of December 2016
Competitive
project funding
secured
(2005 to 2017)
Total 144.8 M€ (information compiled 16/01/2017)
Europe 71.9 M€
National 34 M€
Companies 38.9 M€
The MareNostrum 3 Supercomputer
Over 10^15 Floating Point Operations per second
70% PRACE 24% RES 6% BSC-CNS
3 PB
of disk storage
100.8 TB
of main memory
Nearly
50,000 cores
The MareNostrum 4 Supercomputer
Total peak performance
13.7 Pflop/s
12 times more powerful than MareNostrum 3
Compute
General Purpose, for current BSC workload
More than 11 Pflop/s
With 3,456 nodes of Intel Xeon V5 processors
Emerging Technologies, for evaluation
of 2020 Exascale systems
3 systems, each of more than 0.5 Pflop/s
with KLN/KNH, Power+NVIDIA, ARMv8
Storage
More than 10 PB of GPFS
Elastic Storage System
Network
IB EDR/OPA
Ethernet
Operating System: SuSE
Mission of BSC Scientific Departments
Earth
Sciences
CASE
Computer
Sciences
Life
Sciences
To influence the way machines are built, programmed and
used: computer architecture, programming models,
performance tools, Big Data, Artificial Intelligence
To develop and implement global and
regional state-of-the-art models for short-
term air quality forecast and long-term
climate applications
To understand living organisms by means of
theoretical and computational methods
(molecular modeling, genomics, proteomics)
To develop scientific and engineering software to
efficiently exploit supercomputing capabilities
(biomedical, geophysics, atmospheric, energy, social
and economic simulations)
Design of Superscalar Processors
Simple interface
Sequential
program
ILP
ISA
Programs
“decoupled”
from hardware
Applications
Decoupled from the software stack
Latency Has Been a Problem from the Beginning…
• Feeding the pipeline with the right instructions:
• HW/SW trace cache (ICS’99)
• Prophet/Critic Hybrid Branch Predictor (ISCA’04)
• Locality/reuse
• Cache Memory with Hybrid Mapping (IASTED'87) → Victim Cache
• Dual Data Cache (ICS'95)
• A novel renaming mechanism that boosts software prefetching (ICS’01)
• Virtual-Physical Registers (HPCA’98)
• Kilo Instruction Processors (ISHPC03,HPCA’06, ISCA’08)
[Pipeline diagram: Fetch → Decode → Rename → Instruction Window → Wakeup+Select → Register File → Bypass → Data Cache → Register Write → Commit]
… and the Power Wall Appeared Later
• Better Technologies
• Two-level organization (Locality Exploitation)
• Register file for Superscalar (ISCA’00)
• Instruction queues (ICCD’05)
• Load/Store Queues (ISCA’08)
• Direct Wakeup, Pointer-based Instruction Queue Design (ICCD’04,
ICCD’05)
• Content-aware register file (ISCA’09)
• Fuzzy computation (ICS'01, IEEE CAL'02, IEEE-TC'05), currently known as Approximate Computing
[Pipeline diagram as above]
Fuzzy Computation
Trading accuracy and size for performance at low power
[Images: original binary bitmap (bmp), compression protocols (jpeg), fuzzy computation]
The fuzzy-computed image needed only ~85% of the time while consuming ~75% of the power; the other image is the original.
SMT and Memory Latency…
• Simultaneous Multithreading (SMT)
• Benefits of SMT Processors:
• Increase core resource utilization
• Basic pipeline unchanged:
• Few replicated resources, other shared
• Some of our contributions:
• Dynamically Controlled Resource Allocation (MICRO 2004)
• Quality of Service (QoS) in SMTs (IEEE TC 2006)
• Runahead Threads for SMTs (HPCA 2008)
[Pipeline diagram as above, with resources shared among Thread 1 … Thread N]
Time Predictability (in multicore and SMT processors)
• Where is it required:
• Increasingly required in handheld/desktop devices
• Also in embedded hard real-time systems (cars, planes, trains, …)
• How to achieve it:
• Controlling how resources are assigned to co-running tasks
• Soft real-time systems
• SMT: DCRA resource allocation policy (MICRO 2004, IEEE Micro 2004)
• Multicores: Cache partitioning (ACM OSR 2009, IEEE Micro 2009)
• Hard real-time systems
• Deterministic resource ‘securing’ (ISCA 2009)
• Time-Randomised designs (DAC 2014 best paper award)
QoS space
Definition:
• Ability to provide a minimum performance to a task
• Requires biasing processor resource allocation
Vector Architectures… Memory Latency and Power
• Out-of-Order Access to Vectors (ISCA 1992, ISCA 1995)
• Command Memory Vector (PACT 1998)
• In-memory computation
• Decoupling Vector Architectures (HPCA 1996)
• Cray SX1
• Out-of-order Vector Architectures (Micro 1996)
• Multithreaded Vector Architectures (HPCA 1997)
• SMT Vector Architectures (HICS 1997, IEEE MICRO J. 1997)
• Vector register-file organization (PACT 1997)
• Vector Microprocessors (ICS 1999, SPAA 2001)
• Architectures with Short Vectors (PACT 1997, ICS 1998)
• Tarantula (ISCA 2002), Knights Corner
• Vector Architectures for Multimedia (HPCA 2001, Micro 2002)
• High-Speed Buffers Routers (Micro 2003, IEEE TC 2006)
• Vector Architectures for Databases (Micro 2012, HPCA 2015, ISCA 2016)
Statically scheduled VLIW architectures
• Power-efficient FU
• Clustering
• Widening (MICRO-98)
• μSIMD and multimedia vector units
(ICPP-05)
• Locality-aware RF
• Sacks (CONPAR-94)
• Non-consistent (HPCA95)
• Two-level hierarchical (MICRO-00)
• Integrated modulo scheduling
techniques, register allocation and spilling
(MICRO-95, PACT-96, MICRO-96, MICRO-01)
The MultiCore Era
Moore’s Law + Memory Wall + Power Wall
UltraSPARC T2
(2007)
Intel Xeon
7100 (2006)
POWER4
(2001)
Chip MultiProcessors (CMPs)
How Were Multicores Designed at the Beginning?
IBM Power4 (2001)
• 2 cores, ST
• 0.7 MB/core L2,
16MB/core L3 (off-chip)
• 115W TDP
• 10GB/s mem BW
IBM Power7 (2010)
• 8 cores, SMT4
• 256 KB/core L2
16MB/core L3 (on-chip)
• 170W TDP
• 100GB/s mem BW
IBM Power8 (2014)
• 12 cores, SMT8
• 512 KB/core L2
8MB/core L3 (on-chip)
• 250W TDP
• 410GB/s mem BW
How To Parallelize Future Applications?
• From sequential to parallel codes
• Efficient runs on manycore processors
implies handling:
• Massive amount of cores and available
parallelism
• Heterogeneous systems
• Same or multiple ISAs
• Accelerators, specialization
• Deep and heterogeneous memory hierarchy
• Non-Uniform Memory Access (NUMA)
• Multiple address spaces
• Stringent energy budget
• Load Balancing
Programmability Wall
[Diagram: manycore node with clusters of cores (C) and accelerators (A) connected by cluster interconnects, per-cluster L2, shared L3 slices, MRAM, and memory controllers to DRAM]
Living in the Programming Revolution
Multicores made the interface leak…
ISA / API
Parallel hardware
with multiple
address spaces
(hierarchy, transfer),
control flows, …
Applications
Parallel application
logic
+
Platform specificities
Applications
The efforts are
focused on
efficiently using the
underlying
hardware
ISA / API
Vision in the Programming Revolution
Need to decouple again
General purpose
Single address space
Application logic
Architecture independent
Applications
Power to the runtime
PM: High-level, clean, abstract interface
History / Strategy
SMPSs V2
~2009
GPUSs
~2009
CellSs
~2006
SMPSs V1
~2007
PERMPAR
~1994
COMPSs
~2007
NANOS
~1996
COMPSs
ServiceSs
~2010
COMPSs
ServiceSs
PyCOMPSs
~2013
OmpSs
~2008
OpenMP … 3.0 …. 4.0 ….
StarSs
~2008
DDT @
Parascope
~1992
2008 2013
Forerunner of OpenMP
GridSs
~2002
OmpSs
A forerunner for OpenMP
+ Prototype
of tasking
+ Task
dependences
+ Task
priorities
+ Taskloop
prototyping
+ Task reductions
+ Dependences
on taskwaits
+ OMPT impl.
+ Multidependences
+ Commutative
+ Dependences
on taskloops
today
OmpSs: data-flow execution of sequential programs
void Cholesky( float *A ) {
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k*NT+k]);
      for (i=k+1; i<NT; i++)
         strsm (A[k*NT+k], A[k*NT+i]);
      // update trailing submatrix
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++)
            sgemm( A[k*NT+i], A[k*NT+j], A[j*NT+i]);
         ssyrk (A[k*NT+i], A[i*NT+i]);
      }
   }
}

#pragma omp task inout ([TS][TS]A)
void spotrf (float *A);
#pragma omp task input ([TS][TS]A) inout ([TS][TS]C)
void ssyrk (float *A, float *C);
#pragma omp task input ([TS][TS]A,[TS][TS]B) inout ([TS][TS]C)
void sgemm (float *A, float *B, float *C);
#pragma omp task input ([TS][TS]T) inout ([TS][TS]B)
void strsm (float *T, float *B);
Decouple how we write applications from how they are executed
Write
Execute
Clean offloading to
hide architectural
complexities
OmpSs: A Sequential Program …
void vadd3 (float A[BS], float B[BS],
float C[BS]);
void scale_add (float sum, float A[BS],
float B[BS]);
void accum (float A[BS], float *sum);
for (i=0; i<N; i+=BS) // C=A+B
vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS) //sum(C[i])
accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS) // B=sum*A
scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS) // A=C+D
vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS) // E=G+F
vadd3 (&G[i], &F[i], &E[i]);
OmpSs: …Taskified…
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS],
float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS],
float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);
for (i=0; i<N; i+=BS) // C=A+B
vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS) //sum(C[i])
accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS) // B=sum*A
scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS) // A=C+D
vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS) // E=G+F
vadd3 (&G[i], &F[i], &E[i]);
[Task dependence graph of the 20 task instances; color/number indicates the order of task instantiation. Some antidependences covered by flow dependences are not drawn.]
Write
Decouple how we write from how it is executed
… and Executed in a Data-Flow Model
#pragma css task input(A, B) output(C)
void vadd3 (float A[BS], float B[BS],
float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add (float sum, float A[BS],
float B[BS]);
#pragma css task input(A) inout(sum)
void accum (float A[BS], float *sum);
[Task graph from the previous slide, now colored by a possible order of task execution]
for (i=0; i<N; i+=BS) // C=A+B
vadd3 ( &A[i], &B[i], &C[i]);
...
for (i=0; i<N; i+=BS) //sum(C[i])
accum (&C[i], &sum);
...
for (i=0; i<N; i+=BS) // B=sum*A
scale_add (sum, &E[i], &B[i]);
...
for (i=0; i<N; i+=BS) // A=C+D
vadd3 (&C[i], &D[i], &A[i]);
...
for (i=0; i<N; i+=BS) // E=G+F
vadd3 (&G[i], &F[i], &E[i]);
Write
Execute
Color/number: a possible order of task execution
OmpSs: Potential of Data Access Info
• Flat global address space seen by
programmer
• Flexibility to dynamically traverse
dataflow graph “optimizing”
• Concurrency. Critical path
• Memory access: data transfers
performed by run time
• Opportunities for automatic
• Prefetch
• Reuse
• Eliminate antidependences (rename)
• Replication management
• Coherency/consistency handled by
the runtime
• Layout changes
[Diagram: CPU cores with on-chip cache and limited off-chip bandwidth to main memory]
CellSs implementation
[Diagram: the main thread runs the user program and the CellSs PPU library, building the task graph (data-dependence analysis, data renaming) over the user data in memory; a helper thread performs scheduling, work assignment and synchronization; each SPE runs the CellSs SPU library, which stages data in and out via DMA, executes the original task code and signals finalization.]
P. Bellens, et al, “CellSs: A Programming Model for the Cell BE Architecture” SC’06.
P. Bellens, et al, “CellSs: Programming the Cell/B.E. made easier” IBM JR&D 2007
Renaming @ Cell
• Experiments on the CellSs (predecessor of OmpSs)
• Renaming to avoid anti-dependences
• Eager (similar to what superscalar designs do)
• At task instantiation time
• Lazy (similar to virtual registers)
• Just before task execution
P. Bellens, et al, “CellSs: Scheduling Techniques to Better Exploit Memory
Hierarchy” Sci. Prog. 2009
[Charts: SMPSs Stream benchmark, reduction in execution time; SMPSs Jacobi, reduction in number of renamings. Legend: main memory transfers (cold), main memory transfers (capacity), killed transfers]
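A minimal sketch of the lazy-renaming idea described above (illustrative only, not the CellSs code; block_t and rename_lazy are hypothetical names): the runtime gives a writer task a fresh physical buffer only when the task is about to run, so a write-after-read hazard never delays it.

/* Hedged sketch of lazy renaming in a task runtime. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    float  *data;             /* current physical buffer of the block   */
    size_t  bytes;
    int     readers_pending;  /* earlier tasks still reading this block */
} block_t;

/* Called just before the writer task executes (lazy), rather than at
 * task-instantiation time (eager). */
float *rename_lazy(block_t *blk) {
    if (blk->readers_pending == 0)
        return blk->data;                  /* no anti-dependence: write in place */
    float *fresh = malloc(blk->bytes);     /* new version breaks the WAR hazard  */
    memcpy(fresh, blk->data, blk->bytes);  /* copy is needed only for inout accesses */
    return fresh;                          /* readers keep using the old version */
}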
Data Reuse @ Cell
P. Bellens, et al, “CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy” Sci. Prog. 2009
Matrix-matrix multiply
• Experiments on the CellSs
• Data Reuse
• Locality arcs in dependence graph
• Good locality but high overhead → no execution-time improvement
Reducing Data Movement @ Cell
• Experiments on the CellSs (predecessor of
OmpSs)
• Bypassing / global software cache
• Distributed implementation
• @each SPE
• Using object descriptors managed atomically with
specific hardware support (line level LL-SC)
[Chart: DMA reads broken down into main memory (cold), main memory (capacity), global software cache and local software cache]
P. Bellens et al, “Making the Best of Temporal Locality: Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E.” IJHPC 2010
GPUSs implementation
• Architecture implications
• Large local store O(GB) → large task granularity → good
• Data transfers: slow, not overlapped → bad
• Cache management
• Write-through
• Write-back
• Run time implementation
• Powerful main processor and multiple cores
• Dumb accelerator (not able to perform data transfers, implement
software cache,…)
[Diagram: main thread and helper thread on the host driving slave threads, one per GPU]
E. Ayguade, et al, “An Extension of the StarSs Programming Model for Platforms with Multiple GPUs” Europar 2009
Prefetching @ multiple GPUs
• Improvements in runtime mechanisms (OmpSs +
CUDA)
• Use of multiple streams
• High asynchrony and overlap (transfers and kernels)
• Overlap kernels
• Take overheads out of the critical path
• Improvement in schedulers
• Late binding of locality aware decisions
• Propagate priorities
J. Planas et al, “Optimizing Task-based Execution Support on Asynchronous Devices.” Submitted
[Charts: Nbody and Cholesky benchmark results]
ISA / API
Runtime Aware Architectures
The runtime drives the hardware design
Applications
Runtime
PM: High-level, clean, abstract interface
Task based PM
annotated by the user
Data dependencies
detected at runtime
Dynamic scheduling
“Reuse” architectural
ideas under
new constraints
Superscalar vision at Multicore level
Programmability
Wall
Resilience Wall
Memory Wall Power Wall
Superscalar World
Out-of-Order, Kilo-Instruction Processor,
Distant Parallelism
Branch Predictor, Speculation
Fuzzy Computation
Dual Data Cache, Sack for VLIW
Register Renaming, Virtual Regs
Cache Reuse, Prefetching, Victim C.
In-memory Computation
Accelerators, Different ISA’s, SMT
Critical Path Exploitation
Resilience
Multicore World
Task-based, Data-flow Graph, Dynamic
Parallelism
Tasks Output Prediction,
Speculation
Hybrid Memory Hierarchy, NVM
Late Task Memory Allocation
Data Reuse, Prefetching
In-memory FU’s
Heterogeneity of Tasks and HW
Task-criticality
Resilience
Load Balancing and Scheduling
Interconnection Network
Data Movement
Architecture Proposals in RoMoL
[Diagram: clustered manycore with per-core L1 caches and local memories (LM), cluster interconnects, a shared L2 and L3 cache, stacked DRAM, external DRAM, and the PICOS task-management hardware]
Cluster Interconnect
- Priority-based arbitration
- By-pass routing
Runtime Support Unit
- DVFS
- Light-weight dependence tracking
- Task memoization
- Reduced data motion
Vectors
- Databases, sorting
- B-Trees
Cache Hierarchy
- LM usage
- Coherence
- Eviction policies
- Reductions
Runtime Management of Local Memories (LM)
LM Management in OmpSs
– Task inputs and outputs mapped to the LMs
– Runtime manages DMA transfers
8.7% speedup in execution time
14% reduction in power
20% reduction in network-on-chip traffic
[Chart: speedup of the hybrid LM + cache hierarchy (Hybrid) over a cache-only hierarchy (Cache) for jacobi, kmeans, md5, tinyjpeg, vec_add and vec_red]
Ll. Alvarez et al. Transparent Usage of Hybrid on-Chip Memory Hierarchies in Multicores. ISCA 2015.
Ll. Alvarez et al Runtime-Guided Management of Scratchpad Memories in Multicore Architectures. PACT 2015
OmpSs in Heterogeneous Systems
Heterogeneous systems
• Big-little processors
• Accelerators
• Hard to program
[Diagram: system with a mix of big and little cores]
Task-based programming models can adapt to these scenarios
• Detect tasks in the critical path and run them in fast cores (see the sketch after this list)
• Non-critical tasks can run in slower cores
• Assign tasks to the most energy-efficient HW component
• Runtime takes care of balancing the load
• Same performance with less power consumption
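As a hedged illustration of how this can be exposed to the runtime (my example, not from the slides), a critical-path task can be marked with a priority clause in the OmpSs/OpenMP task style used earlier, so the scheduler favours a big core for it; the task names are hypothetical.

// Hedged sketch: hinting task criticality to the scheduler.
#pragma omp task inout([TS][TS]D) priority(10)   // critical path: prefer a fast/big core
void factorize_diagonal_block (float *D);

#pragma omp task input([TS][TS]D) inout([TS][TS]B) priority(0)  // non-critical: little cores are fine
void update_block (float *D, float *B);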
Architectural Support for DVFS
Reduce overheads of software solution
– Serialization in DVFS reconfigurations
– User-kernel mode switches
Runtime Support Unit (RSU)
– Power budget
– State of cores
– Criticality of running tasks
Runtime system notifies RSU
– Start task execution
• Criticality
• Running core
– End task execution
Same algorithm for DVFS
reconfigurations
[Diagram: the runtime-system scheduler (high- and low-priority ready queues) informs the hardware RSU of task criticality and core state; the RSU drives the per-core DVFS controllers under a given power budget]
[Charts: speedup and EDP of FIFO, CATS, CATA, CATA+RSU and TurboMode]
E. Castillo et al., “CATA: Criticality Aware Task Acceleration for Multicore Processors” (IPDPS’16)
TaskSuperscalar (TaskSs) Pipeline
• Hardware design for a distributed task
superscalar pipeline frontend (MICRO’10)
• Can be embedded into any manycore fabric
• Drive hundreds of threads
• Work windows of thousands of tasks
• Fine grain task parallelism
• TaskSs components:
• Gateway (GW): Allocate resources for task meta-data
• Object Renaming Table (ORT)
• Map memory objects to producer tasks
• Object Versioning Table (OVT)
• Maintain multiple object versions
• Task Reservation Stations (TRS)
• Store and track in-flight task meta-data
• Implementing TaskSs @ Xilinx Zynq
[Diagram: TaskSs pipeline (Gateway, ORT, OVT, TRSs and Ready Queue) feeding a scheduler that dispatches tasks to the multicore fabric]
Y. Etsion et al, “Task Superscalar: An Out-of-Order Task Pipeline” MICRO-43, 2010
Architectural Support for Task Dependence
Management (TDM) with Flexible Software Scheduling
• Task creation is a bottleneck since it
involves dependence tracking
• Our hardware proposal (TDM)
• takes care of dependence tracking
• exposes scheduling to the SW
• Our results demonstrate that this
flexibility allows TDM to beat the
state-of-the-art
E. Castillo et al, “Architectural Support for Task Dependence Management with Flexible Software Scheduling” (submitted to MICRO’17)
Approximate Task Memoization (ATM)
• Approximate Task Memoization (ATM) aims at eliminating
redundant computations.
• ATM leverages runtime system metadata to identify tasks that can be memoized.
• ATM achieves 1.4x average speedup when only applying
memoization techniques (Static ATM).
• ATM achieves an increased 2.5x average speedup, with an average 0.7% accuracy loss, when adding task approximation (Dynamic ATM).
I. Brumar et al, ATM: Approximate Task Memoization in the Runtime System (IPDPS’17)
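A minimal sketch of the memoization idea behind ATM (illustrative only, not the ATM implementation; task_cache and hash_inputs are hypothetical names): hash a task's inputs and, on a hit, return the stored output instead of re-executing the task.

/* Hedged sketch of task memoization in C. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 1024
#define OUT_ELEMS   16

typedef struct { uint64_t key; int valid; float out[OUT_ELEMS]; } memo_entry_t;
static memo_entry_t task_cache[CACHE_SLOTS];

static uint64_t hash_inputs(const float *in, size_t n) {
    uint64_t h = 1469598103934665603ull;               /* FNV-1a over the raw bytes */
    const unsigned char *p = (const unsigned char *)in;
    for (size_t i = 0; i < n * sizeof(float); i++) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

void run_task_memoized(const float *in, size_t n, float out[OUT_ELEMS],
                       void (*task)(const float *, size_t, float *)) {
    uint64_t h = hash_inputs(in, n);
    memo_entry_t *e = &task_cache[h % CACHE_SLOTS];
    if (e->valid && e->key == h) {                     /* hit: skip the computation */
        memcpy(out, e->out, sizeof e->out);
        return;
    }
    task(in, n, out);                                  /* miss: execute and record  */
    e->key = h; e->valid = 1;
    memcpy(e->out, out, sizeof e->out);
}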
Exploiting the Task Dependency Graph
(TDG) to Reduce Coherence Traffic
• To reduce coherence traffic, the
state-of-the-art applies round-robin
mechanisms at the runtime level.
• Exploiting the information
contained at the TDG level is
effective to
• improve performance
• dramatically reduce coherence
traffic (2.26x reduction with respect
to the state-of-the-art).
State-of-the-art Partition (DEP)
Gauss-Seidel TDG
DEP requires ~200 GB of data transfer across a 288-core system
Exploiting the Task Dependency Graph
(TDG) to Reduce Coherence Traffic
• To reduce coherence traffic, the state-
of-the-art applies round-robin
mechanisms at the runtime level.
• Exploiting the information contained at
the TDG level is effective to
• improve performance
• dramatically reduce coherence traffic
(2.26x reduction with respect to the
state-of-the-art).
Graph Algorithms-Driven Partition (RIP-DEP)
Gauss-Seidel TDG
RIP-DEP requires ~90 GB of data transfer across a 288-core system
I. Sánchez et al, Reducing Data Movements on Shared Memory
Architectures (submitted to SC’17)
Dealing with a New Form Of Heterogeneity
• Manufacturing Variability of CPUs – Different
power consumption
• Power variability becomes performance
heterogeneity in power constrained
environments
• Typical load-balancing may not be
sufficient
• Redistributing power and number
of active cores among sockets can
improve performance
[Chart: performance under an even vs. an optimal power distribution]
Dynamic Analysis and Exploration
• Statically trying all configurations is not practical
• Huge overhead (one execution for each configuration)
• Has to be performed on each node
• Online analysis: Try multiple configurations in a single run.
[Charts: performance benefit and energy savings of the online exploration]
D. Chasapis et al, Runtime-Guided Mitigation of Manufacturing Variability
in Power-Constrained Multi-Socket NUMA Nodes (ICS’16)
Introduction - A New Form Of Heterogeneity
• Platform: 2 x sockets with 12 core Intel Xeon E5-2695v2
• Power variability becomes performance heterogeneity in
power constrained environments
Hash Join, Sorting, Aggregation, DBMS
• Goal: vector acceleration of databases
• “Real vector” extensions to x86
• Pipeline operands to the functional unit (like Cray machines,
not like SSE/AVX)
• Scatter/gather, masking, vector length register
• Implemented in PTLSim + DRAMSim2
• Hash join work published in MICRO 2012
• 1.94x (large data sets) and 4.56x (cache resident data sets)
of speedup for TPC-H
• Memory bandwidth is the bottleneck
• Sorting paper published in HPCA 2015
• Compare existing vectorized quicksort, bitonic mergesort,
radix sort on a consistent platform
• Propose novel approach (VSR) for vectorizing radix sort with
2 new instructions
• Similarity with AVX512-CD instructions
(but cannot use Intel’s instructions because the
algorithm requires strict ordering)
• Small CAM
• 3.4x speedup over next-best vectorised algorithm with the
same hardware configuration due to:
• Transforming strided accesses to unit-stride
• Eliminating replicated data structures
• Ongoing work on aggregations
• Reduction to a group of values, not a single scalar value
ISCA 2016
• Building from VSR work
[Chart: speedup over the scalar baseline for quicksort, bitonic mergesort, radix sort and VSR sort, for maximum vector lengths (MVL) of 8 to 64 and 1, 2 or 4 vector lanes]
Overlap Communication and Computation
• Hybrid MPI/OmpSs: Linpack example
• Extend asynchronous data-flow
execution to outer level
• Taskify MPI communication primitives
• Automatic lookahead
• Improved performance
• Tolerance to network bandwidth
• Tolerance to OS noise
[Execution timelines for processes P0, P1, P2]
V. Marjanovic et al, “Overlapping Communication and Computation by using a
Hybrid MPI/SMPSs Approach” ICS 2010
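A hedged sketch of the taskification idea (not the Linpack code from the paper; block sizes and names are illustrative): wrapping MPI primitives in tasks with data dependences lets the runtime reorder and overlap them with compute tasks, which is what gives the automatic lookahead.

/* Hedged sketch: taskified MPI communication in the OmpSs/SMPSs style. */
#pragma omp task input([BS]block)
void send_block (float *block, int dest) {
    MPI_Send(block, BS, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
}

#pragma omp task output([BS]block)
void recv_block (float *block, int src) {
    MPI_Recv(block, BS, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

#pragma omp task inout([BS]block)
void update_block (float *block);

/* The runtime interleaves recv, update and send tasks of different
 * iterations, overlapping communication with computation. */
for (int it = 0; it < ITERS; it++) {
    recv_block(halo, left);
    update_block(halo);
    send_block(halo, right);
}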
Effects on Bandwidth
Flattening the communication pattern, thus reducing bandwidth requirements
*simulation on application with
ring communication pattern
V. Subotic et al. “Overlapping communication and computation by
enforcing speculative data-flow”, January 2008, HiPEAC
Related Work
• Rigel Architecture (ISCA 2009)
• No L1D, non-coherent L2, read-only, private and cluster-shared data
• Global accesses bypass the L2 and go directly to L3
• SARC Architecture (IEEE MICRO 2010)
• Throughput-aware architecture
• TLBs used to access remote LMs and migrate data across LMs
• Runnemede Architecture (HPCA 2013)
• Coherence islands (SW managed) + Hierarchy of LMs
• Dataflow execution (codelets)
• Carbon (ISCA 2007)
• Hardware scheduling for task-based programs
• Holistic run-time parallelism management (ICS 2013)
• Runtime-guided coherence protocols (IPDPS 2014)
RoMoL … papers
• V. Marjanovic et al., “Effective communication and computation overlap with
hybrid MPI/SMPSs.” PPoPP 2010
• Y. Etsion et al., “Task Superscalar: An Out-of-Order Task Pipeline.” MICRO 2010
• N. Vujic et al., “Automatic Prefetch and Modulo Scheduling Transformations for
the Cell BE Architecture.” IEEE TPDS 2010
• V. Marjanovic et al., “Overlapping communication and computation by using a
hybrid MPI/SMPSs approach.” ICS 2010
• T. Hayes et al., “Vector Extensions for Decision Support DBMS Acceleration”.
MICRO 2012
• L. Alvarez et al., “Hardware-software coherence protocol for the coexistence of
caches and local memories.” SC 2012
• M. Valero et al., “Runtime-Aware Architectures: A First Approach”. SuperFRI
2014
• L. Alvarez et al., “Hardware-Software Coherence Protocol for the Coexistence of
Caches and Local Memories.” IEEE TC 2015
RoMoL … papers
• M. Casas et al., “Runtime-Aware Architectures”. Euro-Par 2015.
• T. Hayes et al., “VSR sort: A novel vectorised sorting algorithm & architecture
extensions for future microprocessors”. HPCA 2015
• K. Chronaki et al., “Criticality-Aware Dynamic Task Scheduling for
Heterogeneous Architectures”. ICS 2015
• L. Alvarez et al., “Coherence Protocol for Transparent Management of
Scratchpad Memories in Shared Memory Manycore Architectures”. ISCA 2015
• L. Alvarez et al., “Run-Time Guided Management of Scratchpad Memories in
Multicore Architectures”. PACT 2015
• L. Jaulmes et al., “Exploiting Asynchrony from Exact Forward Recoveries for DUE
in Iterative Solvers”. SC 2015
• D. Chasapis et al., “PARSECSs: Evaluating the Impact of Task Parallelism in the
PARSEC Benchmark Suite.” ACM TACO 2016.
• E. Castillo et al., “CATA: Criticality Aware Task Acceleration for Multicore
Processors.” IPDPS 2016
RoMoL … papers
• T. Hayes et al “Future Vector Microprocessor Extensions for Data
Aggregations.” ISCA 2016.
• D. Chasapis et al., “Runtime-Guided Mitigation of Manufacturing Variability
in Power-Constrained Multi-Socket NUMA Nodes.” ICS 2016
• P. Caheny et al., “Reducing cache coherence traffic with hierarchical
directory cache and NUMA-aware runtime scheduling.” PACT 2016
• T. Grass et al., “MUSA: A multi-level simulation approach for next-
generation HPC machines.” SC 2016
• I. Brumar et al., “ATM: Approximate Task Memoization in the Runtime
System.” IPDPS 2017
• K. Chronaki et al., “Task Scheduling Techniques for Asymmetric Multi-Core
Systems.” IEEE TPDS 2017
• C. Ortega et al., “libPRISM: An Intelligent Adaptation of Prefetch and SMT
Levels.” ICS 2017
Roadmaps to Exaflop
From Tianhe-2..
…to Tianhe-2A
with domestic
technology.
From K computer…
… to Post K
with domestic
technology.
From the PPP for HPC…
…to future PRACE systems…
…with domestic technology.
IPCEI on HPC
?
HPC is a global competition
“The country with the strongest computing capability
will host the world’s next scientific breakthroughs”.
US House Science, Space and Technology Committee Chairman
Lamar Smith (R-TX)
“Our goal is for Europe to become one of the top 3
world leaders in high-performance computing by 2020”.
European Commission President
Jean-Claude Juncker (27 October 2015)
“Europe can develop an exascale machine with
ARM technology. Maybe we need an …
consortium for HPC and Big Data”.
Seymour Cray Award Ceremony Nov. 2015
Mateo Valero
HPC: a disruptive technology for Industry
“…Europe has a unique opportunity to act and
invest in the development and deployment of High
Performance Computing (HPC) technology, Big
Data and applications to ensure the
competitiveness of its research and its industries.”
Günther Oettinger, Digital Economy & Society
Commissioner
“The transformational impact of
excellent science in research and
innovation”
Final plenary panel at ICT - Innovate, Connect,
Transform conference, 22 Oct 2015, Lisbon.
BSC and the EC
“Europe needs to develop an entire
domestic exascale stack from the
processor all the way to the system
and application software“
Mateo Valero, Director of Barcelona
Supercomputing Center
Final plenary panel at ICT - Innovate,
Connect, Transform conference, 22
October 2015 Lisbon, Portugal.
“The transformational impact of excellent science in research and innovation”
Mont-Blanc HPC Stack for ARM
Industrial applications
System software
Hardware
Applications
512 RISC-V cores in 64 clusters, 16 GF/core: 8 TF
4 HBM stacks (16GB, 1TB/s each): 64GB @ 4TB/s
16 custom SCM/Flash channels (1TB, 25GB/s each): 16TB @ 0.4TB/s
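The headline numbers follow from the per-unit figures (my arithmetic, not from the slide): 512 cores x 16 GFlop/s = 8,192 GFlop/s ≈ 8 TF; 4 HBM stacks x 16 GB and x 1 TB/s give 64 GB at 4 TB/s; 16 channels x 1 TB and x 25 GB/s give 16 TB at 0.4 TB/s.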
BSC Accelerator RISC-V ISA
Vector Unit
· 2048b vector
· 512b alu (4clk/op)
1 GHz @ Vmin
OOO
4w Fetch
· 64KB I$
· Decoupled I$/BP
· 2 level BP
· Loop Stream Detector
4w Rename/Retire
D$
· 64KB
· 64B/line
· 128 in-flight misses
· Hardware prefetch
1MB L2 per core
D$ to L2
· 1x512b read
· 1x512b write
L2 to mesh
· 1x512b read
· 1x512b write
Cluster holds snoop
filter
[Package diagrams: the compute die (“cyclone”) sits on a silicon interposer together with 4 memory stacks and 16 Flash channels, all mounted on the package substrate. Floorplan: an 8x8 grid of core clusters (C) surrounded by memory and I/O interface tiles (A/D/H).]
Do we need an … type of consortium for HPC and Big Data?
A window of opportunity is open:
• Basic industrial and scientific know-how is available
• Excellent funding opportunities exist in H2020 at European level
and in the member state structural funds
It’s time to invest in large Flagship projects
for HPC to gain critical mass
HPC European strategy & Innovation
http://ec.europa.eu/commission/2014-2019/oettinger/blog/mateo-valero-
director-barcelona-supercomputing-center_en
A New Era of Information Technology
Current infrastructure sagging under its own weight
Pervasive Connectivity, Explosion of Information, Smart Device Expansion
In 60 seconds today: 400,710 ad requests; 2,000 lyrics played on Tunewiki; 1,500 pings sent on PingMe; 208,333 minutes of Angry Birds played; 23,148 apps downloaded; 98,000 tweets
2013: 30 billion devices; by 2020: 40 trillion GB of data for 8 billion people; 10 million mobile apps (sources 1-4 below)
(1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood ; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
Internet of Things
HPC European strategy & Innovation
The Data Deluge
[Chart: the digital universe grows from 0.1 ZB (2005) and 1.2 ZB (2010) to 2.8 ZB (2012), 8.5 ZB (2015) and a forecast 40 ZB in 2020, a figure that exceeds prior forecasts by 5 ZB]
* Source: IDC
How big is big?
• Megabyte (10^6)
• Gigabyte (10^9)
• Terabyte (10^12): 500 TB of new data per day are ingested in Facebook databases
• Petabyte (10^15): the CERN Large Hadron Collider generates 1 PB per second
• Exabyte (10^18): 1 EB of data is created on the internet each day, i.e. 250 million DVDs worth of information; the proposed Square Kilometre Array telescope will generate an EB of data per day
• Zettabyte (10^21): 1.3 ZB of network traffic by 2016; this is our digital universe today, about 250 trillion DVDs
• Yottabyte (10^24)
• Brontobyte (10^27): this will be our digital universe tomorrow…
• Geopbyte (10^30): this will take us beyond our decimal system
• Saganbyte, Jotabyte, …
Higgs and Englert’s Nobel for Physics 2013
Last year one of the most computer-intensive scientific
experiments ever undertaken confirmed Peter Higgs and
François Englert’s theory by making the Higgs boson (the so-called “God particle”) in an $8bn atom smasher, the Large Hadron Collider at CERN outside Geneva.
“ the LHC produces 600 TB/sec… and after filtering needs to
store 25 PB/year”… 15 million sensors….
Big Data in Biology
• High resolution imaging
• Clinical records
• Simulations
• Omics
Sequencing Costs
Source: National Human Genome Research Institute (NHGRI)
http://www.genome.gov/sequencingcosts/
(1) "Cost per Megabase of DNA Sequence" — the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality
(2) "Cost per Genome" - the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001
In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based chemistries and
capillary-based instruments ('first generation' sequencing platforms). Beginning in January 2008, the data represent the costs of generating DNA
sequence using 'second-generation' (or 'next-generation') sequencing platforms. The change in instruments represents the rapid evolution of DNA
sequencing technologies that has occurred in recent years.
Cognitive Computing
Collaboration agreement to jointly promote the development of advanced “deep learning” systems with applications to banking services
Example: Cognitive Computing is already in
business
• In 2011 the IBM Watson computer defeated two of Jeopardy!’s greatest champions
Since then, Watson
supercomputer has
become 24 times
faster and smarter,
90% smaller, with a
2,400% improvement
in performance
Watson Group has
collaborated with
partners to build
6,000 apps
Neural Networks
• Computational model in computer science
based on a collection of simple neural units.
• Each neural unit is connected to many others
• The strengths of these connections are expressed in terms of weights.
• Neural units compute summation functions
• NN are self-learning and can be trained
• NN are particularly good at feature detection.
• In practice, NN can be expressed in
terms of matrix-matrix multiplications.
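A minimal sketch of the last point (illustrative, not from the slides): a fully connected layer over a batch of inputs is a matrix-matrix multiplication followed by an element-wise activation.

/* Hedged sketch: a fully connected NN layer expressed as a GEMM.
 * Y[batch][n_out] = X[batch][n_in] x W[n_in][n_out] + bias, then ReLU. */
void dense_layer(int batch, int n_in, int n_out,
                 const float *X, const float *W, const float *bias, float *Y) {
    for (int b = 0; b < batch; b++)
        for (int o = 0; o < n_out; o++) {
            float acc = bias[o];
            for (int i = 0; i < n_in; i++)               /* GEMM inner product */
                acc += X[b * n_in + i] * W[i * n_out + o];
            Y[b * n_out + o] = acc > 0.0f ? acc : 0.0f;  /* ReLU activation */
        }
}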
AI+HPC (SC-2017-SLC)
“God makes them, and AI+HPC brings them together…” (after the Spanish saying “Dios los cría y ellos se juntan”)
BSC strategy for Artificial Intelligence
Social &
Personal
Data
Organ
simulation
Earth
Sciences
Industrial
CASE apps
Medical
Imaging
Genomic
Analytics
Text
Analytics
Programming models and runtimes
(PyCOMPSs, TIRAMISU, interoperability current approaches)
Data models and algorithms
(approximate computing -- reduced precision, adaptive layers, DL/Graph Analytics, …)
Precision medicine Other domains
Data platforms + standards
Projects with public/private institutions and companies
HW acceleration of DL workloads
(novel architectures for NN, FPGA acceleration)
NVIDIA Tesla P4 and P40 GPU’s (2016)
• Tesla P4
• # CUDA cores: 2560 @ 1063MHz
• Peak single precision: 5.5TFLOPS
• Peak INT8: 22 TOPS
• Low precision: 8-bit dot-product
with 32-bit accumulate
• VRAM: 8 GB GDDR5 @ 192 GB/s
• TDP: ~75W
• Tesla P40
• # CUDA cores: 3840 @ 1531MHz
• Peak single precision: 12.0TFLOPS
• Peak INT8: 47 TOPS
• Low precision: 8-bit dot-product
with 32-bit accumulate
• VRAM: 24 GB GDDR5 @ 346GB/s
• TDP: ~250W
Source: NVIDIA
Google Tensor Processing Unit (2015,
published 2017)
• 34 GB/s off-chip memory
• 28MB on-chip memory
• Frequency 700MHz
• TDP 75W
• Matrix Multiply Unit
• 256x256 MAC Units
• 8-bit multiply and adds
• 32-bit accumulators
• Peak Throughput: 92 TOPS/s
• Power Efficiency: 132 GOPS/W
• GPUs for training, TPUs for inference
• Used in the gameplay that beat the world Go champion
• Internally used at Google for Street View and the RankBrain search optimizer
Source: Google
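The peak-throughput figure follows from the array size and clock (my arithmetic, not from the TPU paper): 256 x 256 = 65,536 MACs; at 2 operations per MAC (multiply + add) and 700 MHz, 65,536 x 2 x 0.7 x 10^9 ≈ 92 x 10^12 OP/s, i.e. the quoted 92 TOPS.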
Nervana’s Lake Crest Deep Learning
Architecture (2017)
• The Lake Crest chip will operate as a Xeon Co-processor.
• Tensor-based (i. e. dense linear algebra computations)
• 4 x 8 GB HBM2 stacks on the same chip interposer @ 1 TB/s
• Each HBM has its own memory controller
• 12 Inter-Chip Links (ICL), 20x faster than PCIe
• 12 Computing Nodes featuring several cores
• Intel's new "Flexpoint" architecture within the Nodes
• Flexpoint enables 10x ILP increase and low power consumption
Source: elektroniknet
Quantitative Analysis of Deep Learning
Architectures
• Each N-neuron Hidden Layer (HL) requires a NxN GEMM
• 2D NxN Systolic Array carries out NxN GEMM in 2N+1 cycles.
[Charts: seconds per GEMM, memory bandwidth (GB/s) and throughput (OP/s) as a function of matrix size, for systolic arrays clocked at 166 MHz, 1000 MHz, 1500 MHz and 2500 MHz]
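A worked instance of the 2N+1 rule (my arithmetic, based only on the figures above): for N = 4,096 at 1 GHz, one GEMM takes 2 x 4,096 + 1 = 8,193 cycles ≈ 8.2 µs; since the GEMM contains N^3 ≈ 6.9 x 10^10 multiply-accumulates, the array sustains roughly N^2/2 ≈ 8.4 x 10^6 MACs per cycle, i.e. about 1.7 x 10^16 OP/s counting multiplies and adds.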
Quantitative Analysis of Deep Learning Architectures (cont.)
A hidden layer of ~1,024 neurons can identify simple images:
– 28x28 pixel images
– Each image contains a digit 0-9
Quantitative Analysis of Deep Learning Architectures (cont.)
A hidden layer of ~4,096 neurons can identify images containing a single concept:
– 32x32 pixel images
– Each image is classified into categories like “ship”, “cat” or “deer”
Quantitative Analysis of Deep Learning Architectures (cont.)
Lake Crest’s memory bandwidth (~TB/s) targets very large hidden layers with O(10,000-100,000) neurons; these NN are used for complex image analysis.
BSC Proposal for Deep Learning
16 2D-systolic arrays 4096x4096@1GHz: 134TOP/s
4 HBM stacks (16GB@1TB/s each): 64 GB @ 4TB/s
DDR5 SDRAM (384GB@180GB/s): 384GB @ 0.18TB/s
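One way to read the 134 TOP/s figure (my arithmetic, using the 2N+1 GEMM latency from the previous slides): each 4,096 x 4,096 array sustains about N^2/2 ≈ 8.4 x 10^6 MACs per cycle, so 16 arrays at 1 GHz deliver 16 x 8.4 x 10^6 x 10^9 ≈ 1.34 x 10^17 MAC/s ≈ 134 TOP/s.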
[Diagram: SoC on a silicon interposer with 4 HBM stacks and their memory controllers, DDR channels, a general-purpose SoC, and systolic-array blocks connected by on-chip switches]
Human Brain Project
• 10-year, 1000M€ FET flagship project
• Goal: to pull together all existing
knowledge about the human brain and
to reconstruct the brain in
supercomputer based models and
simulations.
Expected outcomes: new treatments for brain disease
and new brain-like computing technologies
BSC role: Provision and optimisation of programming
models to allow simulations to be developed efficiently
MareNostrum part of the HPC platform for simulations
View from Europe: SpiNNaker machine
• HBP platform
– 500,000 cores
– 6 cabinets
(including server)
• Launch
– 30 March 2016
IBM TrueNorth Processor
• 64*64=4096 cores
• 256 neurons/core, 64K synapses/core
• 104Kb/core memory
• 65Kb for synapse states
• 32Kb for neuron states/parameters
• 6Kb for router destination addresses
• 1Kb for axonal delays
• 20mW/cm2 power density
• 72mW at 0.75V
• 46 Billion SOPS/Watt (Synaptic Operations Per Second) typical
• 400 Billion SOPS/Watt max.
• Compared to a state-of-the-art supercomputer at 4.5 Billion FLOPS/Watt
Source: Science magazine
View from Europe: Heidelberg HICANN
• Wafer-scale analogue neuromorphic system
• 8” 180nm wafer:
• 200,000 neurons
• 50M synapses
• 10^4x faster than biology
Quantum Computing: Brave New World of
post-Moore architecture
Quantum Processors
– Dwave
– IBM
– Microsoft
– Google
– View from Europe: Delft University Prototypes
D-Wave Quantum Processor
Environment colder than space
Leverages superconducting quantum effect
1000 qubits, 128K Josephson junctions
Installed at NSA, Google, UCSB
10^8x faster than a Quantum Monte Carlo algorithm on a single core*
• Source: Denchev et al. What is the Computational Value of Finite-Range Tunneling?
Phys. Rev. August 2016
IBM
Building “Universal Quantum Computer”
Developed a Quantum Computing API to make developing quantum
applications easier
Promotes experimentation on
publicly available 5-qubit quantum
processor
Microsoft and Google
Microsoft is looking into topological quantum computing in their
global “Station Q” research consortium
Microsoft has “Quarc” lab working actively in quantum computer
architecture in Redmond.
Google manufactured a 9-qubit quantum computer in their Quantum AI Lab
Google’s ambition is to produce a viable quantum computer in the next five years*
• Mohseni et al: “Commercialize quantum technologies in five years”,
Nature (Comments) March 2017
View from Europe: Delft Quantum
Prototypes
50M Euro grant from Intel
Building hybrid CMOS/Quantum processor
Doing algorithms, compilers, architecture*
* (To appear in DAC 2017) Riesebos et al. “Pauli Frames
for Quantum Computer Architectures”
www.bsc.es
THANK YOU!
More Related Content

What's hot

08 Supercomputer Fugaku
08 Supercomputer Fugaku08 Supercomputer Fugaku
08 Supercomputer Fugaku
RCCSRENKEI
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
Ganesan Narayanasamy
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
inside-BigData.com
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
Ryousei Takano
 
04 New opportunities in photon science with high-speed X-ray imaging detecto...
04 New opportunities in photon science with high-speed X-ray imaging  detecto...04 New opportunities in photon science with high-speed X-ray imaging  detecto...
04 New opportunities in photon science with high-speed X-ray imaging detecto...
RCCSRENKEI
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit Supercomputer
VigneshwarRamaswamy
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
Ryousei Takano
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
CastLabKAIST
 
Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software research
Ryousei Takano
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
Shien-Chun Luo
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
Ryousei Takano
 
customization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLA
Shien-Chun Luo
 
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM Research
 
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen HuangAi Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
NVIDIA Taiwan
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Ganesan Narayanasamy
 
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPCExceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
inside-BigData.com
 
The Tofu Interconnect D for the Post K Supercomputer
The Tofu Interconnect D for the Post K SupercomputerThe Tofu Interconnect D for the Post K Supercomputer
The Tofu Interconnect D for the Post K Supercomputer
inside-BigData.com
 
Flow-centric Computing - A Datacenter Architecture in the Post Moore Era
Flow-centric Computing - A Datacenter Architecture in the Post Moore EraFlow-centric Computing - A Datacenter Architecture in the Post Moore Era
Flow-centric Computing - A Datacenter Architecture in the Post Moore Era
Ryousei Takano
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
Edge AI and Vision Alliance
 

What's hot (20)

08 Supercomputer Fugaku
08 Supercomputer Fugaku08 Supercomputer Fugaku
08 Supercomputer Fugaku
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 
NNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputing
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 
04 New opportunities in photon science with high-speed X-ray imaging detecto...
04 New opportunities in photon science with high-speed X-ray imaging  detecto...04 New opportunities in photon science with high-speed X-ray imaging  detecto...
04 New opportunities in photon science with high-speed X-ray imaging detecto...
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit Supercomputer
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software research
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
customization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLA
 
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
IBM and ASTRON 64-Bit Microserver Prototype Prepares for Big Bang's Big Data,...
 
Barcelona Supercomputing Center, Generador de Riqueza

  • 8. 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 20071987 1989 Our origins...Plan Nacional de Investigación High-performance Computing group @ Computer Architecture Department (UPC) Relevance High-speed Low-cost Parallel Architecture Design PA85-0314 High Performance Computing TIC95-429 Architectures and Compilers for Supercomputers TIC92-880 Parallelism Exploitation in High Speed Architectures TIC89-299 High Performance Computing II TIC98-511-C02-01 High Performance Computing III TIC2001-995-C02-01 High Performance Computing IV TIN2004-07739-C02-01 High Performance Computing VI TIN2012-34557 2008 - 2011 2012 - 20151988 High Performance Computing V TIN2007-60625 CEPBA CIRI BSC COMPAQ INTEL MICROSOFT IBM INTEL (Exascale) NVIDIA REPSOL SAMSUNG IBERDROLA Excellence
  • 9. We come from far back… (Venimos de muy lejos…) Timeline 1985-2010: IBM PPC970 / Myrinet MareNostrum 42.35, 94.21 Tflop/s; IBM RS-6000 SP & IBM p630 192+144 Gflop/s; SGI Origin 2000 32 Gflop/s; Connection Machine CM-200 0.64 Gflop/s; Convex C3800; Compaq GS-140 12.5 Gflop/s; Compaq GS-160 23.4 Gflop/s; Parsys Multiprocessor; Parsytec CCi-8D 4.45 Gflop/s; BULL NovaScale 5160 48 Gflop/s; Research prototypes: Transputer cluster; SGI Altix 4700 819.2 Gflop/s; SL8500 6 Petabytes; Maricel 14.4 Tflop/s, 20 KW
  • 10. Barcelona Supercomputing Center / Centro Nacional de Supercomputación. BSC-CNS is a consortium that includes: Spanish Government 60%, Catalonian Government 30%, Univ. Politècnica de Catalunya (UPC) 10%. BSC-CNS objectives: supercomputing services to Spanish and EU researchers; R&D in Computer, Life, Earth and Engineering Sciences; PhD programme, technology transfer, public engagement.
  • 11. Barcelona Supercomputing Center / Centro Nacional de Supercomputación. 475 people from 44 countries (*as of 31st of December 2016). Competitive project funding secured (2005 to 2017): total 144.8 M€ (information compiled 16/01/2017): Europe 71.9 M€, National 34 M€, Companies 38.9 M€.
  • 12. The MareNostrum 3 Supercomputer. Over 10^15 floating point operations per second. 70% PRACE, 24% RES, 6% BSC-CNS. 3 PB of disk storage, 100.8 TB of main memory, nearly 50,000 cores.
  • 13. The MareNostrum 4 Supercomputer. Total peak performance 13.7 Pflop/s, 12 times more powerful than MareNostrum 3. Compute: General Purpose, for the current BSC workload, more than 11 Pflop/s with 3,456 nodes of Intel Xeon v5 processors; Emerging Technologies, for evaluation of 2020 Exascale systems, 3 systems of more than 0.5 Pflop/s each, with KNL/KNH, Power+NVIDIA and ARMv8. Storage: more than 10 PB of GPFS Elastic Storage System. Network: IB EDR/OPA, Ethernet. Operating system: SuSE.
  • 14. Mission of BSC Scientific Departments Earth Sciences CASE Computer Sciences Life Sciences To influence the way machines are built, programmed and used: computer architecture, programming models, performance tools, Big Data, Artificial Intelligence To develop and implement global and regional state-of-the-art models for short- term air quality forecast and long-term climate applications To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics) To develop scientific and engineering software to efficiently exploit super-computing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)
  • 15. Design of Superscalar Processors Simple interface Sequential program ILP ISA Programs “decoupled” from hardware Applications Decoupled from the software stack
  • 16. Latency Has Been a Problem from the Beginning... • Feeding the pipeline with the right instructions: • HW/SW trace cache (ICS’99) • Prophet/Critic Hybrid Branch Predictor (ISCA’04) • Locality/reuse • Cache Memory with Hybrid Mapping (IASTED’87). Victim Cache • Dual Data Cache (ICS’95) • A novel renaming mechanism that boosts software prefetching (ICS’01) • Virtual-Physical Registers (HPCA’98) • Kilo Instruction Processors (ISHPC’03, HPCA’06, ISCA’08) Fetch Decode Rename Instruction Window Wakeup+select Register file Bypass Data Cache Register Write Commit
  • 17. … and the Power Wall Appeared Later  • Better Technologies • Two-level organization (Locality Exploitation) • Register file for Superscalar (ISCA’00) • Instruction queues (ICCD’05) • Load/Store Queues (ISCA’08) • Direct Wakeup, Pointer-based Instruction Queue Design (ICCD’04, ICCD’05) • Content-aware register file (ISCA’09) • Fuzzy computation (ICS’01, IEEE CAL’02, IEEE-TC’05). Currently known as Approximate Computing  Fetch Decode Rename Instruction Window Wakeup+ select Register file Bypass DataCache Register Write Commit
  • 18. Fuzzy computation: trading accuracy and size for performance at low power. Binary systems (bmp), compression protocols (jpeg), fuzzy computation. Image captions: one image is the original; the other one only used ~85% of the time while consuming ~75% of the power.
  • 19. SMT and Memory Latency …  • Simultaneous Multithreading (SMT) • Benefits of SMT Processors: • Increase core resource utilization • Basic pipeline unchanged: • Few replicated resources, other shared • Some of our contributions: • Dynamically Controlled Resource Allocation (MICRO 2004) • Quality of Service (QoS) in SMTs (IEEE TC 2006) • Runahead Threads for SMTs (HPCA 2008) Fetch Decode Rename Instruction Window Wakeup+ select Register file Bypass DataCache Register Write Commit Thread 1 Thread N
  • 20. Time Predictability (in multicore and SMT processors) • Where is it required: • Increasingly required in handheld/desktop devices • Also in embedded hard real-time systems (cars, planes, trains, …) • How to achieve it: • Controlling how resources are assigned to co-running tasks • Soft real-time systems • SMT: DCRA resource allocation policy (MICRO 2004, IEEE Micro 2004) • Multicores: Cache partitioning (ACM OSR 2009, IEEE Micro 2009) • Hard real-time systems • Deterministic resource ‘securing’ (ISCA 2009) • Time-Randomised designs (DAC 2014 best paper award) QoS space. Definition: • Ability to provide a minimum performance to a task • Requires biasing processor resource allocation
  • 21. Vector Architectures… Memory Latency and Power  • Out-of-Order Access to Vectors (ISCA 1992, ISCA 1995) • Command Memory Vector (PACT 1998) • In-memory computation • Decoupling Vector Architectures (HPCA 1996) • Cray SX1 • Out-of-order Vector Architectures (Micro 1996) • Multithreaded Vector Architectures (HPCA 1997) • SMT Vector Architectures (HICS 1997, IEEE MICRO J. 1997) • Vector register-file organization (PACT 1997) • Vector Microprocessors (ICS 1999, SPAA 2001) • Architectures with Short Vectors (PACT 1997, ICS 1998) • Tarantula (ISCA 2002), Knights Corner • Vector Architectures for Multimedia (HPCA 2001, Micro 2002) • High-Speed Buffers Routers (Micro 2003, IEEE TC 2006) • Vector Architectures for Data-Base (Micro 2012, HPCA2015,ISCA2016)
  • 22. Statically scheduled VLIW architectures • Power-efficient FU • Clustering • Widening (MICRO-98) • μSIMD and multimedia vector units (ICPP-05) • Locality-aware RF • Sacks (CONPAR-94) • Non-consistent (HPCA95) • Two-level hierarchical (MICRO-00) • Integrated modulo scheduling techniques, register allocation and spilling (MICRO-95, PACT-96, MICRO-96, MICRO-01)
  • 23. The MultiCore Era Moore’s Law + Memory Wall + Power Wall UltraSPARC T2 (2007) Intel Xeon 7100 (2006) POWER4 (2001) Chip MultiProcessors (CMPs)
  • 24. How Were Multicores Designed at the Beginning? IBM Power4 (2001) • 2 cores, ST • 0.7 MB/core L2, 16MB/core L3 (off-chip) • 115W TDP • 10GB/s mem BW IBM Power7 (2010) • 8 cores, SMT4 • 256 KB/core L2, 16MB/core L3 (on-chip) • 170W TDP • 100GB/s mem BW IBM Power8 (2014) • 12 cores, SMT8 • 512 KB/core L2, 8MB/core L3 (on-chip) • 250W TDP • 410GB/s mem BW
  • 25. How To Parallelize Future Applications? • From sequential to parallel codes • Efficient execution on manycore processors implies handling: • Massive amount of cores and available parallelism • Heterogeneous systems • Same or multiple ISAs • Accelerators, specialization • Deep and heterogeneous memory hierarchy • Non-Uniform Memory Access (NUMA) • Multiple address spaces • Stringent energy budget • Load balancing. Programmability Wall. (Diagram: clustered manycore with cores and accelerators, cluster interconnects, L2/L3 caches, memory controllers, DRAM and MRAM.)
  • 26. Living in the Programming Revolution. Multicores made the interface leak… ISA / API: parallel hardware with multiple address spaces (hierarchy, transfer), control flows, … Applications: parallel application logic + platform specificities
  • 27. Vision in the Programming Revolution: need to decouple again. The efforts are focused on efficiently using the underlying hardware. Applications: application logic, arch. independent. ISA / API: general purpose, single address space. Power to the runtime. PM: high-level, clean, abstract interface
  • 28. History / Strategy SMPSs V2 ~2009 GPUSs ~2009 CellSs ~2006 SMPSs V1 ~2007 PERMPAR ~1994 COMPSs ~2007 NANOS ~1996 COMPSs ServiceSs ~2010 COMPSs ServiceSs PyCOMPSs ~2013 OmpSs ~2008 OpenMP … 3.0 …. 4.0 …. StarSs ~2008 DDT @ Parascope ~1992 2008 2013 Forerunner of OpenMP GridSs ~2002
  • 29. OmpSs A forerunner for OpenMP + Prototype of tasking + Task dependences + Task priorities + Taskloop prototyping + Task reductions + Dependences on taskwaits + OMPT impl. + Multidependences + Commutative + Dependences on taskloops today
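The tasking features listed above map almost one-to-one onto what later became standard OpenMP. As a minimal, hedged sketch (plain OpenMP 4.5 spelling with depend/priority clauses; OmpSs itself uses the input/output/inout clauses shown on the next slides), a producer/consumer chain of tasks looks like this:

    #include <stdio.h>
    #define N 8

    int main(void) {
        int a[N], b[N];
        #pragma omp parallel
        #pragma omp single
        {
            for (int i = 0; i < N; i++) {
                /* producer: defines a[i] */
                #pragma omp task depend(out: a[i]) firstprivate(i) shared(a)
                a[i] = i * i;

                /* consumer: depend(in:) makes it wait for its producer;
                   priority() is only a scheduling hint */
                #pragma omp task depend(in: a[i]) depend(out: b[i]) priority(1) firstprivate(i) shared(a, b)
                b[i] = a[i] + 1;
            }
            #pragma omp taskwait          /* wait for all tasks before using b[] */
            for (int i = 0; i < N; i++)
                printf("%d ", b[i]);
            printf("\n");
        }
        return 0;
    }

Compiled with any OpenMP 4.5 compiler (e.g. gcc -fopenmp), the sixteen tasks form the same kind of dependence graph that the OmpSs runtime builds and schedules dynamically.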
  • 30. OmpSs: data-flow execution of sequential programs
  void Cholesky( float *A ) {
    int i, j, k;
    for (k=0; k<NT; k++) {
      spotrf (A[k*NT+k]);
      for (i=k+1; i<NT; i++)
        strsm (A[k*NT+k], A[k*NT+i]);
      // update trailing submatrix
      for (i=k+1; i<NT; i++) {
        for (j=k+1; j<i; j++)
          sgemm (A[k*NT+i], A[k*NT+j], A[j*NT+i]);
        ssyrk (A[k*NT+i], A[i*NT+i]);
      }
    }
  }
  #pragma omp task inout ([TS][TS]A)
  void spotrf (float *A);
  #pragma omp task input ([TS][TS]A) inout ([TS][TS]C)
  void ssyrk (float *A, float *C);
  #pragma omp task input ([TS][TS]A,[TS][TS]B) inout ([TS][TS]C)
  void sgemm (float *A, float *B, float *C);
  #pragma omp task input ([TS][TS]T) inout ([TS][TS]B)
  void strsm (float *T, float *B);
  Decouple how we write applications from how they are executed (Write vs. Execute). Clean offloading to hide architectural complexities.
  • 31. OmpSs: A Sequential Program … void vadd3 (float A[BS], float B[BS], float C[BS]); void scale_add (float sum, float A[BS], float B[BS]); void accum (float A[BS], float *sum); for (i=0; i<N; i+=BS) // C=A+B vadd3 ( &A[i], &B[i], &C[i]); ... for (i=0; i<N; i+=BS) //sum(C[i]) accum (&C[i], &sum); ... for (i=0; i<N; i+=BS) // B=sum*A scale_add (sum, &E[i], &B[i]); ... for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]); ... for (i=0; i<N; i+=BS) // E=G+F vadd3 (&G[i], &F[i], &E[i]);
  • 32. OmpSs: …Taskified… #pragma css task input(A, B) output(C) void vadd3 (float A[BS], float B[BS], float C[BS]); #pragma css task input(sum, A) inout(B) void scale_add (float sum, float A[BS], float B[BS]); #pragma css task input(A) inout(sum) void accum (float A[BS], float *sum); for (i=0; i<N; i+=BS) // C=A+B vadd3 ( &A[i], &B[i], &C[i]); ... for (i=0; i<N; i+=BS) //sum(C[i]) accum (&C[i], &sum); ... for (i=0; i<N; i+=BS) // B=sum*A scale_add (sum, &E[i], &B[i]); ... for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]); ... for (i=0; i<N; i+=BS) // E=G+F vadd3 (&G[i], &F[i], &E[i]); 1 2 3 4 13 14 15 16 5 6 87 17 9 18 10 19 11 20 12 Color/number: order of task instantiation Some antidependences covered by flow dependences not drawn Write
  • 33. Decouple how we write from how it is executed … and Executed in a Data-Flow Model #pragma css task input(A, B) output(C) void vadd3 (float A[BS], float B[BS], float C[BS]); #pragma css task input(sum, A) inout(B) void scale_add (float sum, float A[BS], float B[BS]); #pragma css task input(A) inout(sum) void accum (float A[BS], float *sum); 1 1 1 2 2 2 2 3 2 3 54 7 6 8 6 7 6 8 7 for (i=0; i<N; i+=BS) // C=A+B vadd3 ( &A[i], &B[i], &C[i]); ... for (i=0; i<N; i+=BS) //sum(C[i]) accum (&C[i], &sum); ... for (i=0; i<N; i+=BS) // B=sum*A scale_add (sum, &E[i], &B[i]); ... for (i=0; i<N; i+=BS) // A=C+D vadd3 (&C[i], &D[i], &A[i]); ... for (i=0; i<N; i+=BS) // E=G+F vadd3 (&G[i], &F[i], &E[i]); Write Execute Color/number: a possible order of task execution
  • 34. OmpSs: Potential of Data Access Info • Flat global address space seen by programmer • Flexibility to dynamically traverse dataflow graph “optimizing” • Concurrency. Critical path • Memory access: data transfers performed by run time • Opportunities for automatic • Prefetch • Reuse • Eliminate antidependences (rename) • Replication management • Coherency/consistency handled by the runtime • Layout changes Processor CPU On-chip cache Off-chip BW CPU Main Memory
  • 35. CellSs implementation (diagram): the PPU runs the user main program with the CellSs PPU lib (main thread plus helper thread), while SPU0-SPU2 run SPE threads with the CellSs SPU lib and the original task code (DMA in, task execution, DMA out, synchronization). Runtime structures and actions: memory / user data, renaming, task graph, synchronization, tasks, finalization signal, stage in/out data, work assignment, data dependence, data renaming, scheduling. The slide draws the analogy with a superscalar pipeline (IFU, DEC, REN, IQ, ISS, REG, FUs, RET). P. Bellens, et al, “CellSs: A Programming Model for the Cell BE Architecture” SC’06. P. Bellens, et al, “CellSs: Programming the Cell/B.E. made easier” IBM JR&D 2007
  • 36. Renaming @ Cell • Experiments on the CellSs (predecessor of OmpSs) • Renaming to avoid anti-dependences • Eager (similarly done at SS designs) • At task instantiation time • Lazy (similar to virtual registers) • Just before task execution P. Bellens, et al, “CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy” Sci. Prog. 2009 Main memory transfers (cold) Main Memory transfers (capacity) Killed transfers SMPSs: Stream benchmark reduction in execution time SMPSs: Jacobi reduction in # renamings
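A hedged, software-only illustration of the lazy scheme above (an assumption for clarity, not the CellSs code): the renaming decision is postponed until just before the writer task runs, and a fresh instance is allocated only if earlier readers of the region are still in flight; eager renaming would take the same decision already at task-instantiation time.

    #include <stdlib.h>
    #include <stddef.h>

    /* Hypothetical per-region bookkeeping kept by the runtime. */
    typedef struct {
        void  *addr;             /* current instance of the data region        */
        size_t len;              /* size in bytes                              */
        int    readers_pending;  /* in-flight tasks still reading this region  */
    } region_t;

    /* Lazy renaming: called right before a writer task starts executing. */
    void *resolve_output(region_t *r) {
        if (r->readers_pending > 0)
            return malloc(r->len);   /* rename: the writer gets a private instance,
                                        the old one stays valid for the readers  */
        return r->addr;              /* no WAR hazard: write in place             */
    }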
  • 37. Data Reuse @ Cell P. Bellens, et al, “CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy” Sci. Prog. 2009 Matrix-matrix multiply • Experiments on the CellSs • Data Reuse • Locality arcs in dependence graph • Good locality but high overhead → no time improvement
  • 38. Reducing Data Movement @ Cell • Experiments on the CellSs (predecessor of OmpSs) • Bypassing / global software cache • Distributed implementation • @each SPE • Using object descriptors managed atomically with specific hardware support (line level LL-SC) Main memory: cold Main memory: capacity Global software cache Local software cache P. Bellens et al, “Making the Best of Temporal Locality: Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E.” IJHPC 2010 DMA Reads
  • 39. GPUSs implementation • Architecture implications • Large local store O(GB) → large task granularity → Good • Data transfers: Slow, non overlapped → Bad • Cache management • Write-through • Write-back • Run time implementation • Powerful main processor and multiple cores • Dumb accelerator (not able to perform data transfers, implement software cache,…) Slave threads FU FU FU Helper thread IFU REG ISS IQ REN DEC RET Main thread E. Ayguade, et al, “An Extension of the StarSs Programming Model for Platforms with Multiple GPUs” Europar 2009
  • 40. Prefetching @ multiple GPUs • Improvements in runtime mechanisms (OmpSs + CUDA) • Use of multiple streams • High asynchrony and overlap (transfers and kernels) • Overlap kernels • Take overheads out of the critical path • Improvement in schedulers • Late binding of locality aware decisions • Propagate priorities J. Planas et al, “Optimizing Task-based Execution Support on Asynchronous Devices.” Submitted Nbody Cholesky
  • 41. Runtime Aware Architectures: the runtime drives the hardware design. Applications / Runtime / ISA-API layering; PM: high-level, clean, abstract interface. Task-based PM annotated by the user; data dependencies detected at runtime; dynamic scheduling. “Reuse” architectural ideas under new constraints
  • 42. Superscalar vision at Multicore level. Walls: Programmability, Resilience, Memory, Power. Superscalar World: Out-of-Order, Kilo-Instruction Processor, Distant Parallelism; Branch Predictor, Speculation; Fuzzy Computation; Dual Data Cache, Sack for VLIW; Register Renaming, Virtual Regs; Cache Reuse, Prefetching, Victim Cache; In-memory Computation; Accelerators, Different ISAs, SMT; Critical Path Exploitation; Resilience. Multicore World: Task-based, Data-flow Graph, Dynamic Parallelism; Tasks Output Prediction, Speculation; Hybrid Memory Hierarchy, NVM; Late Task Memory Allocation; Data Reuse, Prefetching; In-memory FUs; Heterogeneity of Tasks and HW; Task-criticality; Resilience; Load Balancing and Scheduling; Interconnection Network; Data Movement.
  • 43. Architecture Proposals in RoMoL (clustered manycore with per-core L1 caches and local memories, L2/L3 caches, stacked and external DRAM, and the PICOS task-dependence hardware). Cluster Interconnect: priority-based arbitration, by-pass routing. Runtime Support Unit: DVFS, light-weight dependence tracking, task memoization, reduced data motion. Vectors: DB, sorting, BTrees. Cache Hierarchy: LM usage, coherence, eviction policies, reductions.
  • 44. Runtime Management of Local Memories (LM). LM management in OmpSs: task inputs and outputs mapped to the LMs; the runtime manages the DMA transfers. Results: 8.7% speedup in execution time, 14% reduction in power, 20% reduction in network-on-chip traffic (speedup chart over jacobi, kmeans, md5, tinyjpeg, vec_add and vec_red, cache vs. hybrid configurations). Ll. Alvarez et al. Transparent Usage of Hybrid on-Chip Memory Hierarchies in Multicores. ISCA 2015. Ll. Alvarez et al. Runtime-Guided Management of Scratchpad Memories in Multicore Architectures. PACT 2015
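What “the runtime manages the DMA transfers” means in practice can be sketched as follows (dma_get/dma_put and the descriptors are made-up names used for illustration, not the OmpSs API): the dependence annotations already tell the runtime which regions a task reads and writes, so it can stage the inputs into the local memory before the task runs and write the outputs back afterwards.

    #include <stddef.h>

    typedef struct { void *global; void *local; size_t len; } arg_t;
    typedef struct {
        arg_t *in;  int n_in;     /* regions the task reads   */
        arg_t *out; int n_out;    /* regions the task writes  */
        void (*run)(void);        /* the task body itself     */
    } lm_task_t;

    void dma_get(void *dst_local, const void *src_global, size_t len);  /* assumed primitives */
    void dma_put(void *dst_global, const void *src_local, size_t len);

    void execute_on_local_memory(lm_task_t *t) {
        for (int i = 0; i < t->n_in; i++)        /* stage inputs into the LM  */
            dma_get(t->in[i].local, t->in[i].global, t->in[i].len);
        t->run();                                /* task works on LM copies   */
        for (int i = 0; i < t->n_out; i++)       /* write results back        */
            dma_put(t->out[i].global, t->out[i].local, t->out[i].len);
    }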
  • 45. OmpSs in Heterogeneous Systems. Heterogeneous systems • Big-little processors • Accelerators • Hard to program. Task-based programming models can adapt to these scenarios • Detect tasks in the critical path and run them on fast cores • Non-critical tasks can run on slower cores • Assign tasks to the most energy-efficient HW component • Runtime takes care of balancing the load • Same performance with less power consumption
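A minimal sketch of that scheduling idea, in the spirit of the criticality-aware schedulers cited later but not their actual implementation: big cores prefer the queue of critical tasks, little cores the non-critical one, and each falls back to the other queue rather than sit idle.

    typedef struct task task_t;                                        /* opaque task handle */
    typedef struct { task_t *items[256]; int head, tail; } queue_t;    /* toy task queue     */

    task_t *queue_pop(queue_t *q);                   /* assumed: returns NULL if empty */

    queue_t critical_q, non_critical_q;

    task_t *next_task(int core_is_big) {
        queue_t *preferred = core_is_big ? &critical_q     : &non_critical_q;
        queue_t *fallback  = core_is_big ? &non_critical_q : &critical_q;
        task_t *t = queue_pop(preferred);            /* queue matching the core type  */
        return t ? t : queue_pop(fallback);          /* otherwise help with the rest  */
    }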
  • 46. Architectural Support for DVFS. Reduce overheads of the software solution – Serialization in DVFS reconfigurations – User-kernel mode switches. Runtime Support Unit (RSU) – Power budget – State of cores – Criticality of running tasks. Runtime system notifies RSU – Start task execution • Criticality • Running core – End task execution. Same algorithm for DVFS reconfigurations. (Diagram: the runtime scheduler with high- and low-priority ready queues (HPRQ/LPRQ) feeds an RSU that tracks core state, task criticality and the power budget, next to the per-core DVFS controller.)
  • 47. Architectural Support for DVFS (results). Charts compare speedup and EDP for FIFO, CATS, CATA, CATA+RSU and TurboMode configurations. E. Castillo et al., CATA: Criticality Aware Task Acceleration for Multicore Processors (IPDPS’16)
  • 48. TaskSuperscalar (TaskSs) Pipeline • Hardware design for a distributed task superscalar pipeline frontend (MICRO’10) • Can be embedded into any manycore fabric • Drive hundreds of threads • Work windows of thousands of tasks • Fine grain task parallelism • TaskSs components: • Gateway (GW): Allocate resources for task meta-data • Object Renaming Table (ORT) • Map memory objects to producer tasks • Object Versioning Table (OVT) • Maintain multiple object versions • Task Reservation Stations (TRS) • Store and track task in-flight meta-data • Implementing TaskSs @ Xilinx Zynq. (Diagram: GW, ORT, OVT, TRS and Ready Queue forming the TaskSs pipeline, feeding a scheduler attached to a 16-core multicore fabric.) Y. Etsion et al, “Task Superscalar: An Out-of-Order Task Pipeline” MICRO-43, 2010
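By analogy with register renaming, the ORT can be read as a map from a memory object to the task that last produced it. A software caricature of that lookup (the real structure is hardware, described in the MICRO-43 paper) might look like this:

    #include <stdint.h>
    #include <stddef.h>

    #define ORT_ENTRIES 1024                         /* toy, direct-mapped table */

    typedef struct { void *obj; uint32_t producer; int valid; } ort_entry_t;
    static ort_entry_t ort[ORT_ENTRIES];

    static size_t ort_index(void *obj) {             /* hash of the object address */
        return ((uintptr_t)obj >> 6) % ORT_ENTRIES;
    }

    /* Record 'task' as the latest producer of 'obj'; return 1 and the previous
     * producer if there was one (the task a new consumer must wait for). */
    int ort_rename(void *obj, uint32_t task, uint32_t *prev_producer) {
        ort_entry_t *e = &ort[ort_index(obj)];
        int had_prev = e->valid && e->obj == obj;
        if (had_prev)
            *prev_producer = e->producer;
        e->obj = obj;
        e->producer = task;
        e->valid = 1;
        return had_prev;
    }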
  • 49. Architectural Support for Task Dependence Management (TDM) with Flexible Software Scheduling • Task creation is a bottleneck since it involves dependence tracking • Our hardware proposal (TDM) • takes care of dependence tracking • exposes scheduling to the SW • Our results demonstrate that this flexibility allows TDM to beat the state-of-the-art E. Castillo et al, Architectural Support for Task Dependence Management with Flexible Software Scheduling (submitted to MICRO’17)
  • 50. Approximate Task Memoization (ATM) • Approximate Task Memoization (ATM) aims at eliminating redundant computations. • ATM leverages runtime system metadata to identify tasks that can be memoized. • ATM achieves 1.4x average speedup when only applying memoization techniques (Static ATM). • ATM achieves an increased 2.5x average speedup with an average 0.7% accuracy loss with task approximation (Dynamic ATM). I. Brumar et al, ATM: Approximate Task Memoization in the Runtime System (IPDPS’17)
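A hedged sketch of the memoization mechanics only (ATM additionally decides at run time when exact or approximate reuse is profitable and bounds the accuracy loss; the hashing scheme below is an illustration, not the paper's): hash the task inputs and, if the same invocation has been seen before, return the stored output instead of re-executing the task.

    #include <stdint.h>
    #include <string.h>

    #define MEMO_SLOTS 256

    typedef struct { uint64_t key; int valid; double out; } memo_t;
    static memo_t memo[MEMO_SLOTS];

    static uint64_t fnv1a(const void *p, size_t n) {      /* FNV-1a hash of the inputs */
        const uint8_t *b = p;
        uint64_t h = 1469598103934665603ull;
        for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ull; }
        return h;
    }

    double expensive_task(const double *in, size_t n);    /* the task being memoized */

    double memoized_task(const double *in, size_t n) {
        uint64_t k = fnv1a(in, n * sizeof(double));
        memo_t *m = &memo[k % MEMO_SLOTS];
        if (m->valid && m->key == k)                      /* hit: skip the computation */
            return m->out;
        m->key = k;
        m->out = expensive_task(in, n);
        m->valid = 1;
        return m->out;
    }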
  • 51. Exploiting the Task Dependency Graph (TDG) to Reduce Coherence Traffic • To reduce coherence traffic, the state-of-the-art applies round-robin mechanisms at the runtime level. • Exploiting the information contained at the TDG level is effective to • improve performance • dramatically reduce coherence traffic (2.26x reduction with respect to the state-of-the-art). State-of-the-art Partition (DEP) Gauss-Seidel TDG DEP requires ~200GB of data transfer across a 288-core system
  • 52. Exploiting the Task Dependency Graph (TDG) to Reduce Coherence Traffic • To reduce coherence traffic, the state-of-the-art applies round-robin mechanisms at the runtime level. • Exploiting the information contained at the TDG level is effective to • improve performance • dramatically reduce coherence traffic (2.26x reduction with respect to the state-of-the-art). Graph Algorithms-Driven Partition (RIP-DEP) Gauss-Seidel TDG RIP-DEP requires ~90GB of data transfer across a 288-core system I. Sánchez et al, Reducing Data Movements on Shared Memory Architectures (submitted to SC’17)
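The intuition behind using the TDG for placement can be sketched with a simple heuristic (an illustration only, not the RIP-DEP partitioner): run each task on the socket that produced most of its inputs, so that dependent tasks share caches instead of pulling their data across sockets.

    #define MAX_SOCKETS 8

    typedef struct { int producer_socket; } dep_t;              /* incoming TDG edge   */
    typedef struct { dep_t *inputs; int n_inputs; } tdg_node_t; /* task with its edges */

    int choose_socket(const tdg_node_t *t, int nsockets) {
        int votes[MAX_SOCKETS] = {0};
        for (int i = 0; i < t->n_inputs; i++)                   /* where do my inputs live? */
            votes[t->inputs[i].producer_socket]++;
        int best = 0;
        for (int s = 1; s < nsockets; s++)
            if (votes[s] > votes[best])
                best = s;
        return best;                                            /* socket holding most inputs */
    }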
  • 53. Dealing with a New Form Of Heterogeneity • Manufacturing Variability of CPUs – Different power consumption • Power variability becomes performance heterogeneity in power constrained environments • Typical load-balancing may not be sufficient • Redistributing power and number of active cores among sockets can improve performance (charts: even distribution vs. optimal distribution)
  • 54. Dynamic Analysis and Exploration • Statically trying all configurations is not practical • Huge overhead (one execution for each configuration) • Has to be performed on each node • Online analysis: Try multiple configurations in a single run. (Charts: performance benefit and energy savings.) D. Chasapis et al, Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes (ICS’16)
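A hedged sketch of such an online exploration (apply_config, run_one_iteration and the configuration structure are assumptions, not the interface of the ICS’16 paper): each candidate split of the power budget and active cores is timed on one iteration of the application's main loop, and the best one is kept for the rest of the run.

    #include <float.h>
    #include <omp.h>

    typedef struct { double watts_s0, watts_s1; int cores_s0, cores_s1; } cfg_t;

    void apply_config(const cfg_t *c);    /* assumed: sets per-socket power caps + active cores */
    void run_one_iteration(void);         /* assumed: one iteration of the application loop     */

    cfg_t explore_online(const cfg_t *candidates, int n) {
        cfg_t best = candidates[0];
        double best_t = DBL_MAX;
        for (int i = 0; i < n; i++) {     /* one timed iteration per candidate configuration    */
            apply_config(&candidates[i]);
            double t0 = omp_get_wtime();
            run_one_iteration();
            double dt = omp_get_wtime() - t0;
            if (dt < best_t) { best_t = dt; best = candidates[i]; }
        }
        apply_config(&best);              /* keep the winner for the remaining iterations        */
        return best;
    }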
  • 55. Introduction - A New Form Of Heterogeneity • Platform: 2 x sockets with 12 core Intel Xeon E5-2695v2 • Power variability becomes performance heterogeneity in power constrained environments
  • 56. Hash Join, Sorting, Aggregation, DBMS • Goal: Vector acceleration of databases • “Real vector” extensions to x86 • Pipeline operands to the functional unit (like Cray machines, not like SSE/AVX) • Scatter/gather, masking, vector length register • Implemented in PTLSim + DRAMSim2 • Hash join work published in MICRO 2012 • 1.94x (large data sets) and 4.56x (cache resident data sets) of speedup for TPC-H • Memory bandwidth is the bottleneck • Sorting paper published in HPCA 2015 • Compare existing vectorized quicksort, bitonic mergesort, radix sort on a consistent platform • Propose novel approach (VSR) for vectorizing radix sort with 2 new instructions • Similarity with AVX512-CD instructions (but cannot use Intel’s instructions because the algorithm requires strict ordering) • Small CAM • 3.4x speedup over next-best vectorised algorithm with the same hardware configuration due to: • Transforming strided accesses to unit-stride • Eliminating replicated data structures • Ongoing work on aggregations • Reduction to a group of values, not a single scalar value ISCA 2016 • Building from VSR work (chart: speedup over scalar baseline for quicksort, bitonic, radix and vsr at MVL 8-64 with 1, 2 and 4 lanes)
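For context on what VSR vectorizes, the scalar counting/scatter pass of an LSD radix sort is shown below in plain C. The histogram update and the stable scatter are the loops a vector unit would like to run many elements at a time, and the conflict the two new VSR instructions resolve is that several lanes may hit the same bucket in the same cycle while the ordering must stay strict.

    #include <stdint.h>
    #include <string.h>

    #define RADIX_BITS 8
    #define BUCKETS    (1 << RADIX_BITS)

    /* One pass of LSD radix sort on byte 'digit' (0 = least significant). */
    void radix_pass(const uint32_t *in, uint32_t *out, size_t n, int digit) {
        size_t count[BUCKETS];
        memset(count, 0, sizeof(count));
        int shift = digit * RADIX_BITS;

        for (size_t i = 0; i < n; i++)                       /* histogram of digit values */
            count[(in[i] >> shift) & (BUCKETS - 1)]++;

        size_t offset = 0;                                   /* prefix sum: bucket starts */
        for (int b = 0; b < BUCKETS; b++) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        for (size_t i = 0; i < n; i++)                       /* stable scatter into 'out' */
            out[count[(in[i] >> shift) & (BUCKETS - 1)]++] = in[i];
    }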
  • 57. Overlap Communication and Computation • Hybrid MPI/OmpSs: Linpack example • Extend asynchronous data-flow execution to outer level • Taskify MPI communication primitives • Automatic lookahead • Improved performance • Tolerance to network bandwidth • Tolerance to OS noise P0 P1 P2 V. Marjanovic et al, “Overlapping Communication and Computation by using a Hybrid MPI/SMPSs Approach” ICS 2010
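A hedged sketch of “taskify MPI communication primitives” in plain OpenMP syntax (the original work uses SMPSs/OmpSs; initializing MPI with MPI_THREAD_MULTIPLE and calling this from inside a parallel/single region are assumptions here): communication becomes just another node of the dependence graph, so independent compute tasks overlap with the messages automatically.

    #include <mpi.h>

    void update_interior(double *u, int n);                  /* application kernels (assumed) */
    void update_boundary(double *u, const double *halo, int n);

    /* Halo exchange of one block, expressed as tasks: each MPI call depends only on
     * the buffer it touches, so the runtime can run update_interior() while the
     * send and receive are still in flight. */
    void exchange_and_update(double *interior, double *halo_out, double *halo_in,
                             int n, int left, int right) {
        #pragma omp task depend(in: halo_out[0:n])
        MPI_Send(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);

        #pragma omp task depend(out: halo_in[0:n])
        MPI_Recv(halo_in, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        #pragma omp task depend(inout: interior[0:n])        /* overlaps with the MPI tasks */
        update_interior(interior, n);

        #pragma omp task depend(in: halo_in[0:n]) depend(inout: interior[0:n])
        update_boundary(interior, halo_in, n);
    }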
  • 58. Effects on Bandwidth: flattening the communication pattern, thus reducing bandwidth requirements (*simulation on an application with a ring communication pattern). V. Subotic et al. “Overlapping communication and computation by enforcing speculative data-flow”, January 2008, HiPEAC
  • 59. Related Work • Rigel Architecture (ISCA 2009) • No L1D, non-coherent L2, read-only, private and cluster-shared data • Global accesses bypass the L2 and go directly to L3 • SARC Architecture (IEEE MICRO 2010) • Throughput-aware architecture • TLBs used to access remote LMs and migrate data across LMs • Runnemede Architecture (HPCA 2013) • Coherence islands (SW managed) + Hierarchy of LMs • Dataflow execution (codelets) • Carbon (ISCA 2007) • Hardware scheduling for task-based programs • Holistic run-time parallelism management (ICS 2013) • Runtime-guided coherence protocols (IPDPS 2014)
  • 60. RoMoL … papers • V. Marjanovic et al., “Effective communication and computation overlap with hybrid MPI/SMPSs.” PPoPP 2010 • Y. Etsion et al., “Task Superscalar: An Out-of-Order Task Pipeline.” MICRO 2010 • N. Vujic et al., “Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture.” IEEE TPDS 2010 • V. Marjanovic et al., “Overlapping communication and computation by using a hybrid MPI/SMPSs approach.” ICS 2010 • T. Hayes et al., “Vector Extensions for Decision Support DBMS Acceleration”. MICRO 2012 • L. Alvarez, et al., “Hardware-software coherence protocol for the coexistence of caches and local memories.” SC 2012 • M. Valero et al., “Runtime-Aware Architectures: A First Approach”. SuperFRI 2014 • L. Alvarez, et al., “Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories.” IEEE TC 2015
  • 61. RoMoL … papers • M. Casas et al., “Runtime-Aware Architectures”. Euro-Par 2015. • T. Hayes et al., “VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors”. HPCA 2015 • K. Chronaki et al., “Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures”. ICS 2015 • L. Alvarez et al., “Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures”. ISCA 2015 • L. Alvarez et al., “Run-Time Guided Management of Scratchpad Memories in Multicore Architectures”. PACT 2015 • L. Jaulmes et al., “Exploiting Asynchrony from Exact Forward Recoveries for DUE in Iterative Solvers”. SC 2015 • D. Chasapis et al., “PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite.” ACM TACO 2016. • E. Castillo et al., “CATA: Criticality Aware Task Acceleration for Multicore Processors.” IPDPS 2016
  • 62. RoMoL … papers • T. Hayes et al., “Future Vector Microprocessor Extensions for Data Aggregations.” ISCA 2016. • D. Chasapis et al., “Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes.” ICS 2016 • P. Caheny et al., “Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling.” PACT 2016 • T. Grass et al., “MUSA: A multi-level simulation approach for next-generation HPC machines.” SC 2016 • I. Brumar et al., “ATM: Approximate Task Memoization in the Runtime System.” IPDPS 2017 • K. Chronaki et al., “Task Scheduling Techniques for Asymmetric Multi-Core Systems.” IEEE TPDS 2017 • C. Ortega et al., “libPRISM: An Intelligent Adaptation of Prefetch and SMT Levels.” ICS 2017
  • 63. Roadmaps to Exaflop • From Tianhe-2… to Tianhe-2A, with domestic technology • From the K computer… to Post-K, with domestic technology • From the PPP for HPC… to future PRACE systems, with domestic technology • An IPCEI on HPC?
  • 64. HPC is a global competition • “The country with the strongest computing capability will host the world’s next scientific breakthroughs.” US House Science, Space and Technology Committee Chairman Lamar Smith (R-TX) • “Our goal is for Europe to become one of the top 3 world leaders in high-performance computing by 2020.” European Commission President Jean-Claude Juncker (27 October 2015) • “Europe can develop an exascale machine with ARM technology. Maybe we need an … consortium for HPC and Big Data.” Mateo Valero, Seymour Cray Award Ceremony, Nov. 2015
  • 65. HPC: a disruptive technology for Industry • “…Europe has a unique opportunity to act and invest in the development and deployment of High Performance Computing (HPC) technology, Big Data and applications to ensure the competitiveness of its research and its industries.” Günther Oettinger, Digital Economy & Society Commissioner • “The transformational impact of excellent science in research and innovation”, final plenary panel at the ICT - Innovate, Connect, Transform conference, 22 Oct 2015, Lisbon
  • 66. BSC and the EC • “Europe needs to develop an entire domestic exascale stack, from the processor all the way to the system and application software.” Mateo Valero, Director of Barcelona Supercomputing Center • Final plenary panel, “The transformational impact of excellent science in research and innovation”, ICT - Innovate, Connect, Transform conference, 22 October 2015, Lisbon, Portugal
  • 67. Mont-Blanc HPC Stack for ARM • Industrial applications • Applications • System software • Hardware
  • 68. BSC Accelerator • 512 RISC-V cores in 64 clusters, 16 GF/core: 8 TF • 4 HBM stacks (16 GB, 1 TB/s each): 64 GB @ 4 TB/s • 16 custom SCM/Flash channels (1 TB, 25 GB/s each): 16 TB @ 0.4 TB/s (consistency check below) • RISC-V ISA with a vector unit: 2048-bit vectors, 512-bit ALU (4 clk/op), 1 GHz @ Vmin • Out-of-order core: 4-wide fetch, 64 KB I$, decoupled I$/branch predictor, 2-level branch predictor, loop stream detector, 4-wide rename/retire • D$: 64 KB, 64 B/line, 128 in-flight misses, hardware prefetch • 1 MB L2 per core; D$↔L2 and L2↔mesh: 1×512-bit read + 1×512-bit write each • Each cluster holds a snoop filter • [Package diagram: compute clusters plus HBM and memory/Flash interface tiles on a silicon interposer (“cyclone” die), with HBM stacks and Flash/SCM on the package substrate]
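  As a quick consistency check of the headline figures above (simple products of the per-unit numbers quoted on the slide):

\[
\begin{aligned}
\text{Compute: } & 512 \times 16\ \text{GFLOP/s} = 8192\ \text{GFLOP/s} \approx 8\ \text{TFLOP/s} \\
\text{HBM: } & 4 \times 16\ \text{GB} = 64\ \text{GB}, \qquad 4 \times 1\ \text{TB/s} = 4\ \text{TB/s} \\
\text{SCM/Flash: } & 16 \times 1\ \text{TB} = 16\ \text{TB}, \qquad 16 \times 25\ \text{GB/s} = 0.4\ \text{TB/s} \\
\text{HBM balance: } & 4\ \text{TB/s} \,/\, 8\ \text{TFLOP/s} = 0.5\ \text{bytes per FLOP}
\end{aligned}
\]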
  • 69. Do we need an …-type consortium for HPC and Big Data? • A window of opportunity is open: • Basic industrial and scientific know-how is available • Excellent funding opportunities exist in H2020 at the European level and in the member states’ structural funds • It’s time to invest in large Flagship projects for HPC to gain critical mass • HPC European strategy & Innovation: http://ec.europa.eu/commission/2014-2019/oettinger/blog/mateo-valero-director-barcelona-supercomputing-center_en
  • 70. A New Era of Information Technology • The current infrastructure is sagging under its own weight • Drivers: pervasive connectivity, explosion of information, smart device expansion, the Internet of Things • In 60 seconds today: 400,710 ad requests; 2,000 lyrics played on TuneWiki; 1,500 pings sent on PingMe; 208,333 minutes of Angry Birds played; 23,148 apps downloaded; 98,000 tweets • From 2013 to 2020: connected devices growing to ~30 billion; data growing to ~40 trillion GB … for 8 billion people; mobile apps growing to ~10 million • Sources: (1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
  • 71. The Data Deluge • 2005: 0.1 ZB • 2010: 1.2 ZB • 2012: 2.8 ZB • 2015: 8.5 ZB • 2020: 40 ZB* (figure exceeds prior forecasts by 5 ZB) • *Source: IDC
  • 72. How big is big? • Megabyte (10^6) • Gigabyte (10^9) • Terabyte (10^12) • Petabyte (10^15): the CERN Large Hadron Collider generates 1 PB per second; 500 TB of new data per day are ingested into Facebook databases • Exabyte (10^18): 1 EB of data is created on the internet each day, about 250 million DVDs’ worth of information; the proposed Square Kilometre Array telescope will generate an EB of data per day • Zettabyte (10^21): 1.3 ZB of network traffic by 2016 • Yottabyte (10^24): this is our digital universe today, about 250 trillion DVDs • Brontobyte (10^27): this will be our digital universe tomorrow… • Geopbyte (10^30): this will take us beyond our decimal system • And beyond: Saganbyte, Jotabyte, …
  • 73. Higgs and Englert’s Nobel Prize for Physics 2013 • In 2012, one of the most computer-intensive scientific experiments ever undertaken confirmed Peter Higgs and François Englert’s theory by producing the Higgs boson, the so-called “God particle”, in an $8bn atom smasher, the Large Hadron Collider at CERN outside Geneva • “The LHC produces 600 TB/sec… and after filtering needs to store 25 PB/year”… 15 million sensors
  • 74. Big Data in Biology • High resolution imaging • Clinical records • Simulations • Omics
  • 75. Sequencing Costs • Source: National Human Genome Research Institute (NHGRI), http://www.genome.gov/sequencingcosts/ • (1) “Cost per Megabase of DNA Sequence”: the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality • (2) “Cost per Genome”: the cost of sequencing a human-sized genome • For each, a graph is provided showing the data since 2001. In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based chemistries and capillary-based instruments (“first-generation” sequencing platforms). Beginning in January 2008, the data represent the costs of generating DNA sequence using “second-generation” (or “next-generation”) sequencing platforms. The change in instruments reflects the rapid evolution of DNA sequencing technologies in recent years.
  • 76. Cognitive Computing • Collaboration agreement to jointly promote the development of advanced “deep learning” systems with applications to banking services
  • 77. Example: Cognitive Computing is already in business • In 2011 the IBM Watson computer defeated two of Jeopardy!’s greatest champions • Since then, the Watson supercomputer has become 24 times faster and smarter and 90% smaller, with a 2,400% improvement in performance • The Watson Group has collaborated with partners to build 6,000 apps
  • 78. Neural Networks • A computational model in computer science based on a collection of simple neural units • Each neural unit is connected to many others • The strengths of these connections are expressed in terms of weights • Neural units compute weighted summation functions • NNs are self-learning and can be trained • NNs are particularly good at feature detection • In practice, NNs can be expressed in terms of matrix-matrix multiplications (see the code sketch below)
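  As a hedged illustration of the last bullet (layer sizes, names and the ReLU activation are assumptions for the example, not taken from the slides), a fully connected hidden layer applied to a batch of inputs is exactly a matrix-matrix multiplication followed by an element-wise activation:

/* Illustrative only: one fully connected hidden layer as a GEMM.
   Y[batch][n_out] = relu( X[batch][n_in] x W[n_in][n_out] )        */
#include <stddef.h>

static double relu(double x) { return x > 0.0 ? x : 0.0; }

void hidden_layer(size_t batch, size_t n_in, size_t n_out,
                  const double *X,   /* batch x n_in,  row-major */
                  const double *W,   /* n_in  x n_out, row-major */
                  double *Y)         /* batch x n_out, row-major */
{
    for (size_t b = 0; b < batch; b++)
        for (size_t o = 0; o < n_out; o++) {
            double sum = 0.0;                          /* weighted summation */
            for (size_t i = 0; i < n_in; i++)
                sum += X[b * n_in + i] * W[i * n_out + o];
            Y[b * n_out + o] = relu(sum);              /* element-wise activation */
        }
}

  In practice this triple loop is replaced by an optimised GEMM (BLAS, GPU or systolic array), which is why the later slides analyse NN layers purely in terms of GEMM cost.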
  • 80. God makes them, and AI+HPC brings them together… (a play on the Spanish saying “Dios los cría y ellos se juntan”, i.e. birds of a feather flock together)
  • 82. BSC strategy for Artificial Intelligence • Application domains: social & personal data, organ simulation, Earth sciences, industrial CASE apps, medical imaging, genomic analytics, text analytics, precision medicine, other domains • Programming models and runtimes (PyCOMPSs, TIRAMISU, interoperability with current approaches) • Data models and algorithms (approximate computing / reduced precision, adaptive layers, DL/graph analytics, …) • Data platforms + standards • HW acceleration of DL workloads (novel architectures for NN, FPGA acceleration) • Projects with public/private institutions and companies
  • 83. NVIDIA Tesla P4 and P40 GPUs (2016) • Tesla P4 • # CUDA cores: 2560 @ 1063 MHz • Peak single precision: 5.5 TFLOPS • Peak INT8: 22 TOPS • Low precision: 8-bit dot-product with 32-bit accumulate (sketched below) • VRAM: 8 GB GDDR5 @ 192 GB/s • TDP: ~75 W • Tesla P40 • # CUDA cores: 3840 @ 1531 MHz • Peak single precision: 12.0 TFLOPS • Peak INT8: 47 TOPS • Low precision: 8-bit dot-product with 32-bit accumulate • VRAM: 24 GB GDDR5 @ 346 GB/s • TDP: ~250 W • Source: NVIDIA
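  The “8-bit dot-product with 32-bit accumulate” mode reduces to the following scalar arithmetic (a sketch of the operation only, not NVIDIA’s vendor intrinsics):

/* 8-bit dot product accumulated in 32 bits: each product fits in 16 bits,
   and the 32-bit accumulator keeps long dot products from overflowing. */
#include <stdint.h>
#include <stddef.h>

int32_t dot_i8_acc32(const int8_t *a, const int8_t *b, size_t n, int32_t acc)
{
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

  Packing several such narrow multiply-accumulates per 32-bit lane is what gives the roughly 4x jump from the single-precision FLOPS to the INT8 TOPS figures quoted above.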
  • 84. Google Tensor Processing Unit (2015, published 2017) • 34 GB/s off-chip memory bandwidth • 28 MB on-chip memory • Frequency: 700 MHz • TDP: 75 W • Matrix Multiply Unit: 256x256 MAC units, 8-bit multiplies and adds, 32-bit accumulators • Peak throughput: 92 TOPS (see the arithmetic below) • Power efficiency: 132 GOPS/W • GPUs for training, TPUs for inference • Used in the gameplay that beat the world Go champion • Used internally at Google for Street View and the RankBrain search optimizer • Source: Google
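  The quoted peak rate follows from the array size and clock, counting one multiply and one add per MAC:

\[
256 \times 256\ \text{MACs} \times 2\ \tfrac{\text{ops}}{\text{MAC}} \times 700\ \text{MHz} \approx 9.2 \times 10^{13}\ \text{ops/s} \approx 92\ \text{TOPS}.
\]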
  • 85. Nervana’s Lake Crest Deep Learning Architecture (2017) • The Lake Crest chip will operate as a Xeon co-processor • Tensor-based (i.e. dense linear algebra computations) • 4 x 8 GB HBM2 stacks on the same chip interposer @ 1 TB/s • Each HBM stack has its own memory controller • 12 Inter-Chip Links (ICL), 20x faster than PCIe • 12 computing nodes, each featuring several cores • Intel’s new “Flexpoint” architecture within the nodes • Flexpoint enables a 10x ILP increase and low power consumption • Source: elektroniknet
  • 86. Quantitative Analysis of Deep Learning Architectures • Each N-neuron Hidden Layer (HL) requires an NxN GEMM • A 2D NxN systolic array carries out an NxN GEMM in 2N+1 cycles (worked example below) • [Plots: seconds per GEMM, required GB/s, and OP/s versus matrix size, for array frequencies of 166 MHz, 1000 MHz, 1500 MHz and 2500 MHz]
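  Under the stated cost model (an NxN array finishing an NxN GEMM in 2N+1 cycles, two operations per MAC), the per-GEMM time and sustained rate behind the plots follow directly; N = 1024 at 1 GHz is shown as one worked point:

\[
t_{\text{GEMM}} = \frac{2N+1}{f}, \qquad \text{ops} = 2N^{3}, \qquad \text{rate} = \frac{2N^{3} f}{2N+1} \approx N^{2} f .
\]

  For \(N = 1024\) and \(f = 1\ \text{GHz}\): \(t_{\text{GEMM}} \approx 2.05\ \mu\text{s}\) and the sustained rate is \(\approx 1.0 \times 10^{15}\) ops/s, consistent with the orders of magnitude spanned by the plots.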
  • 87. Quantitative Analysis of Deep Learning Architectures • Same cost model and plots as above • An HL of ~1024 neurons can identify simple images: 28x28-pixel images, each containing a digit 0-9
  • 88. Quantitative Analysis of Deep Learning Architectures • Same cost model and plots as above • An HL of ~4096 neurons can identify images containing a single concept: 32x32-pixel images, each classified into categories like “ship”, “cat” or “deer”
  • 89. Quantitative Analysis of Deep Learning Architectures • Same cost model and plots as above • Lake Crest’s memory bandwidth (~TB/s) targets very large HLs with O(10,000-100,000) neurons • Such NNs are used for complex image analysis
  • 90. BSC Proposal for Deep Learning • 16 2D systolic arrays of 4096x4096 @ 1 GHz: 134 TOP/s • 4 HBM stacks (16 GB @ 1 TB/s each): 64 GB @ 4 TB/s • DDR5 SDRAM (384 GB @ 180 GB/s): 384 GB @ 0.18 TB/s • [Package diagram: a general-purpose SoC and the systolic arrays connected through switches, with 4 HBM stacks and their memory controllers on an interposer and DDR channels on the package]
  • 91. Human Brain Project • A 10-year, €1,000M FET flagship project • Goal: to pull together all existing knowledge about the human brain and to reconstruct the brain in supercomputer-based models and simulations • Expected outcomes: new treatments for brain disease and new brain-like computing technologies • BSC role: provision and optimisation of programming models to allow simulations to be developed efficiently; MareNostrum is part of the HPC platform for simulations
  • 92. View from Europe: SpiNNaker machine • HBP platform: 500,000 cores, 6 cabinets (including server) • Launched 30 March 2016
  • 93. IBM TrueNorth Processor • 64x64 = 4096 cores • 256 neurons/core, 64K synapses/core (chip totals worked out below) • 104 Kb/core memory: 65 Kb for synapse states, 32 Kb for neuron states/parameters, 6 Kb for router destination addresses, 1 Kb for axonal delays • 20 mW/cm2 power density • 72 mW at 0.75 V • 46 billion SOPS/Watt (Synaptic Operations Per Second) typical, 400 billion SOPS/Watt max • Compared to a state-of-the-art supercomputer at 4.5 billion FLOPS/Watt • Source: Science magazine
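  Multiplying out the per-core figures gives the chip totals:

\[
4096 \times 256 \approx 1.05 \times 10^{6}\ \text{neurons}, \qquad 4096 \times 64\,\text{K} \approx 2.68 \times 10^{8}\ \text{synapses}.
\]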
  • 94. View from Europe: Heidelberg HICANN • Wafer-scale analogue neuromorphic system • 8-inch 180 nm wafer: 200,000 neurons, 50M synapses • 10^4x faster than biology
  • 95. Quantum Computing: Brave New World of post-Moore architecture • Quantum processors: D-Wave, IBM, Microsoft, Google • View from Europe: Delft University prototypes
  • 96. D-Wave Quantum Processor • Environment colder than space • Leverages superconducting quantum effects • ~1,000 qubits, 128,000 Josephson junctions • Installed at NASA, Google, UCSB • 10^8x faster than the Quantum Monte Carlo algorithm on a single core* • *Source: Denchev et al., “What is the Computational Value of Finite-Range Tunneling?”, Phys. Rev. X, August 2016
  • 97. IBM Building a “Universal Quantum Computer” • Developed a quantum computing API to make developing quantum applications easier • Promotes experimentation on a publicly available 5-qubit quantum processor
  • 98. Microsoft and Google • Microsoft is looking into topological quantum computing in its global “Station Q” research consortium • Microsoft’s “QuArC” lab in Redmond works actively on quantum computer architecture • Google built a 9-qubit quantum computer in its Quantum AI Lab • Google’s ambition is to produce a viable quantum computer in the next five years* • *Mohseni et al., “Commercialize quantum technologies in five years”, Nature (Comment), March 2017
  • 99. View from Europe: Delft Quantum Prototypes • A 50M Euro grant from Intel • Building a hybrid CMOS/quantum processor • Working on algorithms, compilers, and architecture* • *Riesebos et al., “Pauli Frames for Quantum Computer Architectures”, to appear in DAC 2017