V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran
University of Belgrade

Oskar Mencer
Imperial College, London

Oliver Pell

Maxeler Technologies, London and Palo Alto

Michael Flynn
Stanford University, Palo Alto, USA

Valentina Balas
Aurel Vlaicu University of Arad, Romania, Maxeler Ambassador

1/83
An Alternative Title:
How to hire
more than 1000 PhD students
at no additional cost
to taxpayers?
For Big Data algorithms
and for the same hardware price as before,
achieving:
a) speed-up, 20-200
b) monthly electricity bills, reduced 20 times
c) size, 20 times smaller
The major issues of engineering are: design cost and design complexity.
Remember, economy has its own rules: production count and market demand!
3/83
Elaboration :)
If a computer center spends €50M/year on electricity bills,
and moves most of its time-consuming algorithms to Maxeler,
which uses 20 times less power,
the yearly spending drops to €2.5M,
and €47.5M is saved for taxpayers :)
If the average net salary of a PhD student in Germany is €1500 per month,
and if the overhead factor is 1.00,
it is easy to calculate
that €47.5M can pay roughly 2,600 PhD students to work for one year,
and that can go on year after year :)
If the overhead factor is about 2.6
(I do not know how big it is, but it is less than 2.6, for sure),
one can hire 1000 PhD students, at no additional cost :)
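The arithmetic on this slide can be checked with a short sketch. It assumes the salary figure is a monthly net salary paid for 12 months and that the whole workload moves to the lower-power machine; the function and variable names are illustrative only. Under those assumptions the headcount lands in the neighborhood of 2,600 students at an overhead factor of 1.00, and the overhead factor that still pays for 1000 students is close to 2.6.

```python
def students_affordable(savings_eur, net_monthly_salary_eur, overhead_factor, months=12):
    """Number of full student-years that a yearly saving can pay for."""
    yearly_cost_per_student = net_monthly_salary_eur * months * overhead_factor
    return int(savings_eur // yearly_cost_per_student)


yearly_bill_eur = 50_000_000      # electricity bill from the slide
power_reduction = 20              # "20 times less power"
savings_eur = yearly_bill_eur - yearly_bill_eur / power_reduction

students = students_affordable(savings_eur, 1500, 1.00)

# Overhead factor at which the saving pays for exactly 1000 students.
break_even_overhead = savings_eur / (1500 * 12 * 1000)

print(f"Saving per year: EUR {savings_eur:,.0f}")
print(f"Students at overhead 1.00: {students}")
print(f"Overhead factor for 1000 students: {break_even_overhead:.3f}")
```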
1. Over 95% of run time in loops [loops to almost zero]
2. Reusability of data (e.g., x + x^2 + x^3 + x^4 + …) [how close to zero?]
3. BigData [prog: for data streaming, not for data control]
4. Latency
5. A new programming model
6. WORM [prog. effort + comp. time]
Use a tractor, not a Ferrari, to drive over a plowed field
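
The data-reuse item in the list above can be illustrated with a small sketch. It reads the slide's example as a sum of successive powers of x, which is an assumption, and the function name is hypothetical. Each term is built from the previous one, so a value that has been fetched once is reused many times before new data is needed.

```python
def power_sum(x, terms):
    """Return x + x^2 + ... + x^terms, reusing the previous power each step."""
    total = 0.0
    power = 1.0
    for _ in range(terms):
        power *= x        # reuse: next power comes from the one already computed
        total += power
    return total


print(power_sum(2.0, 4))  # 2 + 4 + 8 + 16
```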

5/83
Absolutely all results achieved in Europe:
a) All hardware produced in Europe,
specifically UK
b) All software generated by programmers
of EU and WB

6/83
ControlFlow (MultiFlow and ManyFlow):
• Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …)

DataFlow:
• Coarse Grain (HEP) vs. Fine Grain (Maxeler)
The history starts in the 1960s!
The enabler technology did not exist before the year 2000!


7/83
Compiling below the machine code level brings speedups,
as well as lower power, smaller size, and lower cost.
The price to pay:
The machine is more difficult to program.
Consequently:
Ideal for WORM applications :)
Examples using Maxeler:
GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%),
M&C (New York City), Datamining (Google), …
8/83
Simulator builder
Hardware builder
9
2n+3
10/83
Why Java? Minimal Kolmogorov Complexity, etc…

11/83
12
13
$t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$

$t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$

$t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$

Assumptions:
1. Software includes enough parallelism to keep all cores busy
2. The only limiting factor is the number of cores.
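
The three timing formulas above can be evaluated with a minimal sketch. The slide does not define its symbols, so the reading used here is an assumption: N is the number of data items, N_OPS the operations per item, C the clock cycles per operation, T_clk the clock period, and N_cores or N_DF the number of parallel units. The sample values are hypothetical and only show the shape of the model, not measured results.

```python
def t_controlflow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """Execution time for the CPU or GPU formula: every item pays the full cost."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_dataflow(n, n_ops, cycles_per_op, t_clk, n_df):
    """Execution time for the dataflow formula: pipeline fill, then one result per clock."""
    fill = n_ops * cycles_per_op * t_clk
    stream = (n - 1) * t_clk / n_df
    return fill + stream


# Hypothetical parameters, chosen only for illustration.
N = 1_000_000
N_OPS = 100
cpu = t_controlflow(N, N_OPS, 1, 1 / 3e9, 8)      # 3 GHz, 8 cores
gpu = t_controlflow(N, N_OPS, 1, 1 / 1e9, 1000)   # 1 GHz, 1000 cores
df = t_dataflow(N, N_OPS, 1, 1 / 200e6, 1)        # 200 MHz, one pipeline

print(f"t_CPU = {cpu:.6f} s, t_GPU = {gpu:.6f} s, t_DF = {df:.6f} s")
```

In this model the dataflow time is dominated by the streaming term once N is large, so the pipeline depth (the N_OPS term) is paid only once rather than once per item.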

14/83
DualCore?

Which way are the horses going?

15/83
Is it possible to use 2000 chickens instead of two horses?

What is better, in reality and anecdotally?
16/83
2 x 1000 chickens (CUDA and rCUDA)
17/83
Data

How about 2 000 000
ants?
18/83
Big Data Input

Results

Marmalade

19/83
Factor: 20 to 200
MultiCore/ManyCore

Dataflow

Machine Level Code

Gate Transfer Level
20/83
Factor: 20
MultiCore/ManyCore

Dataflow

21/83
Factor: 20
MultiCore/ManyCore

DataFlow

Data Processing

Data Processing
Process Control

Process Control

22/83
MultiCore:
• Explain what to do, to the driver
• Caches, instruction buffers, and predictors needed

ManyCore:
• Explain what to do, to many sub-drivers
• Reduced caches and instruction buffers needed

DataFlow:
• Make a field of processing gates: 1C+2nJava+3Java
• No caches, etc. (300 students/year: BGD, BCN, LjU, ICL,…)
23/83
MultiCore:
• Business as usual

ManyCore:
• More difficult

DataFlow:
• Much more difficult
• Debugging both application and configuration code

24/83
MultiCore/ManyCore:
• Several minutes

DataFlow:
• Several hours for the real hardware
• Fortunately, only several minutes for the simulator, several seconds for reload (90% due to DRAM inertia), and several milliseconds to restart
• The simulator supports both the large JPMorgan machine and the smallest “University Support” machine
• Good news: Tabula@2GHz
25/83
26/83
MultiCore:
• Horse stable

ManyCore:
• Chicken house

DataFlow:
• Ant hole

27/83
MultiCore:
• Haystack

ManyCore:
• Cornbits

DataFlow:
• Crumbs

28/83
Small Data: Toy Benchmarks (e.g., Linpack)
29/83
Medium Data (benchmarks favoring NVidia, compared to Intel, …)

30/83
Big Data

31/83
Maxeler Hardware

CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM

DFEs shared over Infiniband: Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers

Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections

MaxWorkstation: Desktop development system

MaxCloud: On-demand scalable accelerated compute resource, hosted in London

32/83
Major Classes of Algorithms,
from the Computational Perspective
1. Coarse grained, stateful: Business
– CPU requires DFE for minutes or hours
– Interrupts

2. Fine grained, transactional with shared database: DM
– CPU utilizes DFE for ms to s
– Many short computations, accessing common database data

3. Fine grained, stateless transactional: Science (Phy, ...)
– CPU requires DFE for ms to s
– Many short computations

33/83
Coarse Grained: Modeling

34/83

• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing

[Figure: Timesteps (thousand), Domain points (billion), and Total computed points (trillion) vs. Peak Frequency (Hz)]
[Figure: Equivalent CPU cores vs. Number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]
Fine Grained, Shared Data: Monitoring
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
– Text search against documents
– Shortest distance to coordinate (multi-dimensional)
– Smith Waterman sequence alignment for genomes

• Any CPU runs on any DFE
that has been loaded with the database
– MaxelerOS may add or remove DFEs
from the processing group to balance system demands
– New DFEs must be loaded with the search DB before use
35/83
Fine Grained, Stateless: The BSOP Control
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
• ≈50x MPC-X vs. multi-core x86 node
• Each transaction executes on any DFE in the assigned group atomically

[Figure labels: CPU; Market and instruments data; DFE: Loop over instruments, Random number generator and sampling of underliers, Price instruments using Black Scholes; Instrument values; Tail analysis on CPU]

36/83
Selected Examples:
Business,
Mathematics,
GeoPhysics, etc.
37/83
38
An MIS Example: Credit Derivatives
Orbital station

Climber

Tether

HW
41
Seismic Imaging

• Running on MaxNode servers
- 8 parallel compute pipelines per chip
- 150MHz => low power consumption!
- 30x faster than microprocessors

An Implementation of the Acoustic Wave Equation on FPGAs
T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§
†Chevron, ‡Maxeler, §Formerly Chevron, SEG 2008
42/83
The CRS Results
• Performance of one MAX2 card vs. 1 CPU core
• Land case (8 params), speedup of 230x
• Marine case (6 params), speedup of 190x

[Figure panels: CPU Coherency; MAX2 Coherency]

43/83
44
46/83
P. Marchetti et al, 2010

Trace Stacking: Speed-up 217
• DM for Monitoring and Control in Seismic processing
• Velocity independent / data driven method
to obtain a stack of traces, based on 8 parameters
• Search for every sample of each output trace
$$t_{hyp}^2 = \left(t_0 + \frac{2}{v_0}\, w^T m\right)^2 + \frac{2 t_0}{v_0}\left(m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h\right)$$

2 parameters (emergence angle & azimuth)
3 Normal Wave front parameters (K_N,11; K_N,12; K_N,22)
3 NIP Wave front parameters (K_NIP,11; K_NIP,12; K_NIP,22)
47/83
Maxeler running Smith Waterman

48
Molecular Correlates of Tumor Signatures
from a Large Cohort
From whole slide sections, of a cohort,
to pathway analysis (Prof Bahram Parvin,
Berkeley)

High Content Analysis (HCA) on MPC-X
51
Conclusion: Nota Bene
This is about algorithmic changes,
to maximize
the algorithm to architecture match:
algorithmic modifications,
pipeline utilization,
data choreography,
and
decision making precision.
The winning paradigm of Big Data ExaScale?

52/83
Algorithmic Changes: Data Dependencies
[Figure: data-dependency graphs over PSI[0] … PSI[N-1] and cbeta[0] … cbeta[N-2], built from OP and OP’ nodes]

Example generated by Sasa Stojanovic (Gross-Pitaevskii)

53/83
Pipeline Changes: Higher Efficiency
[Figure: pipeline diagram with inputs X[0,0], X[0,1], stages holding [7,0] down to [0,0], and output R[0,0]]

Example generated by Sasa Stojanovic (Gross-Pitaevskii)
54/83
Data Choreography: Pipeline Utilization
Example generated by Sasa Stojanovic (Gross-Pitaevskii)

[Figure: order of data accesses inside a burst]

55/83
Fixed Point: Savings Reinvestable
• Consider fixed point
compared to single precision floating point
• If the range is tightly confined,
one could use 24-bit fixed point
• If data has a wider range, may need 32-bit fixed point
            hwFloat(8,24)   hwFix(24,...)   hwFix(32,...)
Add         500 LUTs        24 LUTs         32 LUTs
Multiply    2 DSPs          2 DSPs          4 DSPs


• Arithmetic is not 100% of the chip.
In practice, often ~5x performance boost from fixed point.
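
The trade-off on this slide can be tried out in software with a minimal sketch of fixed-point quantization. This is plain Python arithmetic, not the Maxeler hwFix type; the split of 24 total bits into 20 fractional bits is a hypothetical choice for a tightly confined value range, and the helper names are illustrative.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize a real value to a signed fixed-point integer, saturating on overflow."""
    raw = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, raw))


def from_fixed(raw, frac_bits):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


TOTAL_BITS = 24   # assumed word width, matching the 24-bit case on the slide
FRAC_BITS = 20    # hypothetical: leaves 4 bits (including sign) for the integer part

for sample in (0.1, 0.5, -3.25, 100.0):
    raw = to_fixed(sample, TOTAL_BITS, FRAC_BITS)
    back = from_fixed(raw, FRAC_BITS)
    print(f"{sample:>8} -> raw {raw:>9} -> {back:.7f} (error {abs(back - sample):.2e})")
```

Values inside the representable range come back with an error of at most half of the smallest step, while 100.0 saturates, which is the reason a wider range may need the 32-bit format.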
56
• Revisiting the Top 500 SuperComputers benchmarks
  – Our paper in Communications of the ACM
• Revisiting all major Big Data DM algorithms
• Massive static parallelism at low clock frequencies
• Concurrency and communication
  – Concurrency between millions of tiny cores difficult, “jitter” between cores will harm performance at synchronization points
• Reliability and fault tolerance
  – 10-100x fewer nodes, failures much less often
• Memory bandwidth and FLOP/byte ratio
  – Optimize data choreography, data movement, and the algorithmic computation
• New architecture of n-Programming paradigms
57/83
FP7: RoMoL@BCN

The SAB goal: Out of box thinking!
58/83
FP7: BalCon@SRB

The vision of Alkis Konstantellos

The SAB goal: Seed for new proposals!
59/83
DAFNE: Leader MISANU

60/83
DAFNE = South (MaxCode) + North (BigData)
South (MaxCode): MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI
North (BigData): UK, Sweden, Norway, Denmark, Germany, France, Austria, Switzerland, Poland, Hungary
61/83
The DAFNE Map

62/83
The TriPeak @
DATAMAN

Siena
+ BSC
+ Imperial College
+ Maxeler
+ Belgrade

63/83
The TriPeak: Essence
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage?
MontBlanc (ompSS) and Maxeler (an accelerator)
In each happy marriage,
it is known who does what :)
The Big Data DM algorithms:
What part goes to MontBlanc and what to Maxeler?
64/83
TriPeak: Core of the Symbiotic Success
An intelligent DM algorithmic scheduler,
partially implemented for compile time,
and partially for run time.
At compile time:
Checking what part of code fits where
(MontBlanc or Maxeler): LoC 1M vs 2K vs 20K
At run time:
Rechecking the compile time decision,
based on the current data values.
65/83
66/83
Maxeler: Research (Google: good method)

Structure of a Typical Research Paper: Scenario #1
[Comparison of Platforms for One Algorithm]
Curve A: MultiCore of approximately the same PurchasePrice
Curve B: ManyCore of approximately the same PurchasePrice
Curve C: Maxeler after a direct algorithm migration
Curve D: Maxeler after algorithmic improvements
Curve E: Maxeler after data choreography
Curve F: Maxeler after precision modifications

Structure of a Typical Research Paper: Scenario #2
[Ranking of Algorithms for One Application]
CurveSet A: Comparison of Algorithms on a MultiCore
CurveSet B: Comparison of Algorithms on a ManyCore
CurveSet C: Comparison on Maxeler, after a direct algorithm migration
CurveSet D: Comparison on Maxeler, after algorithmic improvements
CurveSet E: Comparison on Maxeler, after data choreography
CurveSet F: Comparison on Maxeler, after precision modifications

67/83
Maxeler Research in Serbia:
Special Issue of IPSI Transactions Journal
KG: Blood Flow, Tijana Djukic and Prof. Filipovic

NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic
MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic
ETF: Meteorology, Radomir Radojicic and Marko Stankovic
ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic
ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic
68/83
Maxeler Research WorldWide:
Special Issue of Advances in Computers @ SCI

Stanford, Texas,
Imperial, Maxeler,
ETF, MF, MISANU, IMP, KG, NS,
BSC, UPV,
U of Siena, U of Roma,
IJS, FRI, …

69/83
© H. Maurer

70/83
Maxeler: Teaching (Google: prof vm)
TEACHING, VLSI, PowerPoints, Maxeler:
Maxeler Veljko Explanations, August 2012
Maxeler Veljko Anegdotic,
Maxeler Oskar Talk, August 2012
Maxeler Forbes Article
Flyer by JP Morgan
Flyer by Maxeler HPC
Tutorial Slides by Sasha and Veljko: Practice (Current Update)
Paper, unconditionally accepted for Advances in Computers by Elsevier
Paper, unconditionally accepted for Communications of the ACM
Tutorial Slides by Oskar: Theory (7 parts)
Slides by Jacob, New York
Slides by Jacob, Alabama
Slides by Sasha: Practice (Current Update)
Maxeler in Meteorology
Maxeler in Mathematics
Examples generated in Belgrade and Worldwide
THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN,
with an example
71/83
Maxeler PreConference Tutorials (2013)
Google:
IEEE HiPeak, Berlin, Germany, January 2013
ACM iSAC, Coimbra, Portugal, March 2013
IEEE MECO, Budva, Montenegro, June 2013
ACM ISCA, Tel Aviv, Israel, June 2013

72/83
Maxeler InHouse Tutorials (2013)

73/83
© H. Maurer

74/83
Maxeler University Program Members

75/83
How to Become a Family Member?
Options to consider:
a. MAX-UP free of charge
b. Purchasing a university-level machine
(min about $10K)
c. Purchasing a JPM-level machine
(slowly approaching $100M),
or at least a Schlumberger-level machine
(slowly moving above $10M)
76/83
Good to Know!

Maxeler employs close to 100 people, GBR and USA:
a. Maxeler cash burn per year = about $10M
b. If a university-level machine is sold at the 100% profit margin,
the company life of Maxeler is extended for about 2 hours.
c. If a university-level machine is sold at the 1% profit margin,
the company life of Maxeler is extended for 1 minute.
Our past or ongoing FP7 projects requiring Maxeler speeds:
a. ProSense
b. ARTreat
c. HiPEAC

77/83
The Educational Mission
Important note:
a. Total number of accredited universities in the whole world?
b. As per WeboMetrics, about 20000.
c. Consequently, all universities of the world together bring only:
20000 minutes of extra life, or about two weeks of extra life.
The reality:
a. University-level machines are sold at the ZERO profit margin!
b. Only the Xilinx costs, handling, and shipping.
c. Email support for students doing their theses is practically unlimited!
Conclusion: This is a chance for those who jump in first :)
78/83
Our Work Impacting Maxeler
Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix
Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact
factor 2.205/2010).
Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple
Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor
2.205/2010).
Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network
Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS
ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227.
Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations
Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE
TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040
(impact factor 1.822/2010).
79/83
Maxeler Impacting Our Work
Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of Interdisciplinary Education on Technology-Driven Application Design, IEEE Transactions on Education, August 2011, pp. 462-470 (impact factor 1.328/2010).
Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM Transactions on Storage (TOS), Volume 7, Issue 1, June 2011, ACM New York, NY, USA.
Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix Multiplication, IET Computers & Digital Techniques, 2012, 6, (4), pp. 249-256.
Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to Petadata per Unit of Time and Power (on Sophisticated Benchmarks), Communications of the ACM, May 2013 (impact factor 1.919/2010).
80/83
Current Main Efforts of Maxeler
1. To encourage a lot of software to be written/ported.
This is a key business opportunity that needs to be developed.
2. Maxeler is building up a website and a community
to share software for DFEs.
This would allow the software to also be sold
directly from the Maxeler website.
3. If a PhD student ports an important piece of software
to a Maxeler machine,
she/he could become the first software vendor in the world
for dataflow computers,
and Maxeler would be happy to help sell licenses.
81/83
Current Side Efforts of Maxeler
1. Developing new tools that make kernels easier to build.
2. Bringing new languages to Maxeler:
C, C++, MATLAB, Mathematica
3. Porting popular application packages to Maxeler:
OpenSees, etc.
4. Trying the Tabula FPGA!
5. Getting more than 1 TeraByte/sec through I/O
6. Minimizing the hardware, so it can go into Galaxy 5,6…
82/83
NewTools: MaxSkins
[Figure: MaxSkins tool flow. Labels: Dataflow Design (.maxj); MaxCompiler; .max file; Custom Engine Interfaces (.c); Testing / Application integration; MaxCompiler App Packager; App Installer; .max file developer; .max file user; SLiC level programming: MATLAB (.mex, .m), C/C++, R, Excel, Python]

83/83
Getting Started with Practical Work from the Linux Shell
1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal).
2. Connect to the Maxeler machine (e.g., $ ssh root@147.91.12.216).
3. If more shell screens are needed, start screen (e.g., $ screen).
4. Switch to the directory that contains the 2n+3 programs you wrote (e.g., $ cd Desktop/workspace/src/ind/z88/).
5. Prepare your C code for measuring the execution time (e.g., clock_gettime(CLOCK_REALTIME, &t2);).
6. See what you can do (e.g., $ make).
7. Select one of those that you can do (e.g., $ make build-sim, $ make run-sim, $ make build-hw, $ make run-hw).
8. Measure the power consumption at the wall plug.
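
The timing step in the list above brackets the measured code with two clock readings. The sketch below shows the same pattern in Python rather than C, as an analogue only: it uses time.perf_counter, a monotonic clock, and the helper name and the workload are hypothetical stand-ins for the call that runs the kernel.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) from two clock readings."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical workload standing in for the code being measured.
result, elapsed = timed(sum, range(1_000_000))
print(f"result = {result}, elapsed = {elapsed:.6f} s")
```

A monotonic clock is generally the safer choice for intervals, because a wall-clock source can jump if the system time is adjusted during the run.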
84/83
Q&A

vm@etf.rs

© H. Maurer

85/83

More Related Content

What's hot

GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoSciCompIIT
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Slide_N
 
Supercomputer - Overview
Supercomputer - OverviewSupercomputer - Overview
Supercomputer - OverviewARINDAM ROY
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...Ganesan Narayanasamy
 
The Rise of Small Satellites
The Rise of Small SatellitesThe Rise of Small Satellites
The Rise of Small Satellitesmooctu9
 
Miniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellitesMiniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellitesX. Breogan COSTA
 
Effective machine learning_with_tpu
Effective machine learning_with_tpuEffective machine learning_with_tpu
Effective machine learning_with_tpuAthul Suresh
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIinside-BigData.com
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensOscar Law
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computerPriya Manik
 

What's hot (20)

Super Computer
Super ComputerSuper Computer
Super Computer
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
 
Supercomputers
SupercomputersSupercomputers
Supercomputers
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!
 
Supercomputer - Overview
Supercomputer - OverviewSupercomputer - Overview
Supercomputer - Overview
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
 
The Rise of Small Satellites
The Rise of Small SatellitesThe Rise of Small Satellites
The Rise of Small Satellites
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
Nbvtalkatjntuvizianagaram
NbvtalkatjntuvizianagaramNbvtalkatjntuvizianagaram
Nbvtalkatjntuvizianagaram
 
Miniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellitesMiniaturizing Space: Small-satellites
Miniaturizing Space: Small-satellites
 
supercomputer
supercomputersupercomputer
supercomputer
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Introduction to SLURM
Introduction to SLURMIntroduction to SLURM
Introduction to SLURM
 
Effective machine learning_with_tpu
Effective machine learning_with_tpuEffective machine learning_with_tpu
Effective machine learning_with_tpu
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware Challegens
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computer
 
Super computer 2017
Super computer 2017Super computer 2017
Super computer 2017
 

Similar to Anegdotic Maxeler (Romania)

Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010TELECOM I+D
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balasValentina Emilia Balas
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE
 
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...
Yufeng Guo - Tensor Processing Units: how TPUs enable the next generation of ...Codemotion
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...EUDAT
 
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontierinside-BigData.com
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overviewNabil Chouba
 
Energy Efficient Computing - 26mar13
Energy Efficient Computing - 26mar13Energy Efficient Computing - 26mar13
Energy Efficient Computing - 26mar13Ian Phillips
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Ls catalog thiet bi tu dong master p 5000-e
Ls catalog thiet bi tu dong master p 5000-eLs catalog thiet bi tu dong master p 5000-e
Ls catalog thiet bi tu dong master p 5000-eDien Ha The
 
Ls catalog thiet bi tu dong master p 5000-e_dienhathe.vn
Ls catalog thiet bi tu dong master p 5000-e_dienhathe.vnLs catalog thiet bi tu dong master p 5000-e_dienhathe.vn
Ls catalog thiet bi tu dong master p 5000-e_dienhathe.vnDien Ha The
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
Qcom XR Workshop Sept 2020
Qcom XR Workshop Sept 2020Qcom XR Workshop Sept 2020
Qcom XR Workshop Sept 2020Eiko Seidel
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 

Similar to Anegdotic Maxeler (Romania) (20)

Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010
 
Data flow super computing valentina balas
Data flow super computing   valentina balasData flow super computing   valentina balas
Data flow super computing valentina balas
 
Barcelona Supercomputing Center, Generador de Riqueza
Anegdotic Maxeler (Romania)

  • 1. V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran (University of Belgrade); Oskar Mencer (Imperial College, London); Oliver Pell (Maxeler Technologies, London and Palo Alto); Michael Flynn (Stanford University, Palo Alto, USA); Valentina Balas (Aurel Vlaicu University of Arad, Romania, Maxeler Ambassador)
  • 2. An Alternative Title: How to hire more than 1000 PhD students at no additional cost to taxpayers?
  • 3. For Big Data algorithms and for the same hardware price as before, achieving: a) speed-up, 20-200; b) monthly electricity bills, reduced 20 times; c) size, 20 times smaller. The major issues of engineering are: design cost and design complexity. Remember, economy has its own rules: production count and market demand!
  • 4. Elaboration :) If a computer center spends E50M/year on electricity bills, and moves most of its time-consuming algorithms to Maxeler, which uses 20 times less power, the yearly spending drops down to E2.5M, and E47.5M is saved for taxpayers :) If the average net salary of a PhD student in Germany is E1500 per month, and if the overhead factor is 1.00, it is easy to calculate that E47.5M can pay 2611 PhD students to work for one year, and that can go year after year :) If the overhead factor is 2.611 (I do not know how big it is, but it is less than 2.611, for sure), one can hire 1000 PhD students, at no additional cost :)
  • 5. 1. Over 95% of run time in loops [loops to almost zero]; 2. Reusability of data (e.g., x+x^2+x^3+x^4+…) [how close to zero?]; 3. BigData [prog: for data streaming, not for data control]; 4. Latency; 5. A new programming model; 6. WORM [prog. effort + comp. time]. Use a tractor, not a Ferrari, to drive over a plowed field.
  • 6. Absolutely all results achieved in Europe: a) All hardware produced in Europe, specifically UK; b) All software generated by programmers of EU and WB.
  • 7. ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …). DataFlow: Coarse Grain (HEP) vs. Fine Grain (Maxeler). The history starts in the 1960s! The enabler technology did not exist before the year 2000!
  • 8. Compiling below the machine code level brings speedups, as well as lower power, size, and cost. The price to pay: The machine is more difficult to program. Consequently: Ideal for WORM applications :) Examples using Maxeler: GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%), M&C (New York City), Datamining (Google), …
  • 11. Why Java? Minimal Kolmogorov Complexity, etc…
  • 14. $t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$; $t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$; $t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$. Assumptions: 1. Software includes enough parallelism to keep all cores busy. 2. The only limiting factor is the number of cores.
  • 15. DualCore? Which way are the horses going?
  • 16. Is it possible to use 2000 chickens instead of two horses? What is better, real and anecdotic?
  • 17. 2 x 1000 chickens (CUDA and rCUDA)
  • 18. How about 2 000 000 ants?
  • 20. Factor: 20 to 200. MultiCore/ManyCore: Machine Level Code. Dataflow: Gate Transfer Level.
  • 22. Factor: 20. MultiCore/ManyCore: Data Processing, Process Control. DataFlow: Data Processing, Process Control.
  • 23. MultiCore: Explain what to do, to the driver; caches, instruction buffers, and predictors needed. ManyCore: Explain what to do, to many sub-drivers; reduced caches and instruction buffers needed. DataFlow: Make a field of processing gates: 1C+2nJava+3Java; no caches, etc. (300 students/year: BGD, BCN, LjU, ICL,…)
  • 24. MultiCore: Business as usual. ManyCore: More difficult. DataFlow: Much more difficult; debugging both application and configuration code.
  • 25. MultiCore/ManyCore: Several minutes. DataFlow: Several hours for the real hardware. Fortunately, only several minutes for the simulator, several seconds for reload (90% due to DRAM inertia), and several milliseconds to restart. The simulator supports both the large JPMorgan machine and the smallest “University Support” machine. Good news: Tabula@2GHz
  • 27. MultiCore: Horse stable. ManyCore: Chicken house. DataFlow: Ant hole.
  • 29. Small Data: Toy Benchmarks (e.g., Linpack)
  • 32. Maxeler Hardware. CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM. DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers. Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections. MaxWorkstation: desktop development system. MaxCloud: on-demand scalable accelerated compute resource, hosted in London.
  • 33. Major Classes of Algorithms, from the Computational Perspective. 1. Coarse grained, stateful: Business; CPU requires DFE for minutes or hours; interrupts. 2. Fine grained, transactional with shared database: DM; CPU utilizes DFE for ms to s; many short computations, accessing common database data. 3. Fine grained, stateless transactional: Science (Phy, ...); CPU requires DFE for ms to s; many short computations.
  • 34. Coarse Grained: Modeling. Long runtime, but: memory requirements change dramatically based on modelled frequency; number of DFEs allocated to a CPU process can be easily varied to increase available memory; streaming compression; boundary data exchanged over chassis MaxRing. [Figures: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz); equivalent CPU cores vs. number of MAX2 cards (1, 4, 8) at 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]
  • 35. Fine Grained, Shared Data: Monitoring. DFE DRAM contains the database to be searched. CPUs issue transactions find(x, db). Complex search function: text search against documents; shortest distance to coordinate (multi-dimensional); Smith Waterman sequence alignment for genomes. Any CPU runs on any DFE that has been loaded with the database: MaxelerOS may add or remove DFEs from the processing group to balance system demands; new DFEs must be loaded with the search DB before use.
  • 36. Fine Grained, Stateless: The BSOP Control. Analyse > 1,000,000 scenarios. Many CPU processes run on many DFEs. Each transaction executes on any DFE in the assigned group atomically. ≈50x MPC-X vs. multi-core x86 node. [Figure: CPUs supply market and instruments data and run tail analysis on the returned instrument values; each DFE loops over instruments, runs the random number generator and sampling of underliers, and prices instruments using Black Scholes]
  • 39. An MIS Example: Credit Derivatives
  • 42. Seismic Imaging. Running on MaxNode servers: 8 parallel compute pipelines per chip; 150MHz => low power consumption!; 30x faster than microprocessors. Source: T. Nemeth, J. Stefani, W. Liu (Chevron), R. Dimond, O. Pell (Maxeler), R. Ergas (formerly Chevron), An Implementation of the Acoustic Wave Equation on FPGAs, SEG 2008.
  • 43. The CRS Results. Performance of one MAX2 card vs. 1 CPU core. Land case (8 params), speedup of 230x. Marine case (6 params), speedup of 190x. [Figure panels: CPU Coherency, MAX2 Coherency]
  • 47. Trace Stacking: Speed-up 217 (P. Marchetti et al, 2010). DM for Monitoring and Control in Seismic processing. Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters. Search for every sample of each output trace: $t_{hyp}^2 = \left(t_0 + \frac{2}{v_0} w^T m\right)^2 + \frac{2 t_0}{v_0}\left(m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h\right)$, with 2 parameters (emergence angle & azimuth), 3 Normal Wave front parameters ($K_{N,11}$; $K_{N,12}$; $K_{N,22}$), and 3 NIP Wave front parameters ($K_{NIP,11}$; $K_{NIP,12}$; $K_{NIP,22}$).
  • 48. Maxeler running Smith Waterman
  • 50. Molecular Correlates of Tumor Signatures from a Large Cohort. From whole slide sections, of a cohort, to pathway analysis (Prof Bahram Parvin, Berkeley). High Content Analysis (HCA) on MPC-X
  • 52. Conclusion: Nota Bene. This is about algorithmic changes, to maximize the algorithm to architecture match: algorithmic modifications, pipeline utilization, data choreography, and decision making precision. The winning paradigm of Big Data ExaScale?
  • 53. Algorithmic Changes: Data Dependencies. [Figure: dependency chains through OP and OP' over PSI[0] … PSI[N-1] with coefficients cbeta[0] … cbeta[N-2]] Example generated by Sasa Stojanovic (Gross-Pitaevskii)
  • 54. Pipeline Changes: Higher Efficiency. [Figure: pipeline stages [0,0] through [7,0] with inputs X[0,0], X[0,1] and result R[0,0]] Example generated by Sasa Stojanovic (Gross-Pitaevskii)
  • 55. Data Rechoreography: Pipeline Utilization. [Figure: order of data accesses inside of a burst] Example generated by Sasa Stojanovic (Gross-Pitaevskii)
  • 56. Fixed Point: Savings Reinvestable. Consider fixed point compared to single precision floating point. If the range is tightly confined, one could use 24-bit fixed point. If data has a wider range, may need 32-bit fixed point. Add: hwFloat(8,24) 500 LUTs, hwFix(24,...) 24 LUTs, hwFix(32,...) 32 LUTs. Multiply: hwFloat(8,24) 2 DSPs, hwFix(24,...) 2 DSPs, hwFix(32,...) 4 DSPs. Arithmetic is not 100% of the chip. In practice, often ~5x performance boost from fixed point.
  • 57. Revisiting the Top 500 SuperComputers benchmarks: our paper in Communications of the ACM. Revisiting all major Big Data DM algorithms: massive static parallelism at low clock frequencies. Concurrency and communication: concurrency between millions of tiny cores is difficult; “jitter” between cores will harm performance at synchronization points. Reliability and fault tolerance: 10-100x fewer nodes, failures much less often. Memory bandwidth and FLOP/byte ratio: optimize data choreography, data movement, and the algorithmic computation. New architecture of n-Programming paradigms.
  • 58. FP7: RoMoL@BCN. The SAB goal: Out of box thinking!
  • 59. FP7: BalCon@SRB. The vision of Alkis Konstantellos. The SAB goal: Seed for new proposals!
  • 61. DAFNE = South (MaxCode) + North (BigData). South: MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI. North: UK, Sweden, Norway, Denmark, Germany, France, Austria, Switzerland, Poland, Hungary.
  • 63. The TriPeak @ DATAMAN: Siena + BSC + Imperial College + Maxeler + Belgrade
  • 64. The TriPeak: Essence. MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM). Maxeler = A FineGrain DataFlow (FPGA). How about a happy marriage? MontBlanc (ompSS) and Maxeler (an accelerator). In each happy marriage, it is known who does what :) The Big Data DM algorithms: What part goes to MontBlanc and what to Maxeler?
  • 65. TriPeak: Core of the Symbiotic Success. An intelligent DM algorithmic scheduler, partially implemented for compile time, and partially for run time. At compile time: Checking what part of code fits where (MontBlanc or Maxeler): LoC 1M vs 2K vs 20K. At run time: Rechecking the compile time decision, based on the current data values.
  • 67. Maxeler: Research (Google: good method). Structure of a Typical Research Paper: Scenario #1 [Comparison of Platforms for One Algorithm]. Curve A: MultiCore of approximately the same PurchasePrice; Curve B: ManyCore of approximately the same PurchasePrice; Curve C: Maxeler after a direct algorithm migration; Curve D: Maxeler after algorithmic improvements; Curve E: Maxeler after data choreography; Curve F: Maxeler after precision modifications. Structure of a Typical Research Paper: Scenario #2 [Ranking of Algorithms for One Application]. CurveSet A: Comparison of Algorithms on a MultiCore; CurveSet B: Comparison of Algorithms on a ManyCore; CurveSet C: Comparison on Maxeler, after a direct algorithm migration; CurveSet D: Comparison on Maxeler, after algorithmic improvements; CurveSet E: Comparison on Maxeler, after data choreography; CurveSet F: Comparison on Maxeler, after precision modifications.
  • 68. Maxeler Research in Serbia: Special Issue of IPSI Transactions Journal. KG: Blood Flow, Tijana Djukic and Prof. Filipovic; NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic; MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic; ETF: Meteorology, Radomir Radojicic and Marko Stankovic; ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic; ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic.
  • 69. Maxeler Research WorldWide: Special Issue of Advances in Computers @ SCI. Stanford, Texas, Imperial, Maxeler, ETF, MF, MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, …
  • 71. Maxeler: Teaching (Google: prof vm) VLSI, PowerPoints, Maxeler: TEACHING, Maxeler Veljko Explanations, August 2012 Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012 Maxeler Forbes Article Flyer by JP Morgan Flyer by Maxeler HPC Tutorial Slides by Sasha and Veljko: Practice (Current Update) Paper, unconditionally accepted for Advances in Computers by Elsevier Paper, unconditionally accepted for Communications of the ACM Tutorial Slides by Oskar: Theory (7 parts) Slides by Jacob, New York Slides by Jacob, Alabama Slides by Sasha: Practice (Current Update) Maxeler in Meteorology Maxeler in Mathematics Examples generated in Belgrade and Worldwide THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example
  • 72. Maxeler PreConference Tutorials (2013). Google: IEEE HiPEAC, Berlin, Germany, January 2013; ACM iSAC, Coimbra, Portugal, March 2013; IEEE MECO, Budva, Montenegro, June 2013; ACM ISCA, Tel Aviv, Israel, June 2013.
  • 73. Maxeler InHouse Tutorials (2013)
  • 75. Maxeler University Program Members
  • 76. How to Become a Family Member? Options to consider: a. MAX-UP free of charge; b. Purchasing a university-level machine (min about $10K); c. Purchasing a JPM-level machine (slowly approaching $100M), or at least a Schlumberger-level machine (slowly moving above $10M).
  • 77. Good to Know! Maxeler employs close to 100 people, GBR and USA: a. Maxeler cash burn per year = about $10M. b. If a university-level machine is sold at the 100% profit margin, the company life of Maxeler is extended for about 2 hours. c. If a university-level machine is sold at the 1% profit margin, the company life of Maxeler is extended for 1 minute. Our past or ongoing FP7 projects requiring Maxeler speeds: a. ProSense; b. ARTreat; c. HiPEAC.
  • 78. The Educational Mission. Important note: a. Total number of accredited universities in the whole world? b. As per WeboMetrics, about 20000. c. Consequently, all universities of the world together bring only 20000 minutes of extra life, or about two weeks of extra life. The reality: a. University-level machines are sold at the ZERO profit margin! b. Only the Xilinx costs, handling, and shipping. c. Email support for students doing a thesis is practically unlimited! Conclusion: This is a chance for those who jump in first :)
  • 79. Our Work Impacting Maxeler. Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact factor 2.205/2010). Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor 2.205/2010). Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227. Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040 (impact factor 1.822/2010).
  • 80. Maxeler Impacting Our Work. Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of Interdisciplinary Education On Technology-driven Application Design, IEEE Transactions on Education, August 2011, pp. 462-470 (impact factor 1.328/2010). Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM Transactions on Storage (TOS), Volume 7 Issue 1, June 2011, ACM New York, NY, USA. Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix Multiplication, IEE Computers & Digital Techniques, 2012, 6, (4), pp. 249-256. Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to Petadata per Unit of Time and Power (On Sophisticated Benchmarks), Communications of the ACM, May 2013 (impact factor 1.919/2010).
  • 81. Current Main Efforts of Maxeler. 1. To encourage a lot of software to be written/ported. This is a key business opportunity that needs to be developed. 2. Maxeler is building up a website and a community to share software for DFEs. This would allow the software to also be sold directly from the Maxeler website. 3. If a PhD student ports an important piece of software to a Maxeler machine, she/he could become the first software vendor in the world for dataflow computers, and Maxeler would be happy to help sell licenses.
  • 82. Current Side Efforts of Maxeler. 1. Developing new tools for easier making of kernels. 2. Bringing new languages to Maxeler: C, C++, MATLAB, Mathematica. 3. Porting popular application packages to Maxeler: OpenSees, etc... 4. Trying the Tabula FPGA! 5. Getting more than 1TeraByte/sec thru I/O. 6. Minimizing the hardware, so it can go into Galaxy 5,6…
  • 83. NewTools: MaxSkins. [Figure: developer side: Dataflow Design (.maxj), Custom Engine Interfaces (.c), MaxCompiler, .max file, Testing / Application integration, MaxCompiler App Packager; user side: .max file, App Installer, SLiC level programming from MATLAB (.mex, .m), C/C++, R, Excel, Python]
  • 84. Getting Started: Practical Work from the Linux Shell. 1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal). 2. Connect to the Maxeler machine (e.g., $ ssh root@147.91.12.216). 3. If more shell screens are needed, start screen (e.g., $ screen). 4. Switch to the directory that contains the 2n+3 programs you wrote (e.g., $ cd Desktop/workspace/src/ind/z88/). 5. Prepare your C code for measuring the execution time (e.g., clock_gettime(CLOCK_REALTIME, &t2);). 6. See what you can do (e.g., $ make). 7. Select one of those that you can do (e.g., $ make build-sim, $ make run-sim, $ make build-hw, $ make run-hw). 8. Measure the power consumption at the wall plug.
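
Step 5 of slide 84 mentions clock_gettime for measuring execution time. A minimal sketch of how the host-side C code might bracket the measured region is given below. The helper names elapsed_seconds, time_call, and dummy_work are hypothetical and do not come from the slides, and CLOCK_MONOTONIC is swapped in for the slide's CLOCK_REALTIME as an assumption, since it is not affected by wall-clock adjustments.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: seconds elapsed between two timestamps. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: run work(arg) once and return its duration in seconds. */
static double time_call(void (*work)(void *), void *arg)
{
    struct timespec t1, t2;
    int rc;

    rc = clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(rc == 0);
    work(arg);                 /* stands in for the call that runs the kernel */
    rc = clock_gettime(CLOCK_MONOTONIC, &t2);
    assert(rc == 0);
    (void)rc;                  /* silences the unused warning when NDEBUG is set */
    return elapsed_seconds(t1, t2);
}

/* Placeholder workload standing in for the real computation. */
static void dummy_work(void *arg)
{
    volatile double *sink = arg;   /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < 1000000L; i++)
        *sink += (double)i;
}
```

Calling time_call(dummy_work, &sink) with a double sink returns the duration of one run. Timing the simulator and the hardware builds with the same wrapper would keep the two measurements comparable.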

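
The execution-time formulas on slide 14 can be sketched as two small C functions. The slide does not define its symbols, so the reading used here is an assumption: N is the number of data items, NOPS the operations per item, C the clock cycles per operation, Tclk the clock period, Ncores the number of cores, and NDF the number of parallel dataflow pipelines. The function names are illustrative only.

```c
#include <assert.h>

/* Slide 14, CPU and GPU form: t = N * NOPS * C * Tclk / Ncores.
 * Parameter meanings are assumed, see the lead-in above. */
static double t_control_flow(double n, double n_ops, double cycles_per_op,
                             double t_clk, double n_cores)
{
    assert(n_cores > 0.0);
    return n * n_ops * cycles_per_op * t_clk / n_cores;
}

/* Slide 14, dataflow form: t = NOPS * C * Tclk + (N - 1) * Tclk / NDF.
 * The first term is paid once; the second grows with the data size. */
static double t_data_flow(double n, double n_ops, double cycles_per_op,
                          double t_clk, double n_df)
{
    assert(n_df > 0.0);
    return n_ops * cycles_per_op * t_clk + (n - 1.0) * t_clk / n_df;
}
```

In the dataflow formula the NOPS term does not scale with N, so for large N the estimate is dominated by N * Tclk / NDF. Under this model, that is why a low clock frequency can still yield the speed-ups claimed on the earlier slides.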
Editor's Notes

  1. Elastic makes things worse