Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by Mike Muller, Chief Technology Officer, ARM

Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.

Published in: Technology

  1. Mike Muller, CTO: Is there anything new in heterogeneous computing?
  2. Evolution
     [Timeline figure, 1960 to 2020. Labels: Wearable, Intelligence, Mobile, Computing, PC, IOT, Embedded, Consumer, Smart Appliances, Computing, Cloud, Server. Year markers: 13, 82, 89, 93, 07, 10, 77, 97.]
  3. What’s the Innovation?
     [Figure labels: Wireless, 3G, MEMS, CCD, Media, Social Media?, Semiconductor Process?, GPS.]
  4. Mobility Trends: CMOS
     [Chart. Y axis: mobility in cm²/(V·s), log scale from 10 to 10,000. X axis: 1990 to 2025. Series: NMOS, PMOS.]
     • Device labels: Planar CMOS, Strain, HKMG, FinFET, HNW, VNW, III-V, Ge, NEMS, Switches, spintronics, 2D: C, MoS
     • Node labels: 14nm, 10nm, 7nm, 5nm, 3.5nm
     • Interconnect: Al wires, Cu wires, 3DIC, Opto I/O, Opto int, Graphene wire, CNT via
     • Patterning: SADP, LELE, SAQP, LELELE, EUV, EUV LELE, EUV + DWEB, EUV + DSA, Seq. 3D
  5. Printing: Moore’s Law and Ink Jets
     [Chart. Left axis: Drops/Second, log scale from 1E3 to 1E11. Right axis: 1/Size (pL⁻¹), log scale from 1E-3 to 1E1. X axis: 1980 to 2020. Annotations: 10 nozzles, 10,000 nozzles, 100’s microns, 10’s microns.]
  6. Printing and Imprinting Thin Film Transistors (TFT)
     • Can be transparent, bio-degradable and even ingestible
     • Unit cost 1000x less than mainstream CMOS
       • CMOS @ $40,000/m2 vs. TFT @ $10/m2
     • Printing
       • CAPEX can be less than $1,000
       • 350dpi = 200um @ 20 m/s
       • Can print batteries, antenna
       • Mainly organic at ~20 volts
     • Imprint
       • CAPEX: a $2M DVD press is high volume
       • Better controllability, hence higher density and performance
       • 1um today, scaling to the 50nm features used today for BluRay discs
       • Mainly inorganic, NMOS only, at ~2 volts
  7. Mobility Trends: CMOS & Thin Film Transistors
     [Chart. Y axis: mobility in cm²/(V·s), log scale from 0.00001 to 10000. X axis: 1990 to 2025. Series: Conventional NMOS, Conventional PMOS, TFT. Annotations: CPU; ARM1, 3µ, 6MHz; CortexM0, 2µ, 20kHz.]
  8. Top Right and Bottom Left
  9. Is There Anything New in Heterogeneous Computing?

                               Vector Add   Reduction   Matrix Mul
     GPU OpenCL on GPU               1.00        1.00         1.00
     GPU OpenCL on FPGA              0.14        0.02         0.89
     FPGA OpenCL on FPGA             1.71        1.62        31.85

     • 1998: Manual Partitioning, C & Assembler, ARM + DSP
     • 2013: Manual Partitioning, C++ & OpenCL/RenderScript, ARM + GPU
  10. How Do People Program?
     [Figure labels: ~20M Programmers, Web, Mobile, Embedded, ~200k, Desktop.]
     • Simple, old-school ray tracer
     • Start with C++ code and accelerate the code with Heterogeneous Systems

         void traceScreen() {
           for(y = 0; y < height; ++y) {
             for(x = 0; x < width; ++x){
               Ray ray = generateRay(x, y);
               IntersectableObject *obj = traceRay(ray);
               framebuffer[y][x] = colorPixelForObject(obj);
             }
           }
         }

         void traceScreen() {
           par_for_2D(height, width, [&](int y, int x) {
             Ray ray = generateRay(x, y);
             IntersectableObject *obj = traceRay(ray);
             framebuffer[y][x] = colorPixelForObject(obj);
           });
         }
  11. Moving the Code onto OpenCL 1.x
     • Need to make the following changes
       a) Get rid of all the pointers, both in scene vector and internally in CSGObject
       b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
       c) Get rid of the virtual function calls
       d) Change the classes to structs
       e) Get rid of recursion in CSGObject
       f) Avoid accessing the global scene variable in accelerated code
       g) Port the code base to OpenCL C
  12. Moving the Code onto OpenCL 2
     • Need to make the following changes
       a) Get rid of all the pointers, both in scene vector and internally in CSGObject
       b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
       c) Get rid of the virtual function calls
       d) Change the classes to structs
       e) Get rid of recursion in CSGObject
       f) Avoid accessing the global scene variable in accelerated code
       g) Port the code base to OpenCL C
     • OpenCL 2 solves point a) with shared address space, but not the rest
  13. Moving the Code onto C++ AMP
     • Need to make the following changes
       a) Get rid of all the pointers, both in scene vector and internally in CSGObject
       b) Rewrite the use of std::vector, as C++ AMP cannot call into C++ standard library
       c) Get rid of the virtual function calls
       d) Change the classes to structs
       e) Get rid of recursion in CSGObject
       f) Avoid accessing the global scene variable in accelerated code
       g) Port the code base to OpenCL C
     • C++ AMP solves points d), f) and g), but not the rest
  14. Moving the Code onto HSA
     • Need to make the following changes
       a) Get rid of all the pointers, both in scene vector and internally in CSGObject
       b) Rewrite the use of std::vector, as HSAIL does not understand C++ data type internals
       c) Get rid of the virtual function calls
       d) Change the classes to structs
       e) Get rid of recursion in CSGObject
       f) Avoid accessing the global scene variable in accelerated code
       g) Port the code base to a language on top of HSAIL
     • HSA solves points a), c), d), e) and soon f)
  15. What Makes GPUs Good For Power Efficient Compute?
     • Relaxed single-threaded performance
       • No dynamic scheduling
       • No branch prediction
       • No register renaming, no result forwarding
       • Longer pipelines
       • Lower clock frequencies
     • Multi-threading
       • Tolerate long latencies to memory
     • Increasing the ALU/control ratio
       • Short-vectors exposed to programmers
       • SIMT/Warp/VLIW/Wavefront based execution
  16. Heterogeneous Compute, Homogeneous Architecture
     [Diagram labels: big, LITTLE, Integer Pipe, FP Pipe, Load/Store Pipe, Write, SIMT Queue, RESEARCH, Throughput.]
     • How about a SIMTish ARM?
     • Familiar programming model, C++ and OpenMP
     • Fewer seams
     • Sharing data structures and function pointers/vtables
  17. Moving the Code onto a Warped ARM
     • Need to make the following changes
       • Get rid of all the pointers, both in scene vector and internally in CSGObject
       • Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
       • Get rid of the virtual function calls
       • Change the classes to structs
       • Get rid of recursion in CSGObject
       • Avoid accessing the global scene variable in accelerated code
       • Port the code base to OpenCL C
  18. Performance vs Effort
     • We’ve implemented SGEMM, a matrix-matrix multiplication benchmark, in various ways, to investigate the tradeoff between programmer effort and performance payoff

     SGEMM version                                   Speedup   Effort
     ARM in C                                        1x        Low
     ARM in C with NEON intrinsics, prefetching      15x       Medium - High
     ARM in assembly with NEON, prefetching          26x       High
     SIMTish ARM in C                                35x       Low
     SIMTish ARM in C, unrolled                      44x       Low - Medium
     Mali GPU x 4 way                                136x      High
  19. Scale Needs Standards
  20. Works for geeks…
     • No proper orchestration
     • Battle for the apps platform
     • Needs home IT support
     • Or only single manufacturer
     [Figure labels: IPv4, Sonosnet, IPv6.]
     Imagine that there were 1,000 of these connected devices…
  21. Functional Becomes the Internet of Things
     [Figure labels: Functional, Little Data.]
  22. [Diagram labels: Mike, My Data, X, Gym, X, Life Insurance, !, Their Data, Car Insurance.]
     Photo: Rob Curtis, Haymakers, Cambridge. Picture by Keith Jones.
  23. Sharing Needs Trust
  24. IOT Medical Devices
     • First implantable Pacemaker 1958
     • Can a pacemaker be hacked to kill?
     • Or just a plot line in US TV series
     • RF interface for adjusting settings
       • First hacked in 2008
         • “Sustained effort by a team of specialists” – The New York Times
         • Range a few cm
       • Today
         • MIT grad students
         • One weekend
         • Range 50 feet
  25. Trust Needs Security
  26. It’s a Heterogeneous Future
     [Diagram labels: Reach, The future, Open Data and Objects, Scale Needs Standards, Sharing Needs Trust, Trust Needs Security, Applications, Mobile internet, Internet / broadband, M2M, SaaS, Fixed Telephony, Networks, Smart Everything, Sensors & Actuators, Networks, Today, Mobile Telephony.]
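
Slide 10 calls par_for_2D without showing how it is built. As a rough illustration only, here is a minimal sketch of what such a helper might look like on an ordinary multicore CPU. The row-band partitioning and the use of std::thread are assumptions made for this example; the slides do not say how the helper is implemented, and a heterogeneous runtime would dispatch the body very differently.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical CPU-only stand-in for the par_for_2D helper named on slide 10.
// Assumption: rows are split into contiguous bands, one band per worker thread.
template <typename Body>
void par_for_2D(int height, int width, Body body) {
    if (height <= 0 || width <= 0) return;

    // hardware_concurrency() may return 0 when the count is unknown.
    unsigned hw = std::thread::hardware_concurrency();
    int workers = std::min(static_cast<int>(hw == 0 ? 1 : hw), height);
    int rowsPerWorker = (height + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        int yBegin = w * rowsPerWorker;
        int yEnd = std::min(height, yBegin + rowsPerWorker);
        if (yBegin >= yEnd) break;  // no rows left for further workers
        pool.emplace_back([=, &body] {
            for (int y = yBegin; y < yEnd; ++y)
                for (int x = 0; x < width; ++x)
                    body(y, x);  // same (y, x) argument order as on the slide
        });
    }
    for (auto &t : pool) t.join();  // body outlives every worker
}
```

With a helper of this shape, the second traceScreen on slide 10 should work as written, provided each call to the body touches only its own framebuffer element and the functions it calls are safe to run concurrently.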

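
Slides 11 to 14 list the changes needed before the ray tracer can run as OpenCL C: no pointers inside the data, no virtual calls, structs instead of classes, no recursion. The sketch below shows one way those four changes can look for a CSG tree. It is not the talk's code: FlatNode, NodeKind and containsPoint are hypothetical names, the post-order layout is an assumption chosen so that an explicit stack can replace recursion, and a point-in-solid test stands in for the real ray intersection to keep the example short.

```cpp
#include <cassert>

// Hypothetical flattened CSG node: a plain struct with a type tag,
// no internal pointers and no vtable (changes a, c and d on the slides).
enum class NodeKind { Sphere, Union, Intersection, Difference };

struct FlatNode {
    NodeKind kind;
    float cx, cy, cz, radius;  // only meaningful when kind == Sphere
};

// Assumption: nodes are stored in post-order (both children before their
// parent), so the tree is evaluated with a small explicit stack instead of
// recursion (change e).
bool containsPoint(const FlatNode *nodes, int count,
                   float px, float py, float pz) {
    const int kMaxDepth = 32;  // assumed upper bound on pending operands
    bool stack[kMaxDepth];
    int top = 0;

    for (int i = 0; i < count; ++i) {
        const FlatNode &n = nodes[i];
        if (n.kind == NodeKind::Sphere) {
            if (top >= kMaxDepth) return false;  // tree too deep for this sketch
            float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
            stack[top++] = dx * dx + dy * dy + dz * dz <= n.radius * n.radius;
        } else {
            if (top < 2) return false;  // malformed node sequence
            bool b = stack[--top];
            bool a = stack[--top];
            switch (n.kind) {  // tag dispatch replaces the virtual call
                case NodeKind::Union:        stack[top++] = a || b; break;
                case NodeKind::Intersection: stack[top++] = a && b; break;
                default:                     stack[top++] = a && !b; break;  // Difference
            }
        }
    }
    return top == 1 && stack[0];
}
```

The function body uses only fixed-size arrays, a loop and a switch, which is the kind of subset the OpenCL 1.x list pushes the code towards. The array of nodes is still passed by pointer, but the data itself no longer contains any.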
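
Slide 18 compares SGEMM implementations against a plain C baseline but does not show any of them. For orientation, a baseline of that kind is usually a triple loop like the sketch below. SGEMM computes C = alpha * A * B + beta * C in single precision; the row-major layout, the function name sgemmNaive and the argument order are assumptions made for this example, not the benchmark code used in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive SGEMM sketch: C (M x N) = alpha * A (M x K) * B (K x N) + beta * C.
// Assumption: all three matrices are dense and stored row-major.
void sgemmNaive(std::size_t M, std::size_t N, std::size_t K,
                float alpha, const float *A, const float *B,
                float beta, float *C) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            // Dot product of row i of A with column j of B.
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

The inner loop walks B with a stride of N elements, which is cache-unfriendly. The faster rows in the slide's table (NEON intrinsics, prefetching, unrolling) are the usual ways of addressing that, at the cost of the extra effort the table records.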