HC-4018, How to make the most of GPU accessible memory, by Paul Blinzer
1. BEING SPECIAL IN A UNIFIED MEMORY WORLD
HOW TO MAKE THE MOST OF GPU ACCESSIBLE MEMORY
PAUL BLINZER, FELLOW, SYSTEM SOFTWARE, AMD
2. THE AGENDA
! What's so special about dealing with memory and a GPU?
‒ The programmer's view of memory
‒ Throwing a GPU into the mix
‒ How do today's systems deal with GPU memory access?
! The many different "types" of memory today and ways to access them
‒ The various places to find them and how best to use them
‒ What changes with HSA and hUMA?
‒ Why the "buffered" view of memory is still important and how to deal with it
! Where to find more information?
! Q & A
3. WHAT'S SO SPECIAL ABOUT MEMORY ACCESS WITH A GPU?
THERE ARE SO MANY DIFFERENT TYPES, BUSES AND CACHES INVOLVED…
[Block diagram: an Accelerated Processing Unit (APU) with CPU cores (DC (L1); IC, FPU, L2; L3) and an on-die GPU (1..N compute units, each an H-CU engine with TU, L1 (TC) and LDS, plus instruction cache, constant cache, global data share and L2 cache) sharing one memory controller to DDR3 system memory through the HSA MMU (IOMMUv2); a discrete GPU with the same compute-unit structure attached over PCIe, with its own memory controller and GDDR5 memory; cached and non-cacheable access paths are marked. Legend: LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache]
4. THE TYPICAL APPLICATION'S VIEW OF MEMORY (1)
A "GEDANKENEXPERIMENT", COMBINING EINSTEIN AND TRON: IMAGINE YOU ARE A CPU CORE EXECUTING AN APPLICATION THREAD, ACCESSING DATA…
! Today's operating systems have an application model based on a user process view of the system
‒ Each application is associated with a process, and the OS isolates the address space of one process from any other on the system; this is enforced by hardware (MMU = "Memory Management Unit")
‒ The application code has a "flat" view of memory: it can allocate memory from the OS, write & read data at that address, etc.
‒ Each CPU core may operate independently on a "thread" within that process
‒ The address may be represented by a 32-bit or 64-bit (44/48-bit) wide pointer value
‒ The memory content may not even be resident in physical memory; it is paged in from backup storage when accessed, maybe pushing other content out
‒ CPU caches keep an often used "working set" of data close to the CPU core's execution units
‒ CPU cache coherency mechanisms invalidate cache content when "outside forces" (typically other CPU cores) update the content of system memory at a given address, ensuring that each CPU core sees the same data
[Figure: per-process virtual address spaces (Process1, Process2) with a user process space up to 2^47-1, a noncanonical VA range, and a kernel mode address space up to 2^64-1; pages (e.g. at 0x12340000, 0x78900000) are mapped via the CPU MMU into the system physical memory space (up to 2^44-1); allocations, including a GPU buffer and the FB aperture, are managed by the OS]
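As a tiny illustrative aside (not from the original slides): from the application's point of view, memory is just a flat range of addresses that any thread in the process can read and write; paging and cache coherency are handled underneath by the OS and the hardware. The helper name producer below is purely for illustration.

```c
/* Illustrative only: the CPU-side "flat" view of memory. Any core/thread in
 * the process can dereference the same pointer; hardware cache coherency
 * makes the update visible without explicit flushes. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *producer(void *arg)
{
    int *data = (int *)arg;        /* same virtual address as in main()  */
    for (int i = 0; i < 1024; ++i)
        data[i] = i;               /* writes land in this core's caches  */
    return NULL;
}

int main(void)
{
    int *data = malloc(1024 * sizeof(int));   /* OS-backed, demand-paged */
    pthread_t t;
    pthread_create(&t, NULL, producer, data);
    pthread_join(&t, NULL);
    printf("%d\n", data[1023]);    /* coherency: the reader sees 1023    */
    free(data);
    return 0;
}
```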
5. THE TYPICAL APPLICATION'S VIEW OF MEMORY (2)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…
! GPUs are typically managed as devices by operating systems:
‒ They can only access physical memory pages as far as the OS memory management is concerned, though the GPU may use "virtual addresses"
‒ GPU accessible system memory is "page-locked" and can't move while the memory may be accessible by the GPU, even if it's currently not used at all
‒ The total amount of memory a GPU can access at a time is limited to the amount of page-locked memory or frame buffer memory
! GPU accessible memory allocations are handled via special APIs (DirectX, OpenGL, OpenCL, etc.)
‒ The memory is managed as single objects (buffers, resources, textures, …); "malloc()-ed" memory is typically not directly accessible by the GPU
‒ CreateResource(), CreateBuffer(), CreateTexture()…
‒ The API typically only provides a "handle" referencing the object
‒ To access the memory content (all or part of it), an API provides functions like MapResourceView(), Lock(), Unlock() or similar, establishing "windows" in the address space to that memory for either GPU or CPU, or putting the content into staging buffers (see the sketch after this slide)
‒ Consider the resource "handle value + offset" as just a special kind of "address" outside of the regular process address space ☺
[Figure: the process VA space (CPU) from before, now alongside a GPU virtual address space; page-locked system memory pages are mapped via the GPU MMU, GPU physical memory (e.g. a discrete frame buffer) is reached through the FB aperture, graphics allocations are managed by the gfx driver while regular allocations remain managed by the OS]
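OpenCL is one of the APIs named above; the following host-side sketch (illustrative only, the helper fill_buffer is not a real API) shows the handle-plus-map pattern: clCreateBuffer() returns an opaque cl_mem handle, and the CPU only sees the content through a temporary mapped "window", much like Lock()/Unlock() or MapResourceView() in other APIs.

```c
/* Illustrative sketch: the buffer/handle + map/unmap pattern in OpenCL.
 * Error handling is omitted; context/queue setup is assumed. */
#include <CL/cl.h>
#include <string.h>

void fill_buffer(cl_context ctx, cl_command_queue queue,
                 size_t size, const void *src)
{
    cl_int err;

    /* The API hands back an opaque handle, not a pointer into the
     * process address space ("handle + offset" as a special address). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

    /* Mapping establishes a temporary CPU-visible "window" onto the object. */
    void *window = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);
    memcpy(window, src, size);

    /* Unmapping closes the window; afterwards the content is again only
     * reachable through the handle (e.g. as a kernel argument). */
    clEnqueueUnmapMemObject(queue, buf, window, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```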
6. THE TYPICAL APPLICATION'S VIEW OF MEMORY (3)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…
! Data visibility (cache coherency) is typically software-managed
‒ CPU cache coherency, when accessing system memory potentially updated by a GPU, may not always be guaranteed, depending on the system configuration (e.g. PCIe bus access)
‒ GPU caches are typically explicitly managed by the driver and need to be refreshed when the CPU updates memory content
‒ One reason is the hardware complexity required to make this performant
‒ Depending on the use scenario, the GPU accessible memory is mapped as "writethrough", "uncached" or "writecombined" by the OS APIs
‒ The good thing about API controlled access is that the OS & driver can copy the content to someplace else and/or into a different format, where it can be more efficiently stored or processed (e.g. 2D tiling)
‒ The bad thing about it is that it's an either/or style of access
‒ For frequent accesses from both CPU & GPU, the translation can be tediously slow
‒ Content that can be accessed by both CPU and GPU simultaneously needs data visibility/coherency rules, leading to the next issue…
[Figure: the same address layout as before — process VA space (CPU) and GPU virtual address space, pages mapped via the GPU MMU and CPU MMU into system physical memory and into GPU physical memory (e.g. a discrete frame buffer via the FB aperture), with graphics allocations managed by the gfx driver and regular allocations managed by the OS]
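As an illustration of the mapping modes mentioned above (OpenCL host code; the helper name and the assumption that the allocation is actually "zero copy" are mine — the behavior is platform- and driver-dependent), CL_MEM_ALLOC_HOST_PTR asks the runtime for host-accessible, typically page-locked memory that both sides can reach without an extra staging copy.

```c
/* Sketch: requesting host-accessible ("zero copy" capable) memory in OpenCL.
 * Whether it ends up cached, uncached or write-combined, and whether access
 * is truly zero-copy, depends on the platform and driver. */
#include <CL/cl.h>

cl_mem make_host_visible_buffer(cl_context ctx, cl_command_queue queue,
                                size_t size, cl_int *err)
{
    /* Ask the runtime to allocate memory the host can map directly;
     * on many APU platforms this avoids a staging copy. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, err);

    /* The CPU still goes through an explicit map/unmap "window"; the
     * runtime performs whatever cache maintenance the mapping type needs. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                 CL_MAP_WRITE_INVALIDATE_REGION,
                                 0, size, 0, NULL, NULL, err);
    /* ... fill p with input data ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```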
7. IT'S ALL ABOUT THROUGHPUT, BANDWIDTH AND LATENCY…
KEEP YOUR DATA CLOSE AND YOUR FREQUENTLY USED DATA EVEN CLOSER…
[Block diagram: same APU + discrete GPU topology as before, annotated with approximate numbers — caches deliver 100's GB/s (CPU) to 100's-1000's GB/s (GPU) at latencies of <1 to 10's of cycles; DDR3-2133 system memory ~17 GB/s; GDDR5 at 3 GHz MCLK ~90 GB/s; x16 PCI-E 3.0 between host and discrete GPU ~15 GB/s; memory and bus accesses cost 10's-100's of cycles of latency. Legend: LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache]
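A rough back-of-the-envelope calculation using the approximate figures above (real transfers also pay latency, synchronization and driver overhead) shows why the PCIe hop dominates for discrete GPUs:

```c
/* Sketch: rough transfer-time estimates from the bandwidth figures above. */
#include <stdio.h>

int main(void)
{
    double size_gb   = 0.256;  /* 256 MB working set            */
    double pcie_gbps = 15.0;   /* x16 PCI-E 3.0, approx.        */
    double ddr3_gbps = 17.0;   /* DDR3-2133, approx.            */
    double gddr_gbps = 90.0;   /* GDDR5 @ 3 GHz MCLK, approx.   */

    printf("over PCIe : %.1f ms\n", 1e3 * size_gb / pcie_gbps); /* ~17 ms  */
    printf("from DDR3 : %.1f ms\n", 1e3 * size_gb / ddr3_gbps); /* ~15 ms  */
    printf("from GDDR5: %.1f ms\n", 1e3 * size_gb / gddr_gbps); /* ~2.8 ms */
    return 0;
}
```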
8. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (1)
! The efficient use of a GPU & CPU in a system depends on understanding their operation on memory
‒ The cache architecture on either CPU or GPU is a reflection of the different access patterns for their "preferred" workloads and data, and so is the cache management/optimization
! CPUs are typically built to operate on general purpose, serial instruction threads, often with high data locality, lots of conditional execution and dealing with data interdependency
‒ The CPU cache hierarchy is focused on general purpose data access from/to the execution units, feeding back previously computed data to the execution units with very low latency
‒ Comparatively few registers (vs GPUs), but large caches keep often used "arbitrary" data close to the execution units
! GPUs are usually built for a SIMD execution model
‒ Apply the same sequence of instructions over and over on data with little variation but high throughput ("streaming data"), passing the data from one processing stage to another (latency tolerance)
‒ Compute units have a relatively large register file store
‒ Using a lot of "specialty caches" (constant cache, texture cache, etc.) and data caches optimized for SW data prefetch
‒ LDS, GDS mainly used for in-wavefront or inter-wavefront updates & synchronization
‒ Data caches are typically explicitly flushed by software
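To make the SIMD/LDS point concrete, here is a small illustrative OpenCL C kernel (not from the deck; reverse_in_tiles is a made-up example): each work-group stages data in __local memory, which maps to the LDS on AMD hardware, and synchronizes with a barrier before reusing it.

```c
/* Illustrative OpenCL C kernel: the same instruction sequence runs across
 * many work-items (SIMD/streaming style); __local memory maps to the LDS and
 * is used for intra-work-group sharing with explicit synchronization. */
__kernel void reverse_in_tiles(__global const float *in,
                               __global float *out,
                               __local float *tile)            /* LDS staging */
{
    const size_t lid  = get_local_id(0);
    const size_t lsz  = get_local_size(0);
    const size_t base = get_group_id(0) * lsz;

    tile[lid] = in[base + lid];            /* stream data into the LDS       */
    barrier(CLK_LOCAL_MEM_FENCE);          /* work-group sync point          */
    out[base + lid] = tile[lsz - 1 - lid]; /* reuse the staged data          */
}
```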
9. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (2)
! The GPU memory & cache access design is well-suited for typical 2D & 3D graphics workloads (duh!)
‒ Vertex data, textures, etc. are passed from the host to the various stages of the graphics API pipeline, with each stage allowing processing of the data passing through via appropriate instruction sequences ("shaders")
‒ Since a lot of the data is "static" and the access is abstracted via APIs, it can be put into better suited data formats, mapping 2D/3D pixel coordinate "locality" to memory locality in internal buffers within the graphics pipeline (a simplified sketch follows after this slide)
‒ Very beneficial for performance, but not easily "accessible" by simple addressing schemes; it requires a copy of the data first
‒ Today's graphics APIs (OpenGL, Direct3D) are well suited for this workload, but often must focus on the lowest-common denominator in hardware capabilities
‒ The API design assumes that no cache coherency between CPU and GPU may exist, requiring the CPU to issue explicit cache flushes or operate on memory areas mapped as "uncached" if readback of GPU data is required
‒ Some extensions or recently introduced features allow for "zero copy" memory
[Figure: 2D tiling — a surface is split into 16x16 tiles along the X and Y coordinates; the pixels of one tile, (X0,Y0) … (X15,Y15), are stored at consecutive memory addresses so that 2D locality becomes memory locality]
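As a purely illustrative sketch of the idea (real GPU tiling formats are hardware-specific and considerably more involved), the following maps 2D coordinates into 16x16 tiles so that neighboring pixels land near each other in memory; tiled_offset and the simplifying width assumption are mine:

```c
/* Sketch of simple 16x16 tiling: pixels within a tile are stored contiguously,
 * so 2D locality translates into memory locality. Real hardware tiling formats
 * add bank/channel interleaving and other hardware-specific twists. */
#include <stddef.h>

#define TILE 16

/* Linear offset of pixel (x, y) in a tiled surface that is 'width' pixels
 * wide (width assumed to be a multiple of TILE for brevity). */
static size_t tiled_offset(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE;
    size_t tile_index    = (y / TILE) * tiles_per_row + (x / TILE);
    size_t within_tile   = (y % TILE) * TILE + (x % TILE);
    return tile_index * (TILE * TILE) + within_tile;
}
```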
10. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (3)
! Vector/matrix-oriented compute workloads map well to GPUs, but until now "suffer" from some of the choices that benefit the graphics data processing flow
‒ Compute APIs like OpenCL™ or DirectCompute are often still inherently tied to the low-level, graphics-focused GPU infrastructure in today's OS (e.g. memory management through Microsoft® WDDM, Linux® TTM/GEM)
‒ "Zero copy" support and system memory buffer cache coherency in recent APIs improve the behavior on some platforms that have appropriate support, but there is still some SW overhead for access
‒ All the memory processed by the GPU is referenced through handles to control memory page-lock on workload dispatch, and the SW needs to create "buffer views" either explicitly or under the covers to access regular memory
‒ There is quite some SW overhead involved in that
! Discrete GPUs have excellent compute performance (several TeraFLOPS for even mid-range cards)
‒ But they require the data to be accessible in local memory for best performance, requiring copy operations from host memory and "keeping the data on the other side" as long as possible (sketched after this slide)
‒ Accessing or pushing the data back and forth through the PCIe bottleneck may reduce or eliminate speedup gains, or increase access latency from the host substantially
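A hedged host-side sketch of the "keep the data on the other side" advice (illustrative OpenCL; run_pipeline, stage1 and stage2 are placeholder names): copy the input across PCIe once, run the whole kernel chain on device-resident buffers, and read back only the final result.

```c
/* Sketch: minimize PCIe traffic by keeping intermediate data device-resident.
 * 'stage1' and 'stage2' are placeholder kernels; setup and errors omitted. */
#include <CL/cl.h>

void run_pipeline(cl_command_queue q, cl_context ctx,
                  cl_kernel stage1, cl_kernel stage2,
                  const float *input, float *result, size_t n)
{
    size_t bytes = n * sizeof(float);
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem tmp = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    /* One host->device copy over PCIe for the whole pipeline. */
    clEnqueueWriteBuffer(q, in, CL_FALSE, 0, bytes, input, 0, NULL, NULL);

    clSetKernelArg(stage1, 0, sizeof(cl_mem), &in);
    clSetKernelArg(stage1, 1, sizeof(cl_mem), &tmp);
    clEnqueueNDRangeKernel(q, stage1, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Intermediate data ('tmp') never crosses the bus. */
    clSetKernelArg(stage2, 0, sizeof(cl_mem), &tmp);
    clSetKernelArg(stage2, 1, sizeof(cl_mem), &out);
    clEnqueueNDRangeKernel(q, stage2, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* One device->host copy for the final result. */
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, bytes, result, 0, NULL, NULL);

    clReleaseMemObject(in); clReleaseMemObject(tmp); clReleaseMemObject(out);
}
```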
11. HOW DO HUMA AND HSA CHANGE THINGS?
! First, let's redraw the address layout map from before…
‒ It's the same layout, just a different visualization (focus on bit 47 ☺)
‒ The GPU's virtual address page table mapping is set to a process address view of the memory space
‒ A data pointer has the same "meaning" (= points to the same content) in system memory, also known as "ptr-is-ptr" (see the sketch after this slide)
‒ On an OS that supports the HSA MMU functionality, the page tables may even be shared and the OS may support native GPU demand paging
‒ The GPU may still support additional address ranges for special purposes (e.g. frame buffer memory, LDS, scratch, …)
‒ There is efficient hardware support for GPU & CPU cache coherency on memory load/store operations by the GPU
‒ Reads and updates of system memory from one will cause cache line flushes or line invalidation on the other processors in the system
‒ SW does not have to deal with explicit cache line flushes or invalidations for such transactions anymore; it works like for any CPU core in the system
‒ This fully works for APUs, where GPU and CPU have access to the same system memory controller; there is partial support for discrete GPUs
‒ Platform atomics are supported, for efficient synchronization
[Figure: the process VA space (CPU) and the GPU virtual address space now share the same layout — user process space up to 2^47-1, noncanonical VA range, kernel mode address space up to 2^64-1; pages are mapped via the HSA MMU and CPU MMU into the system physical memory space (up to 2^44-1) and, for special purposes, into GPU physical memory (e.g. a discrete frame buffer via the FB aperture); allocations are managed by the OS & gfx driver]
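At the API level the "ptr-is-ptr" model surfaces, for example, as shared virtual memory in OpenCL 2.0. The sketch below is illustrative (shared_array is a made-up helper), and fine-grained SVM is only available where the hardware and driver support it:

```c
/* Sketch: OpenCL 2.0 fine-grained shared virtual memory ("ptr-is-ptr").
 * The same pointer value is meaningful to both the CPU and the GPU kernel;
 * no explicit map/unmap or cache flush calls are needed on supporting HW. */
#include <CL/cl.h>

float *shared_array(cl_context ctx, cl_kernel kernel, size_t n)
{
    float *data = (float *)clSVMAlloc(ctx,
                      CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                      n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i)      /* CPU writes through the pointer  */
        data[i] = (float)i;

    /* The kernel receives the very same pointer value. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    return data;                        /* later: clSVMFree(ctx, data)     */
}
```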
12. THERE ARE STILL REASONS FOR THE "BUFFERED VIEW" OF MEMORY
! HSA and hUMA are very useful for compute jobs and for graphics data often updated by the host CPU
‒ Allows fine-grained "interactive" sharing of data between CPU and GPU threads without requiring prophylactic cache flushes and other synchronization
! But the "direct view" of and access to common memory is less beneficial for other graphics data
‒ Many graphics algorithms have been designed with an "abstract" or "deferred" view of memory, focusing on "dimensional addressing" of the data in the shaders (e.g. x/y/z, u/w coordinates)
‒ Many GPUs use hardware-specific texture tiling formats that are optimized for a specific memory channel layout to reach maximum performance; these are complicated to address by software in a general way
‒ An application may have multiple graphics contexts concurrently per process (per API), vs just one for "flat"
‒ A lot of graphics data (e.g. textures, vertices, et al.) is not changing often through CPU updates
‒ Requiring cache coherency increases HW access overhead for little benefit
‒ Many specialty resources (e.g. Z-buffer) have GPU-specific implementations with no "external" visibility
‒ Leveraging the much higher performance of a discrete GPU and its frame buffer memory is somewhat more complicated if an application needs to deal with the memory location directly
! Most common graphics APIs today don't know how to deal with virtual addresses
‒ This will change in the future as utilizing virtual addresses within graphics APIs becomes commonplace
13. GRAPHICS INTEROPERATION IS IMPORTANT
! There are many different graphics/GPU APIs in use, using buffers/resources to access memory
‒ As seen before, there are good reasons to keep the content in "buffers", either due to legacy or performance
‒ It also may not make sense to "waste" virtual address space, e.g. in 32-bit apps, on resources not accessed by the host
‒ But this may also make it harder to access the content from either the CPU or through a "flat addressing" aware GPU
! Explicit interoperation APIs to traditional graphics APIs provide two views of a resource (see the sketch after this slide)
‒ The translation between "handle + offset" and "flat address" is dealt with within the runtime and driver
‒ The translation itself may be straightforward and very efficient, however
! Specialty GPU resources (e.g. LDS, scratch) may be mapped into the "flat" process address space, but may not be accessible by the CPU host since they're not accessible from the "outside"
‒ This is no different than some other system memory mappings provided by the OS
! Applications should focus on efficient processing of the data on the "compute" side with a dedicated handover to the "graphics" side when appropriate
‒ As graphics APIs are updated over time to take advantage of flat addressing models (e.g. for "bindless textures"), the need for the interoperation mechanisms may gradually vanish for most graphics data
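One concrete form of such an interoperation API is the OpenCL/OpenGL sharing extension (cl_khr_gl_sharing). The sketch below is illustrative (process_gl_buffer is a made-up helper) and assumes the cl_context was created with GL sharing enabled; the same GL buffer object appears as a second view, a cl_mem handle, with explicit acquire/release as the handover points.

```c
/* Sketch: OpenCL <-> OpenGL interop via cl_khr_gl_sharing.
 * Assumes the cl_context was created against the current GL context. */
#include <CL/cl.h>
#include <CL/cl_gl.h>

void process_gl_buffer(cl_context ctx, cl_command_queue q,
                       cl_kernel kernel, unsigned int gl_vbo, size_t n)
{
    cl_int err;

    /* Second "view" of the GL buffer object, as an OpenCL memory handle. */
    cl_mem buf = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, gl_vbo, &err);

    /* Explicit handover: GL must be done with the buffer before compute
     * touches it, and vice versa. */
    clEnqueueAcquireGLObjects(q, 1, &buf, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    clEnqueueReleaseGLObjects(q, 1, &buf, 0, NULL, NULL);
    clFinish(q);                 /* hand the result back to the GL side */
    clReleaseMemObject(buf);
}
```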
14. ADDITIONAL CONSIDERATIONS
! A lot of today's PC systems have more than one GPU available to the programmer
‒ Almost all of today's CPUs are actually APUs and have both CPU and GPU on chip, using the same memory controller
‒ On performance systems, a discrete GPU with dedicated frame buffer memory may be present, too
! The integrated GPU may support cache coherency for system memory updates and is therefore preferential for GPU compute tasks via e.g. DirectCompute or OpenCL™ (see the sketch after this slide)
‒ The performance uplift vs the CPU may differ, but there is often a >10x factor for vector computations vs equivalent CPU instructions
! A discrete GPU can focus on graphics workload acceleration, further processing the data pre-processed by either the host CPU or the integrated GPU for further uplift
‒ Dedicated transfer from/to the discrete GPU frame buffer
‒ For appropriate compute workloads, consider the additional performance uplift through compute on the discrete GPU
! The controls may be in a driver as part of collaborative rendering (e.g. AMD DualGraphics), where the compute processing on the integrated GPU via appropriate APIs interoperates with the "graphics" device
‒ The graphics driver operates in a "Crossfire" mode for the integrated and discrete GPU
‒ Whereas compute processing operates on a DirectCompute or OpenCL™ "device" on the integrated GPU
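A hedged illustration of preferring the integrated (host-coherent) GPU for such compute tasks: OpenCL exposes CL_DEVICE_HOST_UNIFIED_MEMORY, which reports whether a device shares the host memory subsystem. pick_integrated_gpu and the fallback policy are mine; a real application would weigh more criteria.

```c
/* Sketch: prefer a GPU device that shares the host memory subsystem (an APU's
 * integrated GPU) for compute tasks that exchange data frequently with the CPU. */
#include <CL/cl.h>

cl_device_id pick_integrated_gpu(cl_platform_id platform)
{
    cl_device_id devices[16];
    cl_uint count = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &count);

    for (cl_uint i = 0; i < count; ++i) {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(devices[i], CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);
        if (unified)
            return devices[i];  /* integrated GPU: shares the memory controller */
    }
    return count ? devices[0] : NULL;   /* otherwise fall back to any GPU */
}
```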
15. SUMMARY
! HSA and hUMA substantially simplify data exchange between GPU and CPU, allowing it to be processed on both sides
‒ Benefits from a flat address model where data pointer references to content can be resolved on either side
‒ It works best for compute-heavy workloads, where frequent data updates and result retrieval are important
! There are still benefits to keeping some graphics data in a "buffered" address mode through graphics APIs
‒ Leverages "specialty caches", the discrete GPU and storage within the GPU that is optimized for graphics data but makes it "less accessible" for CPU host access
! With appropriate, efficient interoperation between the "buffered" and the "flat" resource view on the GPU, the application can easily traverse between these two data representations
‒ An HSA compliant GPU allows for a very efficient translation between these two representations
‒ Current compute & graphics APIs can be supported in this scheme
‒ With native support for a "flat model" in upcoming modern OSs, direct, "flat", cache coherent references to memory resources will become easier to use directly over time, reducing the need for explicit translation
! Take advantage of all the GPUs and all the memory you find on a system!
‒ There's often more than one, and all have their advantages
16. WHERE TO FIND MORE INFORMATION
THIS PRESENTATION IS ONLY A START…
! AMD Accelerated Parallel Processing (APP) SDK:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
‒ The AMD APP SDK is a complete development platform, providing samples, documentation and other materials to quickly get you started using OpenCL™, Bolt (an open source C++ template library for GPU parallel processing), C++ AMP or Aparapi for Java applications
! AMD CodeXL:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
‒ A powerful tools suite for Windows® and Linux® heterogeneous application debugging and profiling
‒ Works standalone and e.g. integrated as a Visual Studio extension
! AMD Developer Central: http://developer.amd.com
‒ Docs, whitepapers, tools; everything you want to know and need to write performant programs on heterogeneous systems
‒ It's not about either CPU or GPU, it's about both…
17. GO AHEAD ☺