HC-4018, How to make the most of GPU accessible memory, by Paul Blinzer
1. BEING SPECIAL IN A UNIFIED MEMORY WORLD
HOW TO MAKE THE MOST OF GPU ACCESSIBLE MEMORY
PAUL BLINZER, FELLOW, SYSTEM SOFTWARE, AMD
2. THE AGENDA
! What's so special about dealing with memory and a GPU?
‒ The programmer's view of memory
‒ Throwing a GPU into the mix
‒ How do today's systems deal with GPU memory access?
! The many different "types" of memory today and ways to access them
‒ The various places to find them and how best to use them
‒ What changes with HSA and hUMA?
‒ Why the "buffered" view of memory is still important and how to deal with it
! Where to find more information?
! Q & A
3. WHAT'S SO SPECIAL ABOUT MEMORY ACCESS WITH A GPU?
THERE ARE SO MANY DIFFERENT TYPES, BUSES AND CACHES INVOLVED…
[Block diagram: an Accelerated Processing Unit (APU) with CPU cores (DC (L1); IC, FPU, L2; L3) and an on-die GPU (1..N compute units, each an H-CU engine with TU, L1 (TC) and LDS, plus instruction cache, constant cache, global data share and L2 cache) sharing one memory controller to DDR3 system memory through the HSA MMU (IOMMUv2); a discrete GPU with the same compute-unit structure attached over PCIe, with its own memory controller and GDDR5 memory; cached and non-cacheable access paths are marked. Legend: LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache]
4. THE TYPICAL APPLICATION'S VIEW OF MEMORY (1)
A "GEDANKENEXPERIMENT", COMBINING EINSTEIN AND TRON: IMAGINE YOU ARE A CPU CORE EXECUTING AN APPLICATION THREAD, ACCESSING DATA…
! Today's operating systems have an application model based on a user process view of the system
‒ Each application is associated with a process, and the OS isolates the address space of one process from any other on the system; this is enforced by hardware (MMU = "Memory Management Unit")
‒ The application code has a "flat" view of memory: it can allocate memory from the OS, write & read data at that address, etc.
‒ Each CPU core may operate independently on a "thread" within that process
‒ The address may be represented by a 32-bit or 64-bit (44/48-bit) wide pointer value
‒ The memory content may not even be resident in physical memory; it is paged in from backup storage when accessed, maybe pushing other content out
‒ CPU caches keep an often used "working set" of data close to the CPU core's execution units
‒ CPU cache coherency mechanisms invalidate cache content when "outside forces" (typically other CPU cores) update the content of system memory at a given address, ensuring that each CPU core sees the same data
[Figure: per-process virtual address spaces (Process1, Process2) with a user process space up to 2^47-1, a noncanonical VA range, and a kernel mode address space up to 2^64-1; pages (e.g. at 0x12340000, 0x78900000) are mapped via the CPU MMU into the system physical memory space (up to 2^44-1); allocations, including a GPU buffer and the FB aperture, are managed by the OS]
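As a tiny illustrative aside (not from the original slides): from the application's point of view, memory is just a flat range of addresses that any thread in the process can read and write; paging and cache coherency are handled underneath by the OS and the hardware. The helper name producer below is purely for illustration.

```c
/* Illustrative only: the CPU-side "flat" view of memory. Any core/thread in
 * the process can dereference the same pointer; hardware cache coherency
 * makes the update visible without explicit flushes. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *producer(void *arg)
{
    int *data = (int *)arg;        /* same virtual address as in main()  */
    for (int i = 0; i < 1024; ++i)
        data[i] = i;               /* writes land in this core's caches  */
    return NULL;
}

int main(void)
{
    int *data = malloc(1024 * sizeof(int));   /* OS-backed, demand-paged */
    pthread_t t;
    pthread_create(&t, NULL, producer, data);
    pthread_join(&t, NULL);
    printf("%d\n", data[1023]);    /* coherency: the reader sees 1023    */
    free(data);
    return 0;
}
```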
5. THE TYPICAL APPLICATION'S VIEW OF MEMORY (2)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…
! GPUs are typically managed as devices by operating systems:
‒ They can only access physical memory pages as far as the OS memory management is concerned, though the GPU may use "virtual addresses"
‒ GPU accessible system memory is "page-locked" and can't move while the memory may be accessible by the GPU, even if it's currently not used at all
‒ The total amount of memory a GPU can access at a time is limited to the amount of page-locked memory or frame buffer memory
! GPU accessible memory allocations are handled via special APIs (DirectX, OpenGL, OpenCL, etc.)
‒ The memory is managed as single objects (buffers, resources, textures, …); "malloc()-ed" memory is typically not directly accessible by the GPU
‒ CreateResource(), CreateBuffer(), CreateTexture()…
‒ The API typically only provides a "handle" referencing the object
‒ To access the memory content (all or part of it), an API provides functions like MapResourceView(), Lock(), Unlock() or similar, establishing "windows" in the address space to that memory for either GPU or CPU, or putting the content into staging buffers (see the sketch after this slide)
‒ Consider the resource "handle value + offset" as just a special kind of "address" outside of the regular process address space ☺
[Figure: the process VA space (CPU) from before, now alongside a GPU virtual address space; page-locked system memory pages are mapped via the GPU MMU, GPU physical memory (e.g. a discrete frame buffer) is reached through the FB aperture, graphics allocations are managed by the gfx driver while regular allocations remain managed by the OS]
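OpenCL is one of the APIs named above; the following host-side sketch (illustrative only, the helper fill_buffer is not a real API) shows the handle-plus-map pattern: clCreateBuffer() returns an opaque cl_mem handle, and the CPU only sees the content through a temporary mapped "window", much like Lock()/Unlock() or MapResourceView() in other APIs.

```c
/* Illustrative sketch: the buffer/handle + map/unmap pattern in OpenCL.
 * Error handling is omitted; context/queue setup is assumed. */
#include <CL/cl.h>
#include <string.h>

void fill_buffer(cl_context ctx, cl_command_queue queue,
                 size_t size, const void *src)
{
    cl_int err;

    /* The API hands back an opaque handle, not a pointer into the
     * process address space ("handle + offset" as a special address). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

    /* Mapping establishes a temporary CPU-visible "window" onto the object. */
    void *window = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);
    memcpy(window, src, size);

    /* Unmapping closes the window; afterwards the content is again only
     * reachable through the handle (e.g. as a kernel argument). */
    clEnqueueUnmapMemObject(queue, buf, window, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```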
6. THE TYPICAL APPLICATION'S VIEW OF MEMORY (3)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…
! Data visibility (cache coherency) is typically software-managed
‒ CPU cache coherency, when accessing system memory potentially updated by a GPU, may not always be guaranteed, depending on the system configuration (e.g. PCIe bus access)
‒ GPU caches are typically explicitly managed by the driver and need to be refreshed when the CPU updates memory content
‒ One reason is the hardware complexity required to make this performant
‒ Depending on the use scenario, the GPU accessible memory is mapped as "writethrough", "uncached" or "writecombined" by the OS APIs
‒ The good thing about API controlled access is that the OS & driver can copy the content to someplace else and/or into a different format, where it can be more efficiently stored or processed (e.g. 2D tiling)
‒ The bad thing about it is that it's an either/or style of access
‒ For frequent accesses from both CPU & GPU, the translation can be tediously slow
‒ Content that can be accessed by both CPU and GPU simultaneously needs data visibility/coherency rules, leading to the next issue…
[Figure: the same address layout as before — process VA space (CPU) and GPU virtual address space, pages mapped via the GPU MMU and CPU MMU into system physical memory and into GPU physical memory (e.g. a discrete frame buffer via the FB aperture), with graphics allocations managed by the gfx driver and regular allocations managed by the OS]
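As an illustration of the mapping modes mentioned above (OpenCL host code; the helper name and the assumption that the allocation is actually "zero copy" are mine — the behavior is platform- and driver-dependent), CL_MEM_ALLOC_HOST_PTR asks the runtime for host-accessible, typically page-locked memory that both sides can reach without an extra staging copy.

```c
/* Sketch: requesting host-accessible ("zero copy" capable) memory in OpenCL.
 * Whether it ends up cached, uncached or write-combined, and whether access
 * is truly zero-copy, depends on the platform and driver. */
#include <CL/cl.h>

cl_mem make_host_visible_buffer(cl_context ctx, cl_command_queue queue,
                                size_t size, cl_int *err)
{
    /* Ask the runtime to allocate memory the host can map directly;
     * on many APU platforms this avoids a staging copy. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, err);

    /* The CPU still goes through an explicit map/unmap "window"; the
     * runtime performs whatever cache maintenance the mapping type needs. */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                 CL_MAP_WRITE_INVALIDATE_REGION,
                                 0, size, 0, NULL, NULL, err);
    /* ... fill p with input data ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}
```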
7. IT'S ALL ABOUT THROUGHPUT, BANDWIDTH AND LATENCY…
KEEP YOUR DATA CLOSE AND YOUR FREQUENTLY USED DATA EVEN CLOSER…
[Block diagram: same APU + discrete GPU topology as before, annotated with approximate numbers — caches deliver 100's GB/s (CPU) to 100's-1000's GB/s (GPU) at latencies of <1 to 10's of cycles; DDR3-2133 system memory ~17 GB/s; GDDR5 at 3 GHz MCLK ~90 GB/s; x16 PCI-E 3.0 between host and discrete GPU ~15 GB/s; memory and bus accesses cost 10's-100's of cycles of latency. Legend: LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache]
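A rough back-of-the-envelope calculation using the approximate figures above (real transfers also pay latency, synchronization and driver overhead) shows why the PCIe hop dominates for discrete GPUs:

```c
/* Sketch: rough transfer-time estimates from the bandwidth figures above. */
#include <stdio.h>

int main(void)
{
    double size_gb   = 0.256;  /* 256 MB working set            */
    double pcie_gbps = 15.0;   /* x16 PCI-E 3.0, approx.        */
    double ddr3_gbps = 17.0;   /* DDR3-2133, approx.            */
    double gddr_gbps = 90.0;   /* GDDR5 @ 3 GHz MCLK, approx.   */

    printf("over PCIe : %.1f ms\n", 1e3 * size_gb / pcie_gbps); /* ~17 ms  */
    printf("from DDR3 : %.1f ms\n", 1e3 * size_gb / ddr3_gbps); /* ~15 ms  */
    printf("from GDDR5: %.1f ms\n", 1e3 * size_gb / gddr_gbps); /* ~2.8 ms */
    return 0;
}
```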
8. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (1)
! The efficient use of a GPU & CPU in a system depends on understanding their operation on memory
‒ The cache architecture on either CPU or GPU is a reflection of the different access patterns for their "preferred" workloads and data, and so is the cache management/optimization
! CPUs are typically built to operate on general purpose, serial instruction threads, often with high data locality, lots of conditional execution and dealing with data interdependency
‒ The CPU cache hierarchy is focused on general purpose data access from/to the execution units, feeding back previously computed data to the execution units with very low latency
‒ Comparatively few registers (vs GPUs), but large caches keep often used "arbitrary" data close to the execution units
! GPUs are usually built for a SIMD execution model
‒ Apply the same sequence of instructions over and over on data with little variation but high throughput ("streaming data"), passing the data from one processing stage to another (latency tolerance)
‒ Compute units have a relatively large register file store
‒ Using a lot of "specialty caches" (constant cache, texture cache, etc.) and data caches optimized for SW data prefetch
‒ LDS, GDS mainly used for in-wavefront or inter-wavefront updates & synchronization
‒ Data caches are typically explicitly flushed by software
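To make the SIMD/LDS point concrete, here is a small illustrative OpenCL C kernel (not from the deck; reverse_in_tiles is a made-up example): each work-group stages data in __local memory, which maps to the LDS on AMD hardware, and synchronizes with a barrier before reusing it.

```c
/* Illustrative OpenCL C kernel: the same instruction sequence runs across
 * many work-items (SIMD/streaming style); __local memory maps to the LDS and
 * is used for intra-work-group sharing with explicit synchronization. */
__kernel void reverse_in_tiles(__global const float *in,
                               __global float *out,
                               __local float *tile)            /* LDS staging */
{
    const size_t lid  = get_local_id(0);
    const size_t lsz  = get_local_size(0);
    const size_t base = get_group_id(0) * lsz;

    tile[lid] = in[base + lid];            /* stream data into the LDS       */
    barrier(CLK_LOCAL_MEM_FENCE);          /* work-group sync point          */
    out[base + lid] = tile[lsz - 1 - lid]; /* reuse the staged data          */
}
```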
9. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (2)
! The GPU memory & cache access design is well-suited for typical 2D & 3D graphics workloads (duh!)
‒ Vertex data, textures, etc. are passed from the host to the various stages of the graphics API pipeline, with each stage allowing processing of the data passing through via appropriate instruction sequences ("shaders")
‒ Since a lot of the data is "static" and the access is abstracted via APIs, it can be put into better suited data formats, mapping 2D/3D pixel coordinate "locality" to memory locality in internal buffers within the graphics pipeline (a simplified sketch follows after this slide)
‒ Very beneficial for performance, but not easily "accessible" by simple addressing schemes; it requires a copy of the data first
‒ Today's graphics APIs (OpenGL, Direct3D) are well suited for this workload, but often must focus on the lowest-common denominator in hardware capabilities
‒ The API design assumes that no cache coherency between CPU and GPU may exist, requiring the CPU to issue explicit cache flushes or operate on memory areas mapped as "uncached" if readback of GPU data is required
‒ Some extensions or recently introduced features allow for "zero copy" memory
[Figure: 2D tiling — a surface is split into 16x16 tiles along the X and Y coordinates; the pixels of one tile, (X0,Y0) … (X15,Y15), are stored at consecutive memory addresses so that 2D locality becomes memory locality]
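As a purely illustrative sketch of the idea (real GPU tiling formats are hardware-specific and considerably more involved), the following maps 2D coordinates into 16x16 tiles so that neighboring pixels land near each other in memory; tiled_offset and the simplifying width assumption are mine:

```c
/* Sketch of simple 16x16 tiling: pixels within a tile are stored contiguously,
 * so 2D locality translates into memory locality. Real hardware tiling formats
 * add bank/channel interleaving and other hardware-specific twists. */
#include <stddef.h>

#define TILE 16

/* Linear offset of pixel (x, y) in a tiled surface that is 'width' pixels
 * wide (width assumed to be a multiple of TILE for brevity). */
static size_t tiled_offset(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE;
    size_t tile_index    = (y / TILE) * tiles_per_row + (x / TILE);
    size_t within_tile   = (y % TILE) * TILE + (x % TILE);
    return tile_index * (TILE * TILE) + within_tile;
}
```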
10. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (3)
! Vector/matrix-oriented compute workloads map well to GPUs, but until now "suffer" from some of the choices that benefit the graphics data processing flow
‒ Compute APIs like OpenCL™ or DirectCompute are often still inherently tied to the low-level, graphics-focused GPU infrastructure in today's OS (e.g. memory management through Microsoft® WDDM, Linux® TTM/GEM)
‒ "Zero copy" support and system memory buffer cache coherency in recent APIs improve the behavior on some platforms that have appropriate support, but there is still some SW overhead for access
‒ All the memory processed by the GPU is referenced through handles to control memory page-lock on workload dispatch, and the SW needs to create "buffer views" either explicitly or under the covers to access regular memory
‒ There is quite some SW overhead involved in that
! Discrete GPUs have excellent compute performance (several TeraFLOPS for even mid-range cards)
‒ But they require the data to be accessible in local memory for best performance, requiring copy operations from host memory and "keeping the data on the other side" as long as possible (sketched after this slide)
‒ Accessing or pushing the data back and forth through the PCIe bottleneck may reduce or eliminate speedup gains, or increase access latency from the host substantially
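A hedged host-side sketch of the "keep the data on the other side" advice (illustrative OpenCL; run_pipeline, stage1 and stage2 are placeholder names): copy the input across PCIe once, run the whole kernel chain on device-resident buffers, and read back only the final result.

```c
/* Sketch: minimize PCIe traffic by keeping intermediate data device-resident.
 * 'stage1' and 'stage2' are placeholder kernels; setup and errors omitted. */
#include <CL/cl.h>

void run_pipeline(cl_command_queue q, cl_context ctx,
                  cl_kernel stage1, cl_kernel stage2,
                  const float *input, float *result, size_t n)
{
    size_t bytes = n * sizeof(float);
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem tmp = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    /* One host->device copy over PCIe for the whole pipeline. */
    clEnqueueWriteBuffer(q, in, CL_FALSE, 0, bytes, input, 0, NULL, NULL);

    clSetKernelArg(stage1, 0, sizeof(cl_mem), &in);
    clSetKernelArg(stage1, 1, sizeof(cl_mem), &tmp);
    clEnqueueNDRangeKernel(q, stage1, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Intermediate data ('tmp') never crosses the bus. */
    clSetKernelArg(stage2, 0, sizeof(cl_mem), &tmp);
    clSetKernelArg(stage2, 1, sizeof(cl_mem), &out);
    clEnqueueNDRangeKernel(q, stage2, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* One device->host copy for the final result. */
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, bytes, result, 0, NULL, NULL);

    clReleaseMemObject(in); clReleaseMemObject(tmp); clReleaseMemObject(out);
}
```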
11. HOW DO HUMA AND HSA CHANGE THINGS?
! First, let's redraw the address layout map from before…
‒ It's the same layout, just a different visualization (focus on bit 47 ☺)
‒ The GPU's virtual address page table mapping is set to a process address view of the memory space
‒ A data pointer has the same "meaning" (= points to the same content) in system memory, also known as "ptr-is-ptr" (see the sketch after this slide)
‒ On an OS that supports the HSA MMU functionality, the page tables may even be shared and the OS may support native GPU demand paging
‒ The GPU may still support additional address ranges for special purposes (e.g. frame buffer memory, LDS, scratch, …)
‒ There is efficient hardware support for GPU & CPU cache coherency on memory load/store operations by the GPU
‒ Reads and updates of system memory from one will cause cache line flushes or line invalidation on the other processors in the system
‒ SW does not have to deal with explicit cache line flushes or invalidations for such transactions anymore; it works like for any CPU core in the system
‒ This fully works for APUs, where GPU and CPU have access to the same system memory controller; there is partial support for discrete GPUs
‒ Platform atomics are supported, for efficient synchronization
[Figure: the process VA space (CPU) and the GPU virtual address space now share the same layout — user process space up to 2^47-1, noncanonical VA range, kernel mode address space up to 2^64-1; pages are mapped via the HSA MMU and CPU MMU into the system physical memory space (up to 2^44-1) and, for special purposes, into GPU physical memory (e.g. a discrete frame buffer via the FB aperture); allocations are managed by the OS & gfx driver]
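At the API level the "ptr-is-ptr" model surfaces, for example, as shared virtual memory in OpenCL 2.0. The sketch below is illustrative (shared_array is a made-up helper), and fine-grained SVM is only available where the hardware and driver support it:

```c
/* Sketch: OpenCL 2.0 fine-grained shared virtual memory ("ptr-is-ptr").
 * The same pointer value is meaningful to both the CPU and the GPU kernel;
 * no explicit map/unmap or cache flush calls are needed on supporting HW. */
#include <CL/cl.h>

float *shared_array(cl_context ctx, cl_kernel kernel, size_t n)
{
    float *data = (float *)clSVMAlloc(ctx,
                      CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                      n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i)      /* CPU writes through the pointer  */
        data[i] = (float)i;

    /* The kernel receives the very same pointer value. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    return data;                        /* later: clSVMFree(ctx, data)     */
}
```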
12. THERE ARE STILL REASONS FOR THE "BUFFERED VIEW" OF MEMORY
! HSA and hUMA are very useful for compute jobs and for graphics data often updated by the host CPU
‒ Allows fine-grained "interactive" sharing of data between CPU and GPU threads without requiring prophylactic cache flushes and other synchronization
! But the "direct view" of and access to common memory is less beneficial for other graphics data
‒ Many graphics algorithms have been designed with an "abstract" or "deferred" view of memory, focusing on "dimensional addressing" of the data in the shaders (e.g. x/y/z, u/w coordinates)
‒ Many GPUs use hardware-specific texture tiling formats that are optimized for a specific memory channel layout to reach maximum performance; these are complicated to address by software in a general way
‒ An application may have multiple graphics contexts concurrently per process (per API), vs just one for "flat"
‒ A lot of graphics data (e.g. textures, vertices, et al.) is not changing often through CPU updates
‒ Requiring cache coherency increases HW access overhead for little benefit
‒ Many specialty resources (e.g. Z-buffer) have GPU-specific implementations with no "external" visibility
‒ Leveraging the much higher performance of a discrete GPU and its frame buffer memory is somewhat more complicated if an application needs to deal with the memory location directly
! Most common graphics APIs today don't know how to deal with virtual addresses
‒ This will change in the future as utilizing virtual addresses within graphics APIs becomes commonplace
13. GRAPHICS INTEROPERATION IS IMPORTANT
! There are many different graphics/GPU APIs in use, using buffers/resources to access memory
‒ As seen before, there are good reasons to keep the content in "buffers", either due to legacy or performance
‒ It also may not make sense to "waste" virtual address space, e.g. in 32-bit apps, on resources not accessed by the host
‒ But this may also make it harder to access the content from either the CPU or through a "flat addressing" aware GPU
! Explicit interoperation APIs to traditional graphics APIs provide two views of a resource (see the sketch after this slide)
‒ The translation between "handle + offset" and "flat address" is dealt with within the runtime and driver
‒ The translation itself may be straightforward and very efficient, however
! Specialty GPU resources (e.g. LDS, scratch) may be mapped into the "flat" process address space, but may not be accessible by the CPU host since they're not accessible from the "outside"
‒ This is no different than some other system memory mappings provided by the OS
! Applications should focus on efficient processing of the data on the "compute" side with a dedicated handover to the "graphics" side when appropriate
‒ As graphics APIs are updated over time to take advantage of flat addressing models (e.g. for "bindless textures"), the need for the interoperation mechanisms may gradually vanish for most graphics data
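One concrete form of such an interoperation API is the OpenCL/OpenGL sharing extension (cl_khr_gl_sharing). The sketch below is illustrative (process_gl_buffer is a made-up helper) and assumes the cl_context was created with GL sharing enabled; the same GL buffer object appears as a second view, a cl_mem handle, with explicit acquire/release as the handover points.

```c
/* Sketch: OpenCL <-> OpenGL interop via cl_khr_gl_sharing.
 * Assumes the cl_context was created against the current GL context. */
#include <CL/cl.h>
#include <CL/cl_gl.h>

void process_gl_buffer(cl_context ctx, cl_command_queue q,
                       cl_kernel kernel, unsigned int gl_vbo, size_t n)
{
    cl_int err;

    /* Second "view" of the GL buffer object, as an OpenCL memory handle. */
    cl_mem buf = clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, gl_vbo, &err);

    /* Explicit handover: GL must be done with the buffer before compute
     * touches it, and vice versa. */
    clEnqueueAcquireGLObjects(q, 1, &buf, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    clEnqueueReleaseGLObjects(q, 1, &buf, 0, NULL, NULL);
    clFinish(q);                 /* hand the result back to the GL side */
    clReleaseMemObject(buf);
}
```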
14. ADDITIONAL CONSIDERATIONS
! A lot of today's PC systems have more than one GPU available to the programmer
‒ Almost all of today's CPUs are actually APUs and have both CPU and GPU on chip, using the same memory controller
‒ On performance systems, a discrete GPU with dedicated frame buffer memory may be present, too
! The integrated GPU may support cache coherency for system memory updates and is therefore preferential for GPU compute tasks via e.g. DirectCompute or OpenCL™ (see the sketch after this slide)
‒ The performance uplift vs the CPU may differ, but there is often a >10x factor for vector computations vs equivalent CPU instructions
! A discrete GPU can focus on graphics workload acceleration, further processing the data pre-processed by either the host CPU or the integrated GPU for further uplift
‒ Dedicated transfer from/to the discrete GPU frame buffer
‒ For appropriate compute workloads, consider the additional performance uplift through compute on the discrete GPU
! The controls may be in a driver as part of collaborative rendering (e.g. AMD DualGraphics), where the compute processing on the integrated GPU via appropriate APIs interoperates with the "graphics" device
‒ The graphics driver operates in a "Crossfire" mode for the integrated and discrete GPU
‒ Whereas compute processing operates on a DirectCompute or OpenCL™ "device" on the integrated GPU
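A hedged illustration of preferring the integrated (host-coherent) GPU for such compute tasks: OpenCL exposes CL_DEVICE_HOST_UNIFIED_MEMORY, which reports whether a device shares the host memory subsystem. pick_integrated_gpu and the fallback policy are mine; a real application would weigh more criteria.

```c
/* Sketch: prefer a GPU device that shares the host memory subsystem (an APU's
 * integrated GPU) for compute tasks that exchange data frequently with the CPU. */
#include <CL/cl.h>

cl_device_id pick_integrated_gpu(cl_platform_id platform)
{
    cl_device_id devices[16];
    cl_uint count = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &count);

    for (cl_uint i = 0; i < count; ++i) {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(devices[i], CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);
        if (unified)
            return devices[i];  /* integrated GPU: shares the memory controller */
    }
    return count ? devices[0] : NULL;   /* otherwise fall back to any GPU */
}
```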
15. SUMMARY
! HSA and hUMA substantially simplify data exchange between GPU and CPU, allowing it to be processed on both sides
‒ Benefits from a flat address model where data pointer references to content can be resolved on either side
‒ It works best for compute-heavy workloads, where frequent data updates and result retrieval are important
! There are still benefits to keeping some graphics data in a "buffered" address mode through graphics APIs
‒ Leverages "specialty caches", the discrete GPU and storage within the GPU that is optimized for graphics data but makes it "less accessible" for CPU host access
! With appropriate, efficient interoperation between the "buffered" and the "flat" resource view on the GPU, the application can easily traverse between these two data representations
‒ An HSA compliant GPU allows for a very efficient translation between these two representations
‒ Current compute & graphics APIs can be supported in this scheme
‒ With native support for a "flat model" in upcoming modern OSs, direct, "flat", cache coherent references to memory resources will become easier to use directly over time, reducing the need for explicit translation
! Take advantage of all the GPUs and all the memory you find on a system!
‒ There's often more than one, and all have their advantages
16. WHERE TO FIND MORE INFORMATION
THIS PRESENTATION IS ONLY A START…
! AMD Accelerated Parallel Processing (APP) SDK:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
‒ The AMD APP SDK is a complete development platform, providing samples, documentation and other materials to quickly get you started using OpenCL™, Bolt (an open source C++ template library for GPU parallel processing), C++ AMP or Aparapi for Java applications
! AMD CodeXL:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
‒ A powerful tools suite for Windows® and Linux® heterogeneous application debugging and profiling
‒ Works standalone and e.g. integrated as a Visual Studio extension
! AMD Developer Central: http://developer.amd.com
‒ Docs, whitepapers, tools; everything you want to know and need to write performant programs on heterogeneous systems
‒ It's not about either CPU or GPU, it's about both…
17. GO AHEAD ☺