BEING SPECIAL IN A UNIFIED MEMORY WORLD
HOW TO MAKE THE MOST OF GPU ACCESSIBLE MEMORY

PAUL BLINZER
FELLOW, SYSTEM SOFTWARE, AMD
THE AGENDA

!  What's so special about dealing with memory and a GPU?
‒  The programmer's view of memory
‒  Throwing a GPU into the mix
‒  How do today's systems deal with GPU memory access?

!  The many different "types" of memory today and ways to access them
‒  The various places to find them and how best to use them
‒  What changes with HSA and hUMA?
‒  Why the "buffered" view of memory is still important and how to deal with it

!  Where to find more information
!  Q & A

2 | BEING SPECIAL IN A UNIFIED MEMORY WORLD | NOVEMBER 13, 2013 | APU13
WHAT'S SO SPECIAL ABOUT MEMORY ACCESS WITH A GPU?
THERE ARE SO MANY DIFFERENT TYPES, BUSES AND CACHES INVOLVED…
LDS = Local Data Share
TU = Texture Unit
TC = Texture Cache

[Figure: block diagram of an Accelerated Processing Unit (APU) — CPU cores (each with IC, FPU and DC (L1), sharing an L2) alongside an on-chip GPU with 1..N compute units (each H-CU engine with TU, L1 (TC) and LDS, plus instruction cache, constant cache, Global Data Share and L2 cache), all reaching DDR3 system memory through one memory controller and the HSA MMU (IOMMUv2) — next to a discrete GPU attached via PCIe with its own compute units, caches, memory controller and GDDR5 memory; cached and non-cacheable paths are marked on both sides]
THE TYPICAL APPLICATION'S VIEW OF MEMORY (1)
A "GEDANKENEXPERIMENT", COMBINING EINSTEIN AND TRON: IMAGINE YOU ARE A CPU CORE EXECUTING AN APPLICATION THREAD, ACCESSING DATA…

!  Today's operating systems have an application model based on a user process view of the system
‒  Each application is associated with a process, and the OS isolates the address space of one process from any other on the system; this is enforced by hardware (MMU = "Memory Management Unit")
‒  The application code has a "flat" view of memory: it can allocate memory from the OS, write & read data at that address, etc.
‒  Each CPU core may operate independently on a "thread" within that process
‒  The address may be represented by a 32-bit or 64-bit (44/48-bit) wide pointer value
‒  The memory content may not even be resident in physical memory; it is paged in from backing storage when accessed, possibly pushing other content out
‒  CPU caches keep an often-used "working set" of data close to the CPU core's execution units
‒  CPU cache coherency mechanisms invalidate cache content when "outside forces" (typically other CPU cores) update the content of system memory at a given address, ensuring that each CPU core sees the same data

[Figure: per-process virtual address layout — user process space from 0x00000000 up to 2^47-1, a noncanonical VA range, and the kernel mode address space up to 2^64-1; process VA pages (e.g. at 0x12340000, 0x78900000) are mapped via the CPU MMU onto pages in system physical memory (up to 2^44-1), managed by the OS]
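The MMU mapping described above can be sketched as a toy page-table lookup. All names and the 4 KiB page size here are illustrative, not taken from the slides:

```python
PAGE_SHIFT = 12  # 4 KiB pages, as on x86
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def translate(page_table, va):
    """Resolve a virtual address to a physical address via a per-process
    page table; an unmapped page raises (the 'page fault' the OS would
    service by paging content in from backing storage)."""
    vpn = va >> PAGE_SHIFT              # virtual page number
    if vpn not in page_table:
        raise KeyError("page fault at 0x%x" % va)
    return (page_table[vpn] << PAGE_SHIFT) | (va & PAGE_MASK)

# Two processes map the *same* virtual page to different physical
# pages -- this is the hardware-enforced isolation the slide mentions.
process1 = {0x12340: 0x00555}
process2 = {0x12340: 0x00777}
```

Here `translate(process1, 0x12340ABC)` and `translate(process2, 0x12340ABC)` resolve to different physical addresses even though the virtual address is identical.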
THE TYPICAL APPLICATION'S VIEW OF MEMORY (2)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…

!  GPUs are typically managed as devices by operating systems:
‒  They can only access physical memory pages as far as the OS memory management is concerned, though the GPU may use "virtual addresses"

!  GPU accessible memory allocations are handled via special APIs (DirectX, OpenGL, OpenCL, etc.)
‒  The memory is managed as single objects (buffers, resources, textures, …); "malloc()-ed" memory is typically not directly accessible by the GPU
‒  CreateResource(), CreateBuffer(), CreateTexture()…
‒  The API typically only provides a "handle" referencing the object
‒  Consider the resource "handle value + offset" as just a special kind of "address" outside of the regular process address space ☺
‒  To access the memory content (all or part of it), an API provides functions like MapResourceView(), Lock(), Unlock() or similar, establishing "windows" in the address space to that memory for either GPU or CPU, or putting it into staging buffers
‒  GPU accessible system memory is "page-locked" and can't move while the memory may be accessible by the GPU, even if it's currently not used at all
‒  The total amount of memory a GPU can access at a time is limited to the amount of page-locked memory or frame buffer memory

[Figure: the process VA space (CPU) next to a separate GPU virtual address space — GPU buffers live in GPU physical memory (e.g. a discrete card's framebuffer, managed by the gfx driver) or in page-locked system memory pages mapped via the GPU MMU; the CPU reaches the framebuffer only through the FB aperture mapped via the CPU MMU, with system memory managed by the OS]
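A minimal sketch of this handle-based model — the class and method names below are invented for illustration; real APIs such as OpenCL or Direct3D differ in detail:

```python
class BufferAPI:
    """Toy driver front-end: memory is only reachable through opaque
    handles, and CPU access requires explicitly mapping a 'window'."""

    def __init__(self):
        self._buffers = {}   # handle -> bytearray (driver-managed storage)
        self._next = 1

    def create_buffer(self, size):
        """Like CreateBuffer(): no pointer escapes, only a handle."""
        handle = self._next
        self._next += 1
        self._buffers[handle] = bytearray(size)
        return handle

    def map_view(self, handle, offset, length):
        """Like MapResourceView()/Lock(): expose a CPU-visible window
        onto part of the object referenced by the handle."""
        return memoryview(self._buffers[handle])[offset:offset + length]
```

Usage: `h = api.create_buffer(256)` followed by `api.map_view(h, 16, 4)[:] = b"abcd"` writes through the window; outside such a mapping the application holds nothing but the integer handle.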
THE TYPICAL APPLICATION'S VIEW OF MEMORY (3)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…

!  Data visibility (cache coherency) is typically software-managed
‒  One reason is the hardware complexity needed to make this performant
‒  GPU caches are typically explicitly managed by the driver and need to be refreshed when the CPU updates memory content
‒  CPU cache coherency, when accessing system memory potentially updated by a GPU, may not always be guaranteed, depending on the system configuration (e.g. PCIe bus access)
‒  Depending on the use scenario, the GPU accessible memory is mapped as "writethrough", "uncached" or "writecombined" by the OS APIs

!  The good thing about API controlled access is that the OS & driver can copy the content someplace else and/or into a different format, where it can be more efficiently stored or processed (e.g. 2D tiling)
‒  The bad thing about it is that it's an either/or style access
‒  For frequent accesses from both CPU & GPU, the translation can be tediously slow
‒  Content that can be accessed by both CPU and GPU simultaneously needs data visibility/coherency rules, leading to the next issue…

[Figure: the same CPU process VA space / GPU virtual address space diagram as before, annotated with which ranges are managed by the OS and which by the gfx driver]
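Software-managed coherency boils down to explicit flushes. A toy model (names invented for illustration) in which GPU writes sit in a cache and only become visible to the CPU after an explicit `flush()`:

```python
class SoftwareManagedCache:
    """GPU-side write-back cache over shared memory: the CPU only sees
    GPU writes after an explicit flush, mirroring the driver-managed
    GPU caches described in the slide."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value ("system memory")
        self.dirty = {}        # GPU writes not yet visible to the CPU

    def gpu_write(self, addr, value):
        self.dirty[addr] = value        # stays in the GPU cache

    def flush(self):
        self.memory.update(self.dirty)  # make writes visible to the CPU
        self.dirty.clear()

mem = {0x1000: 0}
cache = SoftwareManagedCache(mem)
cache.gpu_write(0x1000, 42)
stale = mem[0x1000]   # CPU still reads 0: no automatic coherency
cache.flush()
fresh = mem[0x1000]   # after the explicit flush the CPU reads 42
```

Forgetting the `flush()` is exactly the class of bug that hardware coherency (hUMA, later slides) removes.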
IT'S ALL ABOUT THROUGHPUT, BANDWIDTH AND LATENCY…
KEEP YOUR DATA CLOSE AND YOUR FREQUENTLY USED DATA EVEN CLOSER…

LDS = Local Data Share
TU = Texture Unit
TC = Texture Cache

[Figure: the APU + discrete GPU block diagram annotated with bandwidths and latencies — caches: 100's–1000's GB/s at <1–10's of cycles of latency; DDR3-2133 system memory ~17 GB/s; GDDR5 (3 GHz MCLK) ~90 GB/s; x16 PCI-E 3.0 ~15 GB/s; memory/bus latency 10's–100's of cycles]
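The slide's numbers make the penalty easy to quantify. A small helper using the approximate figures above (~15 GB/s PCIe vs ~90 GB/s GDDR5) for back-of-the-envelope transfer times:

```python
def transfer_ms(num_bytes, gb_per_s):
    """Idealized time in milliseconds to move num_bytes at the given
    bandwidth (GB meaning 10**9 bytes; real links add protocol overhead)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3

ONE_GIB = 1 << 30
pcie_ms = transfer_ms(ONE_GIB, 15)    # host <-> discrete GPU, x16 PCIe 3.0
gddr5_ms = transfer_ms(ONE_GIB, 90)   # discrete GPU <-> its local GDDR5
```

Moving 1 GiB over PCIe takes roughly 72 ms versus roughly 12 ms from local GDDR5 — a ~6x gap, which is why the slide's advice is to keep frequently used data on the side that consumes it.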
IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (1)

!  The efficient use of a GPU & CPU in a system depends on understanding how they operate on memory
‒  The cache architecture on both CPU and GPU is a reflection of the different access patterns of their "preferred" workloads and data, and so is the cache management/optimization

!  CPUs are typically built to operate on general purpose, serial instruction threads, often with high data locality, lots of conditional execution, and data interdependency to deal with
‒  The CPU cache hierarchy is focused on general purpose data access from/to the execution units, feeding previously computed data back to the execution units with very low latency
‒  Comparatively few registers (vs. GPUs), but large caches keep often-used "arbitrary" data close to the execution units

!  GPUs are usually built for a SIMD execution model
‒  Apply the same sequence of instructions over and over on data with little variation but high throughput ("streaming data"), passing the data from one processing stage to another (latency tolerance)
‒  Compute units have a relatively large register file store
‒  Lots of "specialty caches" (constant cache, texture cache, etc.), with data caches optimized for SW data prefetch
‒  LDS and GDS are mainly used for in-wavefront or inter-wavefront updates & synchronization
‒  Data caches are typically explicitly flushed by software
IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (2)

!  The GPU memory & cache access design is well suited for typical 2D & 3D graphics workloads (duh!)
‒  Vertex data, textures, etc. are passed from the host to the various stages of the graphics API pipeline, with each stage allowing processing of the data passing through via appropriate instruction sequences ("shaders")
‒  Since a lot of the data is "static" and the access is abstracted via APIs, it can be put into better suited data formats, mapping 2D/3D pixel coordinate "locality" to memory locality in internal buffers within the graphics pipeline
‒  Very beneficial for performance, but not easily "accessible" by simple addressing schemes; it requires a copy of the data first
‒  Today's graphics APIs (OpenGL, Direct3D) are well suited for this workload, but often must focus on the lowest-common denominator in hardware capabilities
‒  The API design assumes that no cache coherency between CPU and GPU may exist, requiring the CPU to issue explicit cache flushes or operate on memory areas mapped as "uncached" if readback of GPU data is required
‒  Some extensions or recently introduced features allow for "zero copy" memory

[Figure: 2D tiling — the image's X/Y coordinate space is split into 16×16 tiles, so that consecutive memory addresses (X0,Y0 … X15,Y15) stay local within one tile instead of striding across entire image rows]
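The 16×16 tiling in the figure can be expressed as simple address arithmetic. This is a generic linear-within-tile layout for illustration; real GPU tiling formats are hardware-specific, as a later slide notes:

```python
TILE = 16  # 16x16 tiles, matching the figure

def tiled_offset(x, y, width):
    """Linear element offset of pixel (x, y) in a 16x16-tiled image whose
    width is a multiple of TILE: tiles are laid out row-major across the
    image, and pixels are row-major inside each tile."""
    tiles_per_row = width // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    within_tile = (y % TILE) * TILE + (x % TILE)
    return tile_index * TILE * TILE + within_tile
```

The payoff: in a 64-wide image, the pixel directly below (0, 0) is 16 elements away instead of 64, so a small 2D neighborhood stays within one cache-friendly region — the "memory locality" the slide refers to.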
IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (3)

!  Vector/matrix-oriented compute workloads map well to GPUs, but until now they "suffer" from some of the choices that benefit the graphics data processing flow
‒  Compute APIs like OpenCL™ or DirectCompute are often still inherently tied to the low-level, graphics-focused GPU infrastructure in today's OS (e.g. memory management through Microsoft® WDDM, Linux® TTM/GEM)
‒  "Zero copy" support and system memory buffer cache coherency in recent APIs improve the behavior on platforms that have appropriate support, but still carry some SW overhead for access
‒  All the memory processed by the GPU is referenced through handles to control memory page-locking on workload dispatch, and the SW needs to create "buffer views" either explicitly or under the covers to access regular memory
‒  There is quite some SW overhead involved in that

!  Discrete GPUs have excellent compute performance (several TeraFLOPS even for mid-range cards)
‒  But they require the data to be accessible in local memory for best performance, requiring copy operations from host memory and "keeping the data on the other side" as long as possible
‒  Accessing or pushing the data back and forth through the PCIe bottleneck may reduce or eliminate speedup gains, or increase access latency from the host substantially
HOW DO HUMA AND HSA CHANGE THINGS?

!  First, let's redraw the address layout map from before…
‒  It's the same layout, just a different visualization (focus on bit 47 ☺)

!  The GPU's virtual address page table mapping is set to a process address view of the memory space
‒  A data pointer has the same "meaning" (= points to the same content) in system memory (also known as "ptr-is-ptr")
‒  On OSes that support the HSA MMU functionality, the page tables may even be shared, and the OS may support native GPU demand paging
‒  The GPU may still support additional address ranges for special purposes (e.g. frame buffer memory, LDS, scratch, …)

!  There is efficient hardware support for GPU & CPU cache coherency on memory load/store operations by the GPU
‒  Reads and updates of system memory from one side will cause cache line flushes or line invalidations on the other processors in the system
‒  SW does not have to deal with explicit cache line flushes or invalidations for such transactions anymore; it works like for any CPU core in the system
‒  This fully works for APUs, where GPU and CPU have access to the same system memory controller; there is partial support for discrete GPUs
‒  Platform atomics are supported, for efficient synchronization

[Figure: the redrawn address map — the process VA space (CPU, mapped via the CPU MMU) and the GPU virtual address space (mapped via the HSA MMU) share the same layout and resolve to the same pages in system physical memory, managed by the OS & gfx driver; the FB aperture / GPU physical memory (e.g. discrete) remains available for special ranges]
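The "ptr-is-ptr" property can be sketched by letting both devices resolve addresses through one shared page table. This is a toy model with illustrative names, not HSA runtime API:

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def translate(page_table, va):
    """Shared-page-table lookup: same table in, same physical address out,
    regardless of which device performs the walk."""
    return (page_table[va >> PAGE_SHIFT] << PAGE_SHIFT) | (va & PAGE_MASK)

# One page table for the whole process.
process_pt = {0x12340: 0x00555}

# Under HSA, the CPU MMU and the HSA MMU (IOMMUv2) walk the *same*
# process page table, so the same virtual address names the same content:
cpu_pa = translate(process_pt, 0x12340ABC)  # via CPU MMU
gpu_pa = translate(process_pt, 0x12340ABC)  # via HSA MMU
```

Contrast this with the earlier model, where the GPU had its own separate virtual address space and page-locked mappings.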
THERE ARE STILL REASONS FOR THE "BUFFERED VIEW" OF MEMORY

!  HSA and hUMA are very useful for compute jobs and for graphics data often updated by the host CPU
‒  They allow fine-grained "interactive" sharing of data between CPU and GPU threads without requiring prophylactic cache flushes and other synchronization

!  But the "direct view" of and access to common memory is less beneficial for other graphics data
‒  Many graphics algorithms have been designed with an "abstract" or "deferred" view of memory, focusing on "dimensional addressing" of the data in the shaders (e.g. x/y/z, u/w coordinates)
‒  Many GPUs use hardware-specific texture tiling formats that are optimized for a specific memory channel layout to reach maximum performance; these are complicated to address by software in a general way
‒  An application may have multiple graphics contexts concurrently per process (per API), vs. just one for "flat"
‒  A lot of graphics data (e.g. textures, vertices, et al.) does not change often through CPU updates
‒  requiring cache coherency there increases HW access overhead for little benefit
‒  Many specialty resources (e.g. Z-buffer) have GPU-specific implementations with no "external" visibility
‒  Leveraging the much higher performance of a discrete GPU and its frame buffer memory is somewhat more complicated if an application needs to deal with the memory location directly

!  Most common graphics APIs today don't know how to deal with virtual addresses
‒  This will change in the future as utilizing virtual addresses within graphics APIs becomes commonplace
  
GRAPHICS INTEROPERATION IS IMPORTANT

!  There are many different graphics/GPU APIs in use, using buffers/resources to access memory
‒  As seen before, there are good reasons to keep the content in "buffers", whether due to legacy or performance
‒  It also may not make sense to "waste" virtual address space, e.g. in 32-bit apps, on resources not accessed by the host
‒  But this may also make it harder to access the content from either the CPU or through a "flat addressing" aware GPU

!  Explicit interoperation APIs with traditional graphics APIs provide two views of a resource
‒  The translation between "handle + offset" and "flat address" is dealt with within the runtime and driver
‒  The translation itself may be straightforward and very efficient, however

!  Specialty GPU resources (e.g. LDS, scratch) may be mapped into the "flat" process address space, but may not be accessible by the CPU host, since they're not reachable from the "outside"
‒  This is no different from some other system memory mappings provided by the OS

!  Applications should focus on efficient processing of the data on the "compute" side, with a dedicated handover to the "graphics" side when appropriate
‒  As graphics APIs are updated over time to take advantage of flat addressing models (e.g. for "bindless textures"), the need for the interoperation mechanisms may gradually vanish for most graphics data
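The two views of a resource amount to a cheap bidirectional translation once the runtime knows each buffer's base address. A toy registry, with all names invented for illustration:

```python
class InteropRegistry:
    """Maps between the 'buffered' view (handle + offset) and the 'flat'
    view (a process virtual address) of the same GPU-visible allocation."""

    def __init__(self):
        self._base = {}   # handle -> (base virtual address, size)

    def register(self, handle, base_va, size):
        self._base[handle] = (base_va, size)

    def to_flat(self, handle, offset):
        """Buffered -> flat: one add, as efficient as the slide suggests."""
        base, size = self._base[handle]
        assert 0 <= offset < size
        return base + offset

    def to_buffered(self, va):
        """Flat -> buffered: linear search is fine for a sketch; a real
        runtime would use an interval tree keyed on base addresses."""
        for handle, (base, size) in self._base.items():
            if base <= va < base + size:
                return handle, va - base
        raise KeyError(hex(va))
```

Usage: after `reg.register(7, 0x200000, 0x1000)`, `reg.to_flat(7, 0x10)` yields the flat address and `reg.to_buffered(...)` recovers the handle and offset.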
  
ADDITIONAL CONSIDERATIONS

!  A lot of today's PC systems have more than one GPU available to the programmer
‒  Almost all of today's CPUs are actually APUs and have both CPU and GPU on chip, using the same memory controller
‒  On performance systems, a discrete GPU with dedicated frame buffer memory may be present, too

!  The integrated GPU may support cache coherency for system memory updates and is therefore preferential for GPU compute tasks via e.g. DirectCompute or OpenCL™
‒  The performance uplift vs. the CPU may differ, but there is often a >10× factor for vector computations vs. equivalent CPU instructions

!  The discrete GPU can focus on graphics workload acceleration, further processing the data pre-processed by either the host CPU or the integrated GPU for further uplift
‒  Dedicated transfer from/to the discrete GPU frame buffer
‒  For appropriate compute workloads, consider the additional performance uplift through compute on the discrete GPU

!  The controls may be in a driver as part of collaborative rendering (e.g. AMD DualGraphics), where the compute processing on the integrated GPU via appropriate APIs interoperates with the "graphics" device
‒  The graphics driver operates in a "Crossfire" mode for integrated and discrete GPU
‒  whereas the compute side operates on a DirectCompute or OpenCL™ "device" on the integrated GPU
  
SUMMARY

!  HSA and hUMA substantially simplify data exchange between GPU and CPU, allowing it to be processed on both sides
‒  This benefits from a flat address model where data pointer references to content can be resolved on either side
‒  It works best for compute-heavy workloads where frequent data updates and result retrieval are important

!  There are still benefits to keeping some graphics data in a "buffered" address mode through graphics APIs
‒  This leverages "specialty caches", the discrete GPU and storage within the GPU that is optimized for graphics data, but makes it "less accessible" for CPU host access

!  With appropriate, efficient interoperation between the "buffered" and the "flat" resource view on the GPU, the application can easily traverse between these two data representations
‒  An HSA compliant GPU allows for a very efficient translation between these two representations
‒  Current compute & graphics APIs can be supported in this scheme
‒  With native support for a "flat model" in upcoming modern OSes, direct, "flat", cache coherent references to memory resources will become easier to use directly over time, reducing the need for explicit translation

!  Take advantage of all the GPUs and all the memory you find on a system!
‒  There's often more than one, and all have their advantages
  
WHERE TO FIND MORE INFORMATION
THIS PRESENTATION IS ONLY A START…

!  AMD Accelerated Parallel Processing (APP) SDK:
‒  http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
‒  The AMD APP SDK is a complete development platform, providing samples, documentation and other materials to quickly get you started using OpenCL™, Bolt (an open source C++ template library for GPU parallel processing), C++ AMP, or Aparapi for Java applications

!  AMD CodeXL:
‒  http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
‒  A powerful tools suite for Windows® and Linux® heterogeneous application debugging and profiling
‒  Works standalone and, e.g., integrated as a Visual Studio extension

!  AMD Developer Central: http://developer.amd.com
‒  Docs, whitepapers, tools; everything you want to know and need to write performant programs on heterogeneous systems
‒  It's not about either CPU or GPU, it's about both…
  
GO AHEAD ☺
  
DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Corp. and Linux is a trademark of Linus Torvalds and Microsoft is a trademark of Microsoft Corp. PCI Express is a trademark of PCI SIG Corporation. Other names are for informational purposes only and may be trademarks of their respective owners.
18 |  BEING SPECIAL IN A UNIFIED MEMORY WORLD |  NOVEMBER 13, 2013 | APU13

 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

HC-4018, How to make the most of GPU accessible memory, by Paul Blinzer

  • 1. BEING  SPECIAL  IN  A  UNIFIED  MEMORY  WORLD     HOW  TO  MAKE  THE  MOST  OF  GPU  ACCESSIBLE  MEMORY   PAUL  BLINZER   FELLOW,  SYSTEM  SOFTWARE,  AMD  
  • 2. THE AGENDA
! What's so special about dealing with memory and a GPU?
‒ The programmer's view of memory
‒ Throwing a GPU into the mix
‒ How do today's systems deal with GPU memory access?
! The many different "types" of memory today and ways to access them
‒ The various places to find them and how best to use them
‒ What changes with HSA and hUMA?
‒ Why a "buffered" view of memory is still important and how to deal with it
! Where to find more information
! Q & A
  • 3. WHAT'S SO SPECIAL ABOUT MEMORY ACCESS WITH A GPU? THERE ARE SO MANY DIFFERENT TYPES, BUSES AND CACHES INVOLVED…
[Block diagram: an Accelerated Processing Unit (APU) — CPU cores with per-core L1 data caches, shared instruction cache/FPU/L2, plus an on-chip GPU with 1..N compute units — next to a discrete GPU. Each GPU compute unit pairs an H-CU engine with a texture unit (TU), L1 texture cache (TC) and LDS; the GPU additionally has a global data share, constant cache, instruction cache and L2 cache. The APU's GPU reaches memory through the HSA MMU (IOMMUv2) and the shared memory controller to DDR3; the discrete GPU sits across PCIe with its own memory controller and GDDR5. Legend: LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache.]
  • 4. THE TYPICAL APPLICATION'S VIEW OF MEMORY (1)
A "GEDANKENEXPERIMENT", COMBINING EINSTEIN AND TRON: IMAGINE YOU ARE A CPU CORE EXECUTING AN APPLICATION THREAD, ACCESSING DATA…
! Today's operating systems have an application model based on a user process view of the system
‒ Each application is associated with a process, and the OS isolates the address space of one process from any other on the system; this is enforced by hardware (MMU = "Memory Management Unit")
‒ Each CPU core may operate independently on a "thread" within that process
‒ The application code has a "flat" view of memory: it can allocate memory from the OS, write & read data at that address, etc.
‒ The address may be represented by a 32-bit or 64-bit (44/48-bit) wide pointer value
‒ The memory content may not even be resident in physical memory, paged in from backing storage when accessed, maybe pushing other content out
‒ CPU caches keep an often-used "working set" of data close to the CPU core's execution units
‒ CPU cache coherency mechanisms invalidate cache content when "outside forces" (typically other CPU cores) update the content of system memory at a given address, ensuring that each CPU core sees the same data
[Diagram: the process VA space (CPU) — user process space, kernel mode address space and the noncanonical VA range up to 2^64-1 — is mapped page by page via the CPU MMU into system physical memory (up to 2^44-1), managed by the OS.]
  • 5. THE TYPICAL APPLICATION'S VIEW OF MEMORY (2)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…
! GPUs are typically managed as devices by operating systems:
‒ They can only access physical memory pages as far as the OS memory management is concerned, though the GPU may use "virtual addresses"
‒ GPU accessible system memory is "page-locked" and can't move while the memory may be accessible by the GPU, even if it's currently not used at all
‒ The total amount of memory a GPU can access at a time is limited to the amount of page-locked memory or frame buffer memory
! GPU accessible memory allocations are handled via special APIs (DirectX, OpenGL, OpenCL, etc.)
‒ The memory is managed as single objects (buffers, resources, textures, …); "malloc()-ed" memory is typically not directly accessible by the GPU
‒ CreateResource(), CreateBuffer(), CreateTexture()…
‒ The API typically only provides a "handle" referencing the object
‒ To access the memory content (all or part of it), an API provides functions like MapResourceView(), Lock(), Unlock() or similar, establishing "windows" in the address space to that memory for either GPU or CPU, or putting it into staging buffers
‒ Consider the resource "handle value + offset" as just a special kind of "address" outside of the regular process address space ☺
[Diagram: a GPU virtual address space managed by the gfx driver, mapped via the GPU MMU both into page-locked system physical memory and into GPU physical memory (e.g. the discrete frame buffer), alongside the CPU-MMU-mapped process VA spaces.]
  • 6. THE TYPICAL APPLICATION'S VIEW OF MEMORY (3)
NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…
! The good thing about API-controlled access is that the OS & driver can copy the content someplace else and/or into a different format
‒ where it can be more efficiently stored or processed (e.g. 2D tiling)
‒ The bad thing about it is that it's an either/or style of access
‒ For frequent accesses from both CPU & GPU, the translation can be tediously slow
‒ Content that can be accessed by both CPU and GPU simultaneously needs data visibility/coherency rules, leading to the next issue…
! Data visibility (cache coherency) is typically software-managed
‒ CPU cache coherency, when accessing system memory potentially updated by a GPU, may not always be guaranteed
‒ depending on the system configuration (e.g. PCIe bus access)
‒ One reason is the hardware complexity of making this performant
‒ Depending on the use scenario, the GPU accessible memory is mapped as "writethrough", "uncached" or "writecombined" by the OS APIs
‒ GPU caches are typically explicitly managed by the driver and need to be refreshed when the CPU updates memory content
[Diagram: same address space mapping as before — the CPU-MMU-mapped process VA spaces and the GPU-MMU-mapped GPU virtual address space, backed by system physical memory and the discrete frame buffer.]
  • 7. IT'S ALL ABOUT THROUGHPUT, BANDWIDTH AND LATENCY… KEEP YOUR DATA CLOSE AND YOUR FREQUENTLY USED DATA EVEN CLOSER…
[Same block diagram as slide 3, annotated with rough figures: CPU caches deliver 100's of GB/s and GPU caches 100's-1000's of GB/s, both at <1-10's of cycles latency; DDR3-2133 system memory runs at ~17 GB/s; GDDR5 at 3 GHz MCLK at ~90 GB/s; an x16 PCI-E 3.0 link carries ~15 GB/s; memory and bus accesses cost 10's-100's of cycles. Legend: LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache.]
  • 8. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (1)
! The efficient use of a GPU & CPU in a system depends on understanding how they operate on memory
‒ The cache architecture on either CPU or GPU is a reflection of the different access patterns for their "preferred" workloads and data, and so is the cache management/optimization
! CPUs are typically built to operate on general purpose, serial instruction threads, often with high data locality, lots of conditional execution, and dealing with data interdependency
‒ The CPU cache hierarchy is focused on general purpose data access from/to execution units, feeding back previously computed data to the execution units with very low latency
‒ Comparatively few registers (vs GPUs), but large caches keep often-used "arbitrary" data close to the execution units
! GPUs are usually built for a SIMD execution model
‒ Apply the same sequence of instructions over and over on data with little variation but high throughput ("streaming data"), passing the data from one processing stage to another (latency tolerance)
‒ Compute units have a relatively large register file store
‒ Use a lot of "specialty caches" (constant cache, texture cache, etc.), with data caches optimized for SW data prefetch
‒ LDS, GDS mainly used for in-wavefront or inter-wavefront updates & synchronization
‒ Data caches are typically explicitly flushed by software
  • 9. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (2)
! The GPU memory & cache access design is well-suited for typical 2D & 3D graphics workloads (duh!)
‒ Vertex data, textures, etc. are passed from the host to the various stages of the graphics API pipeline, with each stage allowing processing of the data passing through via appropriate instruction sequences ("shaders")
‒ Since a lot of the data is "static" and the access is abstracted via APIs, it can be put into better suited data formats, mapping 2D/3D pixel coordinate "locality" to memory locality in internal buffers within the graphics pipeline
‒ Very beneficial for performance, but not easily "accessible" by simple addressing schemes; requires a copy of the data first
‒ Today's graphics APIs (OpenGL, Direct3D) are well suited for this workload, but often must focus on the lowest-common denominator in hardware capabilities
‒ The API design assumes that no cache coherency between CPU and GPU may exist, requiring the CPU to issue explicit cache flushes or operate on memory areas mapped as "uncached" if readback of GPU data is required
‒ Some extensions or recently introduced features exist for "zero copy" memory
[Figure: 2D tiling — a surface split into 16x16 tiles maps X/Y pixel coordinates to memory addresses tile by tile (X0,Y0 … X15,Y0, X0,Y1 …), instead of full row by full row as in a linear layout.]
  • 10. IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (3)
! Vector/matrix-oriented compute workloads map well to GPUs, but until now "suffer" from some of the choices that benefit the graphics data processing flow
‒ Compute APIs like OpenCL™ or DirectCompute are often still inherently tied to the low-level, graphics-focused GPU infrastructure in today's OSs (e.g. memory management through Microsoft® WDDM, Linux® TTM/GEM)
‒ "Zero copy" support and system memory buffer cache coherency in recent APIs improve the behavior on platforms that have appropriate support, but still carry some SW overhead for access
‒ All the memory processed by the GPU is referenced through handles to control memory page-lock on workload dispatch, and the SW needs to create "buffer views" either explicitly or under the covers to access regular memory
‒ There is quite some SW overhead involved in that
! Discrete GPUs have excellent compute performance (several TeraFLOPS for even mid-range cards)
‒ But they require the data to be accessible in local memory for best performance, requiring copy operations from host memory and "keeping the data on the other side" as long as possible
‒ Accessing or pushing the data back and forth through the PCIe bottleneck may reduce or eliminate speedup gains, or increase access latency from the host substantially
  • 11. HOW DO HUMA AND HSA CHANGE THINGS?
! First, let's redraw the address layout map from before…
‒ It's the same layout, just a different visualization (focus on bit 47 ☺)
! The GPU's virtual address page table mapping is set to a process address view of the memory space
‒ A data pointer has the same "meaning" (= points to the same content) in system memory (also known as "ptr-is-ptr")
‒ On OSs that support HSA MMU functionality, the page tables may even be shared and the OS may support native GPU demand paging
‒ The GPU may still support additional address ranges for special purposes (e.g. frame buffer memory, LDS, scratch, …)
! There is efficient hardware support for GPU & CPU cache coherency on memory load/store operations by the GPU
‒ Reads and updates of system memory from one will cause cache line flushes or line invalidation on the other processors in the system
‒ SW does not have to deal with explicit cache line flushes or invalidations for such transactions anymore; it works like for any CPU core in the system
‒ This fully works for APUs, where GPU and CPU have access to the same system memory controller; partial support for discrete GPUs
‒ Platform atomics are supported, for efficient synchronization
[Diagram: the process VA space (CPU) and the GPU virtual address space now share the same layout; both are mapped via the CPU MMU and the HSA MMU to the same pages in system physical memory, managed by the OS & gfx driver, with the FB aperture and GPU physical memory (e.g. discrete frame buffer) still available for special ranges.]
  • 12. THERE ARE STILL REASONS FOR THE "BUFFERED VIEW" OF MEMORY
! HSA and hUMA are very useful for compute jobs and graphics data often updated by the host CPU
‒ They allow fine-grained "interactive" sharing of data between CPU and GPU threads without requiring prophylactic cache flushes and other synchronization
! But the "direct view" and access to common memory is less beneficial for other graphics data
‒ Many graphics algorithms have been designed with an "abstract" or "deferred" view of memory, focusing on "dimensional addressing" of the data in the shaders (e.g. x/y/z, u/w coordinates)
‒ Many GPUs use hardware-specific texture tiling formats that are optimized for a specific memory channel layout to reach maximum performance, which are complicated to address by software in a general way
‒ An application may have multiple graphics contexts concurrently per process (per API), vs just one for the "flat" view
‒ A lot of graphics data (e.g. textures, vertices, et al.) does not change often through CPU updates
‒ requiring cache coherency there increases HW access overhead for little benefit
‒ Many specialty resources (e.g. the Z-buffer) have a GPU-specific implementation with no "external" visibility
‒ Leveraging the much higher performance of a discrete GPU and its frame buffer memory is somewhat more complicated if an application needs to deal with the memory location directly
! Most common graphics APIs today don't know how to deal with virtual addresses
‒ This will change in the future as utilizing virtual addresses within graphics APIs becomes commonplace
  • 13. GRAPHICS INTEROPERATION IS IMPORTANT
! There are many different graphics/GPU APIs in use, using buffers/resources to access memory
‒ As seen before, there are good reasons to keep the content in "buffers", either due to legacy or performance
‒ It also may not make sense to "waste" virtual address space, e.g. in 32-bit apps, for resources not accessed by the host
‒ But this may also make it harder to access the content from either the CPU or a "flat addressing" aware GPU
! Explicit interoperation APIs to traditional graphics APIs provide two views of a resource
‒ The translation between "handle + offset" and "flat address" is dealt with within the runtime and driver
‒ The translation itself may be straightforward and very efficient, however
! Specialty GPU resources (e.g. LDS, scratch) may be mapped into the "flat" process address space, but may not be accessible by the CPU host since they're not reachable from the "outside"
‒ This is no different than some other system memory mappings provided by the OS
! Applications should focus on efficient processing of the data on the "compute" side, with a dedicated handover to the "graphics" side when appropriate
‒ As graphics APIs are updated over time to take advantage of flat addressing models (e.g. for "bindless textures"), the need for the interoperation mechanisms may gradually vanish for most graphics data
  • 14. ADDITIONAL CONSIDERATIONS
! A lot of today's PC systems have more than one GPU available to the programmer
‒ Almost all of today's CPUs are actually APUs and have both CPU and GPU on chip, using the same memory controller
‒ On performance systems, a discrete GPU with dedicated frame buffer memory may be present, too
! The integrated GPU may support cache coherency for system memory updates and is therefore preferable for GPU compute tasks via e.g. DirectCompute or OpenCL™
‒ The performance uplift vs the CPU may differ, but there often is a >10x factor for vector computations vs equivalent CPU instructions
! A discrete GPU can focus on graphics workload acceleration, further processing the data pre-processed by either the host CPU or the integrated GPU for additional uplift
‒ Dedicated transfer from/to the discrete GPU frame buffer
‒ For appropriate compute workloads, consider the additional performance uplift from compute on the discrete GPU
! The controls may be in a driver as part of collaborative rendering (e.g. AMD DualGraphics), where the compute processing on the integrated GPU via appropriate APIs interoperates with the "graphics" device
‒ The graphics driver operates in a "Crossfire" mode for integrated and discrete GPU
‒ Whereas the compute workload operates on a DirectCompute or OpenCL™ "device" on the integrated GPU
  • 15. SUMMARY
! HSA and hUMA substantially simplify data exchange between GPU and CPU, allowing processing on both sides
‒ Benefits from a flat address model where data pointer references to content can be resolved on either side
‒ It works best for compute-heavy workloads, where frequent data updates and result retrieval are important
! There are still benefits to keeping some graphics data in a "buffered" address mode through graphics APIs
‒ This leverages "specialty caches", discrete GPUs and storage within the GPU that is optimized for graphics data but makes it "less accessible" for CPU host access
! With appropriate, efficient interoperation between the "buffered" and the "flat" resource view on the GPU, the application can easily traverse between these two data representations
‒ An HSA compliant GPU allows for a very efficient translation between these two representations
‒ Current compute & graphics APIs can be supported in this scheme
‒ With native support for a "flat model" in upcoming modern OSs, direct, "flat", cache coherent references to memory resources will become easier to use directly over time, reducing the need for explicit translation
! Take advantage of all the GPUs and all the memory you find on a system!
‒ There's often more than one, and all have their advantages
  • 16. WHERE TO FIND MORE INFORMATION — THIS PRESENTATION IS ONLY A START…
! AMD Accelerated Parallel Processing (APP) SDK:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
‒ The AMD APP SDK is a complete development platform, providing samples, documentation and other materials to quickly get you started using OpenCL™, Bolt (an open source C++ template library for GPU parallel processing), C++ AMP, or Aparapi for Java applications
! AMD CodeXL:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
‒ A powerful tools suite for Windows® and Linux® heterogeneous application debugging and profiling
‒ Works standalone and integrated, e.g. as a Visual Studio extension
! AMD Developer Central: http://developer.amd.com
‒ Docs, whitepapers, tools; everything you want to know and need to write performant programs on heterogeneous systems
‒ It's not about either CPU or GPU, it's about both…
  • 17. GO AHEAD ☺
  • 18. DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Corp., Linux is a trademark of Linus Torvalds, and Microsoft is a trademark of Microsoft Corp. PCI Express is a trademark of PCI SIG Corporation. Other names are for informational purposes only and may be trademarks of their respective owners.