PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the OpenJDK Graal infrastructure, by Vasanth Venkatachalam

WHOLLY GRAAL: ENABLING GPU ACCELERATION OF JAVA USING THE OPENJDK GRAAL COMPILER.

VASANTH VENKATACHALAM

AGENDA

•  Why should you be interested in GPU offload?
•  Java execution model
•  Requirements for Java GPU enablement
•  Sumatra OpenJDK Project for Java GPU enablement
•  Heterogeneous System Architecture (HSA)
•  Introduction to Graal
•  JDK8 based Graal prototype for Java GPU offload
•  Future work
•  Summary

2 | PRESENTATION TITLE | November 20, 2013 | CONFIDENTIAL

WHY SHOULD YOU BE INTERESTED IN GPU OFFLOAD?

•  In many instances, offloading the data-parallel parts of a program to a GPU will improve performance compared to running the entire program on the CPU
   ‒ A typical GPU offers more cores than a CPU for the same density
   ‒ AMD Radeon™ HD 7750 features 512 Stream Processors!
   ‒ In a data-parallel computation, the same computation is repeated over different data and the results do not depend on each other, so the individual computations can be executed in parallel on multiple cores
•  Example: Squaring array elements

    for (int i = 0; i < in.length; i++) {
        out[i] = in[i] * in[i];
    }

    in[0]*in[0]    in[1]*in[1]    in[2]*in[2] …
       core0          core1          core2
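The squaring example can be tried on an ordinary JVM. The sketch below is illustrative only: the class and method names (`SquareDemo`, `squareSequential`, `squareParallel`) are made up for this example, and the parallel version runs on CPU threads through the standard Stream API rather than on a GPU. It shows that the loop body is independent per index, which is the property that makes it a candidate for offload.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical demo class; not part of the Graal/Sumatra prototype.
final class SquareDemo {
    // Sequential version: one core walks the whole array.
    static int[] squareSequential(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * in[i];
        }
        return out;
    }

    // Data-parallel version: every index is independent of the others,
    // so the runtime may spread the iterations across available cores.
    static int[] squareParallel(int[] in) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length).parallel().forEach(i -> out[i] = in[i] * in[i]);
        return out;
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(Arrays.toString(squareSequential(in))); // [1, 4, 9, 16]
        System.out.println(Arrays.toString(squareParallel(in)));   // [1, 4, 9, 16]
    }
}
```

Both methods produce the same array; only the scheduling of the iterations differs.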

JAVA EXECUTION MODEL

•  Java is a managed runtime language, a language that runs on top of a virtual machine (VM)
   ‒ Other managed runtime languages include Ruby, JavaScript, Python, Scala
•  Java source is compiled into an intermediate format called bytecode
•  The Java virtual machine (JVM) executes the bytecodes using interpretation or just-in-time compilation
   ‒ Interpretation executes the bytecodes directly, instruction by instruction
   ‒ Just-in-Time (JIT) compilation involves compiling bytecodes into machine code at runtime and executing the machine code
•  Examples of JVMs include:
   ‒ Oracle HotSpot™ JVM
   ‒ IBM J9 VM

Java Source Code (Hello World)

    public static void main(String[] args) {
        System.out.println("Hello");
    }

-> Java Source Compiler ->

Java Bytecodes (Hello World)

    0: getstatic     #13
    3: ldc           #19
    5: invokevirtual #21
    8: return

-> Java Virtual Machine -> Machine code

REQUIREMENTS FOR JAVA GPU ENABLEMENT

•  Java needs a programming model to express data-parallel workloads
   ‒ Java 8, for example, has the Stream API with support for lambda constructs
•  The Java Virtual Machine (JVM) needs to generate code for the GPU as well as the CPU
   ‒ The JVM has to target multiple Instruction Set Architectures (ISAs)
•  Ideally, the JVM can generate code in a standard intermediate language that can be translated into the native machine instructions of each GPU target
   ‒ This allows for portability
   ‒ Any update to the GPU ISA affects only the translation of this intermediate language into the GPU machine instructions


SUMATRA OPENJDK PROJECT FOR JAVA GPU ENABLEMENT

•  Open source project intending to enable Java applications to take advantage of the GPU
   ‒ More or less transparently to the application
•  Project started by Oracle and AMD shortly before JavaOne 2012
•  We are developing a prototype of Sumatra using the Heterogeneous System Architecture (HSA) and the Graal OpenJDK project
   ‒ Backend for the Graal Just-In-Time (JIT) compiler which compiles Java into HSAIL for GPU execution
   ‒ Project home page: http://openjdk.java.net/projects/graal/


EXAMPLE OF THE KINDS OF CODE WE'D LIKE TO RUN ON THE GPU

    class NameInfo {
        private String name;
        private boolean exists;

        // Records whether this name occurs anywhere in the given text.
        public void checkExistsIn(String text) {
            exists = text.contains(name);
        }
    }

    NameInfo allNames[];
    String longText;

    // One independent check per name; needs java.util.stream.IntStream.
    IntStream istr = IntStream.range(0, allNames.length);
    istr.forEach(i -> {
        allNames[i].checkExistsIn(longText);
    });

Our prototype can handle this today!
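The slide's fragment leaves out construction and initialization. A self-contained version might look like the sketch below; the wrapper class `NameSearchDemo`, the constructor, the `exists()` accessor and the sample strings are additions for illustration and are not taken from the prototype. It runs on the CPU with the standard Stream API and only shows the shape of the workload.

```java
import java.util.stream.IntStream;

// Hypothetical wrapper so the fragment compiles and runs on its own.
final class NameSearchDemo {
    static final class NameInfo {
        private final String name;
        private boolean exists;

        NameInfo(String name) {
            this.name = name;
        }

        void checkExistsIn(String text) {
            exists = text.contains(name);
        }

        boolean exists() {
            return exists;
        }
    }

    // Each index touches only its own NameInfo object, so the
    // iterations do not depend on one another.
    static void checkAll(NameInfo[] allNames, String longText) {
        IntStream.range(0, allNames.length)
                 .forEach(i -> allNames[i].checkExistsIn(longText));
    }

    public static void main(String[] args) {
        NameInfo[] allNames = { new NameInfo("Graal"), new NameInfo("Fortran") };
        checkAll(allNames, "Graal compiles Java into HSAIL");
        for (NameInfo info : allNames) {
            System.out.println(info.name + " -> " + info.exists());
        }
    }
}
```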


HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)

•  Heterogeneous System Architecture standardizes CPU/GPU functionality via a common intermediate language (HSAIL) and runtime (the HSA stack)
   ‒ ISA-agnostic for both CPUs and accelerators
   ‒ Supports high-level programming languages
•  HSA makes a great platform for GPU offload
   ‒ Shared Virtual Memory
   ‒ Direct access to heap objects in main memory from GPU cores
   ‒ In other words, "a pointer is a pointer"
   ‒ Eliminates the overhead of copying data from CPU to GPU
   ‒ Eliminates the overhead of bookkeeping pointers
•  Specifications and simulator available from HSA Foundation
   ‒ http://hsafoundation.com/
   ‒ http://hsafoundation.com/standards/


HSA PARALLEL EXECUTION MODEL

•  Grid based execution model
•  Programmer supplies a "kernel" that is run on each work-item
•  Kernel is written as a single thread of execution and represents the main body of work each work-item will execute
•  Each work-item has a unique id
•  Programmer specifies the number of work-items (for scope of problem)

    for (int i = 0; i < in.length; i++) {
        out[i] = in[i] * in[i];
    }

    work-item: in[i]*in[i]
    Grid size: number of multiplications to be done
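The relationship between kernel, work-item id and grid size can be mimicked in plain Java. This is a toy model only: `GridDemo` and its `dispatch` method are hypothetical names, and the loop below runs the work-items one after another on the CPU, whereas a real HSA dispatch would run them in parallel on the device.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

// Toy model of a grid dispatch; not an HSA runtime API.
final class GridDemo {
    // Runs the kernel once per work-item, handing each one its
    // unique id in the range [0, gridSize).
    static void dispatch(int gridSize, IntConsumer kernel) {
        for (int workItemId = 0; workItemId < gridSize; workItemId++) {
            kernel.accept(workItemId);
        }
    }

    static int[] square(int[] in) {
        int[] out = new int[in.length];
        // The kernel is the loop body, written for a single work-item.
        IntConsumer kernel = id -> out[id] = in[id] * in[id];
        // Grid size = number of multiplications to be done.
        dispatch(in.length, kernel);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(square(new int[] {1, 2, 3}))); // [1, 4, 9]
    }
}
```

The `for` loop of the original program disappears from the kernel: the loop bound becomes the grid size and the loop variable becomes the work-item id.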

HSAIL PRIMER

•  HSAIL is the code that the Graal backend will emit
•  Gets translated to the ISA of the GPU device by a runtime layer known as the "finalizer"
•  Generated code is in ASCII text form, which aids in debugging
•  Example: signed 32-bit multiplication

    mul_s32 $s3, $s0, $s1

    mul    Mnemonic (mul, add, sub, div, etc.)
    s      Type modifier (s, u, b, f)
    32     Length modifier (1, 8, 16, 32, 64, etc.)
    $s3    Destination
    $s0    Source1
    $s1    Source2
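To make the anatomy of an instruction concrete, the following sketch splits a line such as `mul_s32 $s3, $s0, $s1` into the parts labelled above. The class `HsailInstruction` and its `parse` method are hypothetical helpers written for this example. They handle only the simple `<mnemonic>_<type><length> <destination>, <source>...` shape shown on the slide and are not a general HSAIL parser.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; covers only the simple instruction shape shown above.
final class HsailInstruction {
    final String mnemonic;      // e.g. "mul"
    final char type;            // type modifier: s, u, b or f
    final int length;           // length modifier in bits, e.g. 32
    final String destination;   // e.g. "$s3"
    final List<String> sources; // e.g. [$s0, $s1]

    private HsailInstruction(String mnemonic, char type, int length,
                             String destination, List<String> sources) {
        this.mnemonic = mnemonic;
        this.type = type;
        this.length = length;
        this.destination = destination;
        this.sources = sources;
    }

    static HsailInstruction parse(String text) {
        // Separate the opcode ("mul_s32") from the operand list.
        String[] parts = text.trim().split("\\s+", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("missing operands: " + text);
        }
        String[] opcode = parts[0].split("_");
        if (opcode.length != 2 || opcode[1].length() < 2) {
            throw new IllegalArgumentException("unsupported opcode: " + parts[0]);
        }
        char type = opcode[1].charAt(0);
        int length = Integer.parseInt(opcode[1].substring(1));
        // Drop a trailing ';' if present, then split on commas.
        String[] operands = parts[1].replace(";", "").trim().split("\\s*,\\s*");
        List<String> sources = Arrays.asList(operands).subList(1, operands.length);
        return new HsailInstruction(opcode[0], type, length, operands[0], sources);
    }

    public static void main(String[] args) {
        HsailInstruction ins = parse("mul_s32 $s3, $s0, $s1");
        System.out.println(ins.mnemonic + " " + ins.type + ins.length
                + " dst=" + ins.destination + " src=" + ins.sources);
    }
}
```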

INTRODUCTION TO GRAAL

•  Graal is a highly extensible, open-source, just-in-time compiler for Java
   ‒ Project home page:
•  Graal is written in Java
   ‒ Graal can be developed using existing Java IDEs (e.g., Eclipse, NetBeans), making it straightforward to debug
   ‒ Because Graal is written in Java, it can run on any platform and be treated as a cross-compiler
   ‒ In particular, Graal can compile Java for the GPU while running on the CPU
•  Graal is being used to develop a centralized framework (Truffle) for executing JVM languages
   ‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
   ‒ Adding a GPU backend to Graal potentially opens the door to GPU enablement of other JVM languages using a centralized framework, but this is an area for future work


JDK8 BASED GRAAL PROTOTYPE FOR JAVA GPU ENABLEMENT

•  Graal has been extended with a prototype backend that generates HSAIL code for GPU execution
•  This allows portions of Java 8 programs using the Stream API to be compiled into HSAIL
•  This prototype has been tested using a simulator as well as real hardware
   ‒ On Mandelbrot we get a speedup of 10x running on the GPU versus running using Java threads on the x86 CPU
•  The HSAIL backend has been checked into the Graal OpenJDK repository


HOW GRAAL WORKS WITH THE HSA RUNTIME STACK FOR A JAVA PROGRAM

[Diagram: layers of the stack]

    Java Application
    Java JDK Stream + Lambda API
    Java Graal JIT backend (HSAIL): Graal IR generation/optimization -> HSAIL code generation -> HSAIL code
    HSAIL finalizer and runtime -> GPU ISA -> GPU
    JVM -> CPU ISA -> CPU


EXAMPLE HSAIL CODE GENERATED FOR A SAMPLE JAVA PROGRAM

    IntStream.range(0, in.length).forEach(i -> {
        out[i] = in[i] * in[i];
    });

What the compiler sees!

    private static void lambda$67(int[] out, int[] in, int i)   // i: parameter passed to lambda
    {
        out[i] = in[i] * in[i];
    }

•  Data-parallel execution model
•  Each workitem has a unique id
•  workitemabsid instruction returns the id of the current workitem

    kernel &run (
        kernarg_u64 %_arg0,
        kernarg_u64 %_arg1
    ) {
        ld_kernarg_u64    $d6, [%_arg0];      // Parameter passing
        ld_kernarg_u64    $d2, [%_arg1];
        workitemabsid_u32 $s1, 0;             // Load id of current workitem
        cvt_s64_s32       $d0, $s1;
        mul_s64           $d0, $d0, 4;
        add_u64           $d2, $d2, $d0;
        ld_global_s32     $s0, [$d2 + 24];    // Load in[i]
        mul_s32           $s3, $s0, $s0;      // in[i] * in[i]
        cvt_s64_s32       $d1, $s1;
        mul_s64           $d1, $d1, 4;
        add_u64           $d6, $d6, $d1;
        st_global_s32     $s3, [$d6 + 24];    // Store to out[i]
        ret;
    };
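The address arithmetic in the kernel can be followed step by step in Java. The sketch below is a simulation under stated assumptions, not something the prototype contains: it models each array object as a `ByteBuffer`, and it assumes that the constant 24 in the listing is the byte offset of element 0 inside an `int[]` object and that 4 is the element size. The names `KernelAddressDemo`, `fakeArrayObject` and `runWorkItem` are invented for this example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Simulation only; the object layout here is an assumption read off the listing.
final class KernelAddressDemo {
    static final int ARRAY_BASE_OFFSET = 24; // assumed header size, from "[$d2 + 24]"
    static final int INT_SIZE = 4;           // from "mul_s64 $d0, $d0, 4"

    // Builds a buffer shaped like the assumed int[] object: header, then elements.
    static ByteBuffer fakeArrayObject(int[] values) {
        ByteBuffer object = ByteBuffer.allocate(ARRAY_BASE_OFFSET + values.length * INT_SIZE);
        for (int i = 0; i < values.length; i++) {
            object.putInt(ARRAY_BASE_OFFSET + i * INT_SIZE, values[i]);
        }
        return object;
    }

    // One work-item of the kernel, mirroring the instruction sequence.
    static void runWorkItem(ByteBuffer out, ByteBuffer in, int workItemId) {
        long d0 = workItemId;                               // cvt_s64_s32
        d0 = d0 * INT_SIZE;                                 // mul_s64 ..., 4
        int s0 = in.getInt((int) d0 + ARRAY_BASE_OFFSET);   // add_u64 + ld_global_s32
        int s3 = s0 * s0;                                   // mul_s32
        out.putInt((int) d0 + ARRAY_BASE_OFFSET, s3);       // add_u64 + st_global_s32
    }

    static int[] square(int[] values) {
        ByteBuffer in = fakeArrayObject(values);
        ByteBuffer out = fakeArrayObject(new int[values.length]);
        for (int id = 0; id < values.length; id++) {
            runWorkItem(out, in, id);
        }
        int[] result = new int[values.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.getInt(ARRAY_BASE_OFFSET + i * INT_SIZE);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(square(new int[] {1, 2, 3}))); // [1, 4, 9]
    }
}
```

In the listing, `$d6` holds the first kernel argument (`out`) and `$d2` the second (`in`), matching the parameter order of `lambda$67`.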

EXAMPLES OF FUNCTIONALITY WE SUPPORT

•  Some Math intrinsics:

    IntStream.range(0, in.length).forEach(i -> {
        out[i] = Math.sqrt(in[i]) * in[i];
    });

•  Arrays, string manipulation routines, calls to some JDK methods

    IntStream.range(0, boolArray.length).forEach(i -> {
        boolArray[i] = inArray[i].contains("hello");
    });

•  instanceof operator

    Shape shapeArray[];
    return shapeArray[i] instanceof Circle;


EXAMPLES OF FUNCTIONALITY WE SUPPORT

•  IntStream use case

    public class Point {
        public double x;
        public double y;
    }

    Point[] pointArray;
    IntStream.range(0, pointArray.length).forEach(i -> {
        pointArray[i].x++;
    });

•  ObjectStream from Array

    Arrays.stream(pointArray).forEach(p -> {
        p.x++;
    });
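The two forms, index-based and object-based, update the same objects. A runnable CPU-only sketch follows; the wrapper class `PointStreamDemo`, the constructor and the method names are illustrative additions.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical wrapper around the Point examples.
final class PointStreamDemo {
    static final class Point {
        double x;
        double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // IntStream form: the lambda receives an index into the array.
    static void incrementByIndex(Point[] points) {
        IntStream.range(0, points.length).forEach(i -> points[i].x++);
    }

    // Object stream form: the lambda receives the element itself.
    static void incrementByObject(Point[] points) {
        Arrays.stream(points).forEach(p -> p.x++);
    }

    public static void main(String[] args) {
        Point[] points = { new Point(1.0, 0.0), new Point(2.0, 0.0) };
        incrementByIndex(points);
        incrementByObject(points);
        System.out.println(points[0].x + " " + points[1].x); // 3.0 4.0
    }
}
```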


EXAMPLES OF FUNCTIONALITY WE SUPPORT

•  ObjectStream from ArrayList

    ArrayList<Point> pointList;
    pointList.stream().forEach(p -> {
        p.x++;
    });

•  Atomic operations (patch forthcoming)

    AtomicInteger atomicInt;
    i -> {
        outArray[i] = atomicInt.incrementAndGet();
    }
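The atomic fragment only makes sense inside a stream pipeline. One possible completion is sketched below, assuming a parallel `IntStream` on the CPU; `AtomicDemo` and `assignTickets` are hypothetical names. Because the iterations race for the counter, the order of values in the output array is not fixed, but every value from 1 to n appears exactly once.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

// Hypothetical completion of the atomic-operations fragment.
final class AtomicDemo {
    // Each iteration takes the next counter value atomically, so no
    // two iterations can observe the same number.
    static int[] assignTickets(int n) {
        AtomicInteger atomicInt = new AtomicInteger();
        int[] outArray = new int[n];
        IntStream.range(0, n).parallel().forEach(i -> {
            outArray[i] = atomicInt.incrementAndGet();
        });
        return outArray;
    }

    public static void main(String[] args) {
        int[] tickets = assignTickets(8);
        Arrays.sort(tickets); // order of assignment is nondeterministic
        System.out.println(Arrays.toString(tickets)); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```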


FUTURE WORK

•  GPU enablement for managed runtime languages other than Java is an area for future work
•  One path is to develop a mechanism that allows other languages to call the Java 8 Stream API
   ‒ This could leverage existing work to make other JVM languages interoperable with Java
   ‒ http://agiledeveloper.com/presentations/integrating_jvm_languages_javaone.zip
•  Another path is to develop a centralized framework that allows JVM languages to be compiled into a format that Graal can take as input
•  Truffle is a prototype language implementation framework written in Java that uses the Graal JIT compiler
   ‒ https://wiki.openjdk.java.net/display/Graal/Truffle+FAQ+and+Guidelines
   ‒ The OpenJDK community has developed prototype implementations of JavaScript, Ruby and R on Truffle
   ‒ http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
   ‒ http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
   ‒ Using Graal as the JIT compiler potentially allows other JVM languages to take advantage of Graal's HSAIL backend


SUMMARY

•  GPU offload can improve the performance of data-parallel workloads
•  We have extended the Graal Just-In-Time compiler with a prototype backend that generates HSAIL code
•  This work opens the door to GPU acceleration for Java
•  GPU acceleration for other managed runtime languages is an area for future work
   ‒ Truffle may make this possible by providing a centralized framework for language implementation using the Graal JIT compiler
•  We encourage OpenJDK community feedback and contributions


REFERENCES

•  AMD DevCentral blog on HSAIL-based GPU Offload
   ‒ http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/
•  Sumatra OpenJDK GPU/APU offload project
   ‒ Project home page: http://openjdk.java.net/projects/sumatra/
   ‒ Wiki: https://wiki.openjdk.java.net/display/Sumatra/Main
•  Graal JIT compiler and runtime project
   ‒ Project home page:
•  HSA Foundation:
   ‒ http://hsafoundation.com/
   ‒ http://hsafoundation.com/standards/


REFERENCES

•  JVM Language Summit 2013 (JVMLS 2013)
   ‒ Wimmer and Seaton, "One VM to Rule them all":
     http://www.oracle.com/technetwork/java/jvmls2013wimmer-2014084.pdf
   ‒ Wuerthinger and Venkatachalam, "Graal and GPU offload":
     http://www.oracle.com/technetwork/java/jvmls2913wuerth-2013918.pdf
   ‒ Vitek, "R in Java":
     http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
•  JavaOne 2013
   ‒ Thalinger, Wimmer, and Venkatachalam, "Wholly Graal: Accelerating GPU offload for Java":
     https://oracleus.activeevents.com/2013/connect/fileDownload/session/C2A34A60DEDE1B2D9FE9D87733345017/CON6419_Wimmer.pdf


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

