Performance of Go on Multicore Systems

Huang Yipeng
19th November 2012
FYP Presentation
Motivation

• Multicore systems have become common
• But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO
• Because most programs are explicitly parallel
  – #Threads
  – #Cores
Motivation: Why Go?
Objective

• To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance, respectively)
  
Related Work

• Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011)
• A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011)
Related Work: Differences

[Diagram: parallelism of a shared-memory program, divided into Useful Work, Data Dependency, and Memory Contention]
  
• Shared Memory Programs: implicit parallelism (e.g. Go) vs. explicit parallelism (e.g. C & OpenMP)
• Processor Architecture: emerging platforms (e.g. ARM) vs. multicore platforms (e.g. Intel, AMD)
• Parallelism Performance: analytical models for low memory contention vs. high memory contention
Contributions

1. Insights about the parallelism performance of Go
2. Extend our analytical parallelism model for programs with lower memory contention
3. Automate performance prediction and model validation with scripts
  
Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
  
Process Methodology

[Diagram: a Go program undergoes baseline executions, producing parallelism traces from (1) hardware counters (perf stat 3.0) and (2) the run queue (proc reader); the traces feed the analytical models, which produce a parallelism prediction]
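The hardware-counter traces come from wrapping each baseline run in `perf stat`. As a minimal sketch of what the measurement scripts automate (this is not the author's actual tooling, and the benchmark binary `./matrixmul` is hypothetical), a Go driver might build and launch the counter collection like this:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// perfArgs builds the argument vector for `perf stat`, which records the
// hardware counters (cycles, instructions, cache misses) used as model inputs.
func perfArgs(events []string, command ...string) []string {
	args := []string{"stat", "-e", strings.Join(events, ",")}
	args = append(args, "--")
	return append(args, command...)
}

func main() {
	events := []string{"cycles", "instructions", "cache-misses"}
	args := perfArgs(events, "./matrixmul", "4992") // hypothetical benchmark binary
	fmt.Println("running: perf", strings.Join(args, " "))
	// Actually executing this requires perf(1) to be installed and the binary to exist.
	if out, err := exec.Command("perf", args...).CombinedOutput(); err != nil {
		fmt.Println("perf run failed:", err)
	} else {
		fmt.Print(string(out))
	}
}
```

`perf stat` prints its counter summary to stderr, which is why the sketch captures combined output; the run-queue samples would come from a separate reader polling /proc.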
  
Analytical Parallelism Model

Parallelism of a shared-memory program: m threads, n cores

[Diagram: execution divided into Useful Work, Data Dependency, and Memory Contention; number of threads m, exploited parallelism π′, contention M(n)]
  
Experimental Setup: Workloads
  
Experimental Setup: Machine

Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux kernel 3.0
  
Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
  
The Memory Contention Model

[Chart: SP (Class C); contention peaks at 9.7]

Definition: low contention problems have a contention ≤ 1.2

Observation: low contention problems exhibit a W-like pattern not captured by the model. Why does this occur?
  
Validation of Memory Cont. Model

[Charts: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)]
  
Modification of Memory Cont. Model

[Charts: Original Model: Matrix Mul vs. Revised Model: Matrix Mul]

Model revalidated...
1. For Matrix Multiplication (down from 50% error to 7%)
2. For other low contention programs
3. In Go and C
4. On Intel and ARM multicores
Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
  
Performance analysis: Go vs C

1. How much poorer is Go compared to C? Why?
   – Runtime, speedup vs #Cores
2. Could Go outperform C?
   – Runtime vs problem size
   – Runtime vs #Threads
3. Predictability of actual performance
   – Modeled vs Measured
   – Contention vs #Cores
   – Problem size vs Exploited Parallelism / Data Dependency / Contention
  
Points of Comparison

[Quadrant: unoptimized vs. optimized; compiler optimization vs. programmer optimization]

Experiment 1: Matrix Multiplication (4992×4992)
No optimization flags (-N for Go), #threads = 24
→ Go is comparable with C

Experiment 2: Matrix Multiplication (4992×4992)
-O3 optimization for C, no flag for Go, #threads = 24
→ Go is marginally slower than C

Experiment 3: Transposed Matrix Multiplication (4992×4992)
-O3 optimization for C, no flag for Go, #threads = 24
→ Go is much worse than C
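The slides do not show the benchmark code itself. A minimal Go sketch of the two kernels (ordinary multiplication, and the transposed "programmer optimization" of Experiment 3, with rows split across goroutines) might look like the following; it illustrates the technique, not the author's benchmark, and on the 2012-era Go 1.0 toolchain one would also set runtime.GOMAXPROCS to use all cores.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// multiply computes C = A*B for n×n row-major matrices, splitting
// rows across `workers` goroutines. The inner loop walks a column of b,
// which is cache-unfriendly for large n.
func multiply(a, b []float64, n, workers int) []float64 {
	c := make([]float64, n*n)
	rows := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			for i := start; i < start+rows && i < n; i++ {
				for j := 0; j < n; j++ {
					sum := 0.0
					for k := 0; k < n; k++ {
						sum += a[i*n+k] * b[k*n+j] // column walk over b
					}
					c[i*n+j] = sum
				}
			}
		}(w * rows)
	}
	wg.Wait()
	return c
}

// multiplyTransposed first transposes b so the inner loop scans both
// operands sequentially — the programmer optimization of Experiment 3.
func multiplyTransposed(a, b []float64, n, workers int) []float64 {
	bt := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			bt[j*n+i] = b[i*n+j]
		}
	}
	c := make([]float64, n*n)
	rows := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			for i := start; i < start+rows && i < n; i++ {
				for j := 0; j < n; j++ {
					sum := 0.0
					for k := 0; k < n; k++ {
						sum += a[i*n+k] * bt[j*n+k] // both rows scanned sequentially
					}
					c[i*n+j] = sum
				}
			}
		}(w * rows)
	}
	wg.Wait()
	return c
}

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{5, 6, 7, 8}
	fmt.Println(multiply(a, b, 2, runtime.NumCPU()))           // [19 22 43 50]
	fmt.Println(multiplyTransposed(a, b, 2, runtime.NumCPU())) // [19 22 43 50]
}
```

The transposition costs O(n²) but makes the O(n³) inner loop stream through memory, which is exactly the cache behavior Experiment 3 probes.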
  	
  
No Optimization: Runtime vs #Cores

[Charts: MatrixMul (#threads = 24, P size = 5K): effect of #cores on runtime; effect of #cores on times ratio]

Observations:
• Sequential: Go is 16% slower
• Parallel: Go is up to 5% faster
Reasons

Observations (in Go):
1. Instructions executed: 12% fewer
2. #Cycles: sequential 16% higher, parallel 5% fewer
3. Cache misses: sequential 27x worse, parallel similar

Conclusion:
• Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.
  
No Optimization: Parallelism (Speedup) vs #Cores

[Charts: MatrixMul (#threads = 24, P size = 5K): effect of #cores on speedup; effect of #cores on normalized speedup (against best sequential execution time)]

Observations:
• Go makes up for poor sequential performance with a higher speedup.
• Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x)
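The two speedup figures differ only in their baseline: own-base speedup uses a program's own sequential runtime, while normalized speedup uses the best sequential time across implementations. A small sketch (the runtimes are hypothetical, chosen only to mirror the 16%-slower-sequential observation):

```go
package main

import "fmt"

// speedup is the own-base speedup: a program's sequential runtime
// divided by its parallel runtime.
func speedup(seq, par float64) float64 { return seq / par }

// normalizedSpeedup instead divides the best sequential time across all
// implementations (here C's), putting Go and C on a common baseline.
func normalizedSpeedup(bestSeq, par float64) float64 { return bestSeq / par }

func main() {
	goSeq, goPar := 116.0, 10.0 // hypothetical runtimes in seconds
	cBestSeq := 100.0           // hypothetical: the C sequential run is fastest
	fmt.Printf("own-base speedup:   %.1f\n", speedup(goSeq, goPar))
	fmt.Printf("normalized speedup: %.1f\n", normalizedSpeedup(cBestSeq, goPar))
}
```

A slower sequential baseline inflates own-base speedup, which is why Go's higher speedup curve does not by itself imply a faster program.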
  
Both Optimizations: Runtime vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime; effect of #cores on times difference]

Observations:
• Sequential: Go is 400% slower
• Parallel: Go is 180-340% slower
Reasons

Observations (in Go):
1. Instructions executed: 5.2x as many
2. #Cycles: sequential 400% higher, parallel 180% higher
3. Cache misses: sequential 64% fewer, parallel 56% fewer

Conclusions:
• Go's optimization is not as mature as C's: sequential instructions reduced 1.3x vs 8x, cycles reduced 4x vs 18x
• Go has better cache management
Both Optimizations: Parallelism vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on speedup; effect of #cores on normalized speedup]

Observations:
• Go speedup is higher than C's on its own base, but significantly worse when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Compiler Optimization: Varying Problem Size

[Charts: MatrixMul -O3 (#threads = 24): effect of #cores on times difference, at P size = 5K and P size = 10K]

Observation:
• Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1

Conclusion:
• In general, Go is increasingly competitive as the problem size increases.
  
Both Optimizations: Varying Problem Size

[Chart: MatrixMul -O3 (#threads = 24): effect of problem size and #cores on times difference]

Observation:
• The times ratio decreases as the problem size increases on 1-20 cores.

Conclusion:
• There is a valley of performance at intermediate core numbers.
  
Both Optimizations: Runtime vs #threads

[Chart: MatrixMul (#cores = 24, problem size = 5K): effect of #threads on runtime]

Observation:
• Go's relative performance improves as the #threads increases.

Conclusions:
• The cost of goroutines in Go is extremely low.
• Go's performance may improve on problems with high data dependency.
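The low goroutine cost is easy to demonstrate directly. The sketch below (an illustration, not the slide's benchmark) fans one trivial task out to each of 100,000 goroutines: goroutines start with kilobyte-sized stacks and are multiplexed onto OS threads by the runtime, so this completes quickly, whereas one OS thread per task would be prohibitively expensive.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// squares computes i*i for i in [0, n) using one goroutine per element —
// deliberately absurd granularity, to show that goroutine creation is cheap.
func squares(n int) []int {
	out := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = i * i
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	const n = 100000
	start := time.Now()
	squares(n)
	fmt.Printf("spawned and joined %d goroutines in %v\n", n, time.Since(start))
}
```

This is why runtime barely degrades as #threads grows past #cores in the chart above: adding goroutines adds scheduling bookkeeping, not OS threads.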
  
Predictability of Actual Performance

• Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, and problem size
• Observations (in Go):
  – Model exhibits better accuracy
  – Memory contention does not fluctuate as #cores changes
  – Measurements consistent with assumptions as problem size changes
• Result: Go exhibits properties useful for prediction that C does not.
  
Predictability of Performance: Modeled vs Measured

Observations:
• Contention error: C (avg 15%, max 55%); Go (avg 3%, max 14%)
• Parallelism error: C (avg 18%, max 44%); Go (avg 6%, max 15%)
• Runtime error: C (avg 16%, max 47%); Go (avg 5%, max 13%)

Conclusion:
• Go has better predictability than C
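The error figures summarize the gap between modeled and measured values across runs. As a sketch of how such a summary might be computed by the validation scripts (the relative-error form |modeled − measured| / measured and the sample values are assumptions, not the author's actual script):

```go
package main

import (
	"fmt"
	"math"
)

// relErrors returns the average and maximum relative error (in percent)
// between modeled and measured values, assuming |modeled - measured| / measured.
func relErrors(modeled, measured []float64) (avg, max float64) {
	for i := range measured {
		e := math.Abs(modeled[i]-measured[i]) / measured[i] * 100
		avg += e
		if e > max {
			max = e
		}
	}
	return avg / float64(len(measured)), max
}

func main() {
	modeled := []float64{9.5, 18.0, 33.0}   // hypothetical predicted runtimes (s)
	measured := []float64{10.0, 18.0, 30.0} // hypothetical measured runtimes (s)
	avg, max := relErrors(modeled, measured)
	fmt.Printf("avg %.1f%%, max %.1f%%\n", avg, max)
}
```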
  
Predictability of Performance: Contention vs #Cores

[Chart: MatrixMul -O3 (#threads = 24, P = 17K): effect of #cores on contention factor]

Observations:
• In C, contention fluctuates (0-5.6)
• Not so much in Go (0-1)

Conclusions:
• Garbage Collection, Channel Util
• A contention factor can easily be bounded in Go to guarantee the performance of some other program.
  
Predictability of Performance: Modeling across problem sizes

• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
  
Predictability of Performance: Problem size vs Exploited Parallelism

[Charts: Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on exploited parallelism]

Observation (in Go):
• Exploited parallelism decreases only slightly as problem size increases
Predictability of Performance: Problem size vs Data Dependency

[Charts: Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on data dependency]

Observation (in Go):
• Data dependency decreases as expected as problem size increases
Predictability of Performance: Problem size vs Contention

[Charts: Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on memory contention]

Observation (in Go):
• Memory contention increases only slightly as problem size increases

Conclusion:
• Measurement inputs on small problems are more accurate in Go than in C
  
Conclusion

1. How does Go compare to C in a multicore environment?

Go's Actual Performance
– Comparable performance before, inferior performance after programmer optimization
– Consequence of different levels of optimization
– Performance margin decreases as the problem size increases on intermediate core numbers
– Cost of goroutines much lower than threads

Go's Predicted Performance
– Model exhibits better accuracy
– Memory contention does not fluctuate as #cores changes
– Measurements consistent with assumptions as problem size changes
  
Conclusion

2. Is the model extensible beyond C, traditional multicores, and high contention?
– Modified and validated for low contention problems
– Validated for the Go language
– Validated for ARM devices

3. Can we make the model easier to use?
– Formally defined validation criteria
– Wrote script to perform model validation
– Wrote script to perform performance prediction
– *Future Work* Front end for prediction
  
Compiler Optimization: Runtime vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime; effect of #cores on times difference]

Observations:
• Sequential: Go is 31% slower
• Parallel: Go is 0-28% slower
• On UMA, the times ratio decreases as #cores increases
Reasons

Observations (in Go):
1. Instructions executed: 4.5x as many
2. #Cycles: sequential 30% higher, parallel similar
3. Cache misses: sequential 10% higher, parallel 46% fewer
Compiler Optimization: Parallelism vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on exploited parallelism; effect of #cores on normalized speedup]

Observations:
• Go speedup is higher than C's on its own base, but lower when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Sequential Optimization

[Charts: no optimization; compiler optimization; compiler + programmer optimization]
  
Predictability of Performance: Modeling across problem sizes

• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
• Observation: The performance profiles in Go are consistent with expectations as problem size changes
• Result: Measurement inputs on small problems are more accurate in Go than in C
  

Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 

Recently uploaded (20)

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 

Performance of Go on Multicore Systems

  • 1. Performance of Go on Multicore Systems. Huang Yipeng, 19th November 2012, FYP Presentation
  • 2. Motivation • Multicore systems have become common • But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO
  • 3. Motivation • Multicore systems have become common • But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO • Because most programs are explicitly parallel – #Threads – #Cores
  • 5. Objective • To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance, respectively)
  • 6. Related Work • Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011) • A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011). Parallelism of a shared-memory program: Memory Contention, Useful Work, Data Dependency
  • 7. Related Work: Differences. Shared Memory Programs: Implicit Parallelism (e.g. Go) vs Explicit Parallelism (e.g. C & OpenMP). Processor Architecture: Emerging platforms (e.g. ARM) vs Multicore platforms (e.g. Intel, AMD). Parallelism Performance Analytical Models: Low Memory Contention vs High Memory Contention
  • 8. Contributions 1. Insights about the parallelism performance of Go 2. Extend our analytical parallelism model for programs with lower memory contention 3. Automate performance prediction and model validation with scripts
  • 9. Outline • Motivation • Related Work • Methodology – Approach – Validation • Evaluation • Conclusion
  • 10. Process Methodology (flow diagram): Go Program; Baseline Executions; Parallelism Traces from 1. Hardware Counters (Perf Stat 3.0) and 2. Run Queue (Proc Reader); Analytical Models; Parallelism Prediction
  • 11. Analytical Parallelism Model. Parallelism of a shared-memory program: m threads, n cores. Number of Threads: m; Exploited Parallelism: π′; Contention: M(n). Components: Memory Contention, Useful Work, Data Dependency
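The exact model is defined in the cited Tudor and Teo papers. Purely as an illustration of how the three components interact (this is a simplified Amdahl-style stand-in, not the thesis model; the function name and formula are assumptions), one can sketch:

```go
package main

import "fmt"

// predictedSpeedup is an illustrative stand-in, NOT the thesis model:
// a serial fraction f (data dependency) limits scaling in the usual
// Amdahl fashion, exploited parallelism is capped at min(m, n), and a
// contention factor >= 0 inflates the parallel part's runtime.
func predictedSpeedup(m, n int, f, contention float64) float64 {
	p := float64(m)
	if n < m {
		p = float64(n) // cannot exploit more parallelism than cores
	}
	parallelPart := (1 - f) * (1 + contention) / p
	return 1 / (f + parallelPart)
}

func main() {
	// No contention, no serial fraction: ideal speedup.
	fmt.Printf("ideal, 8 cores:     %.2f\n", predictedSpeedup(8, 8, 0, 0))
	// Contention of 1.0 halves the parallel part's throughput.
	fmt.Printf("contended, 8 cores: %.2f\n", predictedSpeedup(8, 8, 0, 1.0))
	// A 10% serial fraction caps speedup well below 8.
	fmt.Printf("f=0.1, 8 cores:     %.2f\n", predictedSpeedup(8, 8, 0.1, 0))
}
```

Even this crude form shows why a fluctuating contention factor (as measured later for C) makes runtime hard to predict, while a bounded one (as in Go) keeps predictions stable.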
  • 13. Experimental Setup: Machine. Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux Kernel 3.0
  • 14. Outline • Motivation • Related Work • Methodology – Approach – Validation • Evaluation • Conclusion
  • 15. The Memory Contention Model: SP (Class C) (chart; contention reaches 9.7)
  • 16. Validation of Memory Cont. Model. Definition: Low contention problems have a contention ≤ 1.2. Observation: Low contention problems exhibit a W-like pattern not captured by the model. Why does this occur? Benchmarks: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)
  • 17. Modification of Memory Cont. Model. Original Model vs Revised Model: Matrix Mul. Model revalidated... 1. For Matrix Multiplication (down from 50% error to 7%) 2. For other low contention programs 3. In Go and C 4. On Intel and ARM multicores
  • 18. Outline • Motivation • Related Work • Methodology – Approach – Validation • Evaluation • Conclusion
  • 19. Performance analysis: Go vs C. 1. How much poorer is Go compared to C? Why? – Runtime, speedup vs #Cores 2. Could Go outperform C? – Runtime vs Problem size – Runtime vs #Threads 3. Predictability of actual performance – Modeled vs Measured – Contention vs #Cores – Prob. size vs Exp. Parallelism / Data Dep. / Contention
  • 20. Points of Comparison (Unoptimized vs Optimized; Compiler Optimization vs Programmer Optimization). Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24. Go is comparable with C
  • 21. Points of Comparison. Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24. Go is comparable with C. Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24. Go is marginally worse than C
  • 22. Points of Comparison. Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24. Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24. Go is marginally slower than C. Experiment 3: Transposed Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24. Go is much worse than C
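The thesis code itself is not reproduced here, but the Experiment 3 kernel, matrix multiplication with the second operand transposed for cache locality, parallelized over goroutines, can be sketched as follows (function names and the strided row partition are assumptions for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// mulTransposed multiplies a (r x k) by b (k x c), with b supplied
// already transposed so both inner-loop accesses scan memory
// sequentially -- the cache-friendly variant of Experiment 3. Rows of
// the result are split across nWorkers goroutines.
func mulTransposed(a, bT [][]float64, nWorkers int) [][]float64 {
	r, k, c := len(a), len(a[0]), len(bT)
	res := make([][]float64, r)
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			for i := start; i < r; i += nWorkers { // strided row partition
				row := make([]float64, c)
				for j := 0; j < c; j++ {
					var sum float64
					for x := 0; x < k; x++ {
						sum += a[i][x] * bT[j][x] // both operands read row-major
					}
					row[j] = sum
				}
				res[i] = row
			}
		}(w)
	}
	wg.Wait()
	return res
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	bT := [][]float64{{5, 7}, {6, 8}} // transpose of {{5, 6}, {7, 8}}
	fmt.Println(mulTransposed(a, bT, 2)) // [[19 22] [43 50]]
}
```

Transposing b is the "programmer optimization" of the comparison: it rewards C heavily once -O3 vectorizes the sequential inner loop, which is why Go falls furthest behind in this configuration.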
  • 23. No Optimization: Runtime vs #Cores. Observations: • Sequential: Go is 16% slower • Parallel: Go is up to 5% faster. MatrixMul (#threads = 24, P size = 5K): effect of #cores on runtime and on times ratio
  • 24. Reasons. Observations (in Go): 1. Instructions executed: 12% less 2. #Cycles: sequential (16% higher), parallel (5% less) 3. Cache Misses: sequential (27x worse), parallel (similar). Conclusions: • Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.
  • 25. No Optimization: Parallelism (Speedup) vs #Cores. Observations: • Go makes up for poor sequential performance with a higher speedup. • Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x). MatrixMul (#threads = 24, P size = 5K): effect of #cores on speedup and on normalized speedup (against best sequential execution time)
  • 26. Both Optimizations: Runtime vs #Cores. Observations: • Sequential: Go is 400% slower • Parallel: Go is 180-340% slower. MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime and on times difference
  • 27. Reasons. Observations (in Go): 1. Instructions executed: 5.2x as many 2. #Cycles: sequential (400% higher), parallel (180% higher) 3. Cache Misses: sequential (64% less), parallel (56% less). Conclusions: • Go's optimization is not as mature as C's (sequential instructions reduced 1.3x vs 8x, cycles reduced 4x vs 18x) • Go has better cache management
  • 28. Both Optimizations: Parallelism vs #Cores. Observations: • Go's speedup is higher than C's on its own base, but significantly worse when normalized. • Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size? MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on speedup and on normalized speedup
  • 29. Compiler Optimization: Varying Problem Size. Observation: • Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1. Conclusion: • In general, Go is increasingly competitive as the problem size increases. MatrixMul -O3 (#threads = 24): effect of #cores on times difference at P size = 10K vs P size = 5K
  • 30. Both Optimizations: Varying Problem Size. MatrixMul -O3 (#threads = 24): effect of problem size and #cores on times difference. Observation: • The times ratio decreases as the problem size increases on 1-20 cores. Conclusion: • There is a valley of performance on intermediate core numbers.
  • 31. Both Optimizations: Runtime vs #Threads. Observation: • Go's relative performance improves as the #threads increases. Conclusions: • The cost of goroutines in Go is extremely low. • Go's performance may improve on problems with high data dependency. MatrixMul (#cores = 24, Problem size = 5K): effect of #threads on runtime
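The low goroutine cost noted above is easy to demonstrate: unlike OS threads, goroutines are multiplexed onto a small set of OS threads by the runtime, so heavily oversubscribing cores barely changes the result or the cost. A hypothetical sketch (not the thesis benchmark):

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares splits the work of summing 1^2..n^2 across m goroutines.
// m can comfortably exceed the core count: each goroutine handles a
// strided slice of the range and writes to its own slot, so no locking
// is needed inside the loop.
func sumSquares(n, m int) int64 {
	partial := make([]int64, m)
	var wg sync.WaitGroup
	for g := 0; g < m; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := g + 1; i <= n; i += m { // strided partition of 1..n
				partial[g] += int64(i) * int64(i)
			}
		}(g)
	}
	wg.Wait()
	var total int64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	// Same answer with 4 goroutines or 4096; spawning thousands of
	// goroutines adds little overhead compared to OS threads.
	fmt.Println(sumSquares(100000, 4))
	fmt.Println(sumSquares(100000, 4096))
}
```

This is why the experiments could hold #threads = 24 while sweeping #cores: creating more goroutines than cores costs Go little, whereas the equivalent pthread oversubscription in C is far more expensive.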
  • 32. Predictability of Actual Performance. • Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, and problem size • Observations (in Go): – Model exhibits better accuracy – Memory contention does not fluctuate as #cores changes – Measurements are consistent with assumptions as problem size changes • Result: Go exhibits properties useful for prediction that C does not.
  • 33. Predictability of Performance: Modeled vs Measured. Observations: • Contention Error – C (Avg: 15%, Max: 55%) – Go (Avg: 3%, Max: 14%) • Parallelism Error – C (Avg: 18%, Max: 44%) – Go (Avg: 6%, Max: 15%) • Runtime Error – C (Avg: 16%, Max: 47%) – Go (Avg: 5%, Max: 13%). Conclusion: • Go has better predictability than C. MatrixMul -O3 (#threads = 24, P = 17K): effect of #cores on contention factor
  • 34. Predictability of Performance: Contention vs #Cores. Observations: • In C, contention fluctuates (0-5.6) • Not so much in Go (0-1). Conclusion: • Likely due to Garbage Collection and Channel Utilization • A contention factor can be easily bounded in Go to guarantee the performance of some other program. MatrixMul -O3 (#threads = 24, P = 17K): effect of #cores on contention factor
  • 35. Predictability of Performance: Modeling across problem sizes. • Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
  • 36. Predictability of Performance: Problem size vs Exploited Parallelism. Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on exploited parallelism. Observations (in Go): • Exploited parallelism only decreases slightly as problem size increases
  • 37. Predictability of Performance: Problem size vs Data Dependency. Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on data dependency. Observations (in Go): • Data dependency decreases as expected as problem size increases
  • 38. Predictability of Performance: Problem size vs Contention. Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on contention. Observations (in Go): • Memory contention only increases slightly as problem size increases. Conclusion: • Measurement inputs on small problems are more accurate in Go than in C
  • 39. Conclusion. 1. How does Go compare to C in a multicore environment? Go's Actual Performance – Comparable performance before, inferior performance after programmer optimization – Consequence of different levels of optimization – Performance margin decreases as the problem size increases on intermediate core numbers – Cost of goroutines much lower than threads. Go's Predicted Performance – Model exhibits better accuracy – Memory contention does not fluctuate as #cores changes – Measurements consistent with assumptions as problem size changes
  • 40. Conclusion. 2. Is the model extensible beyond C, traditional multicores, and high contention? – Modified / Validated for low contention problems – Validated for the Go language – Validated for ARM devices 3. Can we make the model easier to use? – Formally defined validation criteria – Wrote script to perform model validation – Wrote script to perform performance prediction – *Future Work* Front end for prediction
  • 41. Compiler Optimization: Runtime vs #Cores. Observations: • Sequential: Go is 31% slower • Parallel: Go is 0-28% slower • On UMA, the times ratio decreases as #cores increases. MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime and on times difference
  • 42. Reasons. Observations (in Go): 1. Instructions executed: 4.5x as many 2. #Cycles: sequential (30% higher), parallel (similar) 3. Cache Misses: sequential (10% higher), parallel (46% less)
  • 43. Compiler Optimization: Parallelism vs #Cores. Observations: • Go's speedup is higher than C's on its own base, but lower when normalized. • Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size? MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on exploited parallelism and on normalized speedup
  • 44. Sequential Optimization (chart): No optimization; Compiler optimization; Compiler + Programmer optimization
  • 45. Predictability of Performance: Modeling across problem sizes. • Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction? • Observation: The performance profiles in Go are consistent with expectations as problem size changes • Result: Measurement inputs on small problems are more accurate in Go than in C