SC08_talk_final_handouts

Improving Throughput of Simultaneous Multithreading (SMT) Processors using Application Signatures and Thread Priorities

Mitesh R. Meswani
University of Texas at El Paso (UTEP)
11/20/2008

Simultaneous Multithreading (SMT)

[Figure: utilization of the execution units (FP, FX, LSU) over processor cycles 1 to 6, shown for single-threaded execution and for SMT execution with two hardware threads. Legend: Thread-X executing, Thread-Y executing, no thread executing. Callouts: "Thread X waits until resource is free, due to sharing" and "Thread X uses unused resource".]

•  SMT hardware contexts share most of the processor resources
•  Potential of 2x throughput with perfect resource sharing
•  Throughput gains limited by contention for shared resources

Research Question and Hypothesis

•  SMT-performance tunables:
– Enable or disable SMT mode
– Prioritize one hardware thread over the other
•  Research Question: What are the optimal priority settings for best processor throughput?
•  Hypothesis: Use hints from resource usage in single-threaded mode


Dissertation Contributions

1.  Showed that prioritization of threads improves throughput for nearly half the applications studied
2.  Defined and captured application “signatures”, which characterize usage of critical resources
3.  Showed that only a small set of signatures is present in real-world applications
4.  Developed a prediction methodology using signature microbenchmarks and showed that our predictions improve throughput over no prioritization (default)


Experimental Platform: Thread Priorities in IBM POWER5

•  Six out of eight priorities are available to the operating system for normal mode of operation: 1, 2, 3, 4 (default), 5, and 6
•  Difference in hardware thread priorities controls decode cycle sharing
– Higher-priority thread gets more decode cycles
– Equal priorities (default) give one out of two decode cycles to each thread
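
The decode-cycle sharing above can be sketched in a few lines. The slide states only that equal priorities split decode cycles one-for-one and that the higher-priority thread gets more. The specific rule used here, a window of R = 2^(|X - Y| + 1) cycles in which the lower-priority thread receives one decode cycle, comes from IBM's published POWER5 descriptions as best recalled and is an assumption relative to this talk. The function name `decode_share` is hypothetical, and special handling of priority 1 is ignored.

```python
from fractions import Fraction
from typing import Tuple

# Priorities 1..6 are available to the OS in normal operation; 4 is the default.
NORMAL_PRIORITIES = range(1, 7)


def decode_share(prio_x: int, prio_y: int) -> Tuple[Fraction, Fraction]:
    """Return the fraction of decode cycles for thread X and thread Y.

    Assumption (not stated on the slide): the window is
    R = 2 ** (|X - Y| + 1) cycles and the lower-priority thread
    gets 1 of every R decode cycles.
    """
    if prio_x not in NORMAL_PRIORITIES or prio_y not in NORMAL_PRIORITIES:
        raise ValueError("priorities must be in 1..6")
    window = 2 ** (abs(prio_x - prio_y) + 1)
    low = Fraction(1, window)   # share of the lower-priority thread
    high = 1 - low              # share of the higher-priority thread
    if prio_x >= prio_y:
        return high, low
    return low, high
```

With equal priorities the window is two cycles, so each thread gets one of every two decode cycles, which matches the default behaviour described on the slide.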


Signatures

1. Identify significant resources: floating-point unit (FPU), fixed-point unit (FXU), L2 unified cache, and L2 unified TLB
2. Capture utilization of resources using performance counters
3. Define utilization levels of resources in single-threaded mode, forming a signature
–  Ten utilization levels, L1 to L10, per resource
–  Example: L1L2L3L9, L9L6L7L8, L2L3L10L6…
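
A minimal sketch of how a signature string could be formed from four utilization values. The slides do not give the level boundaries, so equal-width 10% bins are an assumption here, as is the ordering of resources in the string (FPU, FXU, L2 cache, TLB, following the order of the list above). The names `utilization_level` and `signature` are hypothetical.

```python
from typing import Dict

# Resource order in the signature string is assumed from the slide's list.
RESOURCES = ("FPU", "FXU", "L2", "TLB")


def utilization_level(utilization: float) -> int:
    """Map a utilization in [0, 1] to a level 1..10.

    Assumed binning: [0.0, 0.1) -> 1, [0.1, 0.2) -> 2, ..., [0.9, 1.0] -> 10.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return min(10, int(utilization * 10) + 1)


def signature(utilizations: Dict[str, float]) -> str:
    """Concatenate one level per resource, e.g. 'L1L2L3L9'."""
    return "".join(f"L{utilization_level(utilizations[r])}" for r in RESOURCES)
```

Under these assumptions, utilizations of 5%, 15%, 25%, and 85% for the four resources yield the string `L1L2L3L9`, the first example on the slide.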


Work Flow

Step 1: Find Signatures of Real Applications
•  Inputs: serial application and performance counter settings
•  Run application in single-threaded mode and periodically sample counters
•  Store the signatures in the signature database

Step 2: Create Signature Microbenchmarks for Frequently Appearing Signatures and Empirically Find Priority Predictions
•  Run signature-microbenchmark pair X, Y in SMT mode with priorities i, j
•  Store CPI for all priorities for pair X, Y
•  Identify best-case priority for pair X, Y and store it as a prediction in the prediction database

Step 3: Execute Application Pairs using Predicted Priorities
•  Read the signatures of application pair A, B from the signature database
•  Found dominating signatures?
– Yes: read the priority of A and the priority of B from the prediction database, and run pair A, B with predicted priorities in SMT mode
– No: run pair A, B with equal priorities in SMT mode
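
Steps 2 and 3 of the work flow can be sketched as a table lookup with a fallback. This is an illustration under stated assumptions, not the dissertation's implementation: "best case" is taken to mean the priority pair with the lowest measured CPI for the microbenchmark pair (the slides do not say how the two threads' CPIs are combined), and all function names and data layouts are hypothetical.

```python
from typing import Dict, Optional, Tuple

Priorities = Tuple[int, int]        # (priority of first thread, priority of second)
SignaturePair = Tuple[str, str]     # (signature X, signature Y)

DEFAULT_PRIORITIES: Priorities = (4, 4)  # equal priorities, the default setting


def best_priorities(cpi_by_priority: Dict[Priorities, float]) -> Priorities:
    """Step 2: pick the priority pair with the lowest CPI (assumed metric)."""
    return min(cpi_by_priority, key=cpi_by_priority.get)


def build_prediction_db(
    measurements: Dict[SignaturePair, Dict[Priorities, float]]
) -> Dict[SignaturePair, Priorities]:
    """Step 2: one predicted priority pair per signature-microbenchmark pair."""
    return {pair: best_priorities(cpis) for pair, cpis in measurements.items()}


def choose_priorities(
    dominant_a: Optional[str],
    dominant_b: Optional[str],
    prediction_db: Dict[SignaturePair, Priorities],
) -> Priorities:
    """Step 3: use the prediction if both applications have a dominating
    signature and the pair was measured; otherwise run with equal priorities."""
    if dominant_a is None or dominant_b is None:
        return DEFAULT_PRIORITIES
    return prediction_db.get((dominant_a, dominant_b), DEFAULT_PRIORITIES)
```

Falling back to `(4, 4)` mirrors the "No" branch of the flow chart, where a pair without dominating signatures runs with equal priorities.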


Details of Step 1

•  Four groups of counters were measured
•  Each group measured in separate runs
•  Sampled in one-second time intervals
•  Signature of an interval is composed from utilization for that interval from 4 runs

[Figure: timeline of samples 0 to 21 for Run 1, Run 2, Run 3, and Run 4, with Interval 0 marked across all four runs.]
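
The composition of per-interval signatures from four separate runs can be sketched as follows. The assumptions are that each run contributes one utilization value per one-second interval for one resource, that intervals line up by index across runs, and that levels are equal-width 10% bins; none of these details are spelled out on the slide, and `interval_signatures` is a hypothetical name.

```python
from typing import List


def level(utilization: float) -> int:
    """Assumed binning: [0.0, 0.1) -> 1, ..., [0.9, 1.0] -> 10."""
    return min(10, int(utilization * 10) + 1)


def interval_signatures(runs: List[List[float]]) -> List[str]:
    """Combine interval i of each of the four runs into one signature.

    runs[k][i] is the utilization measured in run k during interval i.
    Runs may differ slightly in length, so only the intervals present
    in every run are used.
    """
    if len(runs) != 4:
        raise ValueError("expected exactly four runs, one per counter group")
    n_intervals = min(len(run) for run in runs)
    return [
        "".join(f"L{level(run[i])}" for run in runs)
        for i in range(n_intervals)
    ]
```

Truncating to the shortest run is a design choice of this sketch; it avoids building a signature from an interval that was not observed in all four runs.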


Different Signatures are Present in Real Applications (SPEC CPU2006, NAS NPB SER, PETSc KSP/Matrix)

[Figure: Signature Histogram of Four SPEC CPU2006 and Two PETSc KSP Library Functions. X-axis: applications (429.mcf, 416.gamess, 444.namd, 462.libquantum, cgs, gmres). Y-axis: % of total cycles, 0% to 100%. Legend: L1L1L1L1, L3L1L1L1, L3L2L1L1, L2L1L1L1, L2L3L1L1, L2L2L1L1, L1L4L1L1, L1L1L9L5, L1L2L7L4, L1L1L7L4, L1L1L6L4, L1L2L6L3, L1L2L5L2, L1L3L1L1, L1L2L2L1, L1L2L3L1, L1L2L6L4, L1L2L5L4, L1L2L5L3, L1L2L4L3, L1L2L4L2, L1L2L3L2, L1L1L2L1, L1L2L1L1. Callout: one signature > 80% (dominant).]
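
A signature histogram of this kind, and the 80% dominance test from the callout, can be computed as in the sketch below. The input layout (one `(signature, cycles)` pair per sampled interval) and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

DOMINANCE_THRESHOLD = 0.80  # from the slide: one signature > 80% is dominant


def signature_histogram(samples: List[Tuple[str, int]]) -> Dict[str, float]:
    """Return each signature's share of total cycles.

    samples holds one (signature, cycles) pair per sampled interval.
    """
    cycles = Counter()
    for sig, n_cycles in samples:
        cycles[sig] += n_cycles
    total = sum(cycles.values())
    if total == 0:
        return {}
    return {sig: n / total for sig, n in cycles.items()}


def dominant_signature(histogram: Dict[str, float]) -> Optional[str]:
    """Return the signature covering more than 80% of cycles, if any."""
    if not histogram:
        return None
    sig, share = max(histogram.items(), key=lambda item: item[1])
    return sig if share > DOMINANCE_THRESHOLD else None
```

An application whose histogram has no bar above the threshold returns `None` here, which corresponds to an application with multiple signatures.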


Conclusions

1.  Showed that equal priorities (default) are not the best for nearly 47% of applications studied
2.  Only 16 signatures are sufficient to represent 95.5% of execution time of 20 SPEC CPU2006 benchmarks, 9 NAS NPB3.2 Serial benchmarks, 119 PETSc KSP, and 180 PETSc Matrix libraries
3.  Priority predictions using signature benchmarks improve throughput over default settings for 87% of the 15 PETSc KSP coschedules.


Applications with Multiple Signatures

[Figure: two examples of applications with multiple signatures, one with long phases and one with repeating small phases, each showing distinct transitions.]

Future Work and References

Future Work:
•  Identify applications with multiple signatures
•  Dynamic adaptation of priorities
•  Detecting signatures on the fly
•  Phase detection and prediction for a truly adaptive system

References:
•  M. R. Meswani, P. J. Teller, and S. Arunangiri, “A Study of the Influence of the POWER5 Dynamic Resource Balancing Hardware on Optimal Hardware Thread Priorities,” to appear in the Proceedings of the 2008 Live Virtual Constructive Conference, Jan 2009, El Paso, TX.
•  M. R. Meswani and P. J. Teller, “Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000,” Proceedings of the 2nd International Workshop on Operating Systems Interference in High Performance Applications, in conjunction with the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT06), sponsored by ACM and IEEE, September 2006, Seattle, WA.


Acknowledgements

•  This work is supported by AHPCRC Grant W11NF-07-2-2007
•  Dr. Patricia J. Teller, Professor, UTEP (Advisor)
•  Amir Simon, IBM, for assistance with p550 machine
•  Email: mitesh.meswani@gmail.com
•  URL: www.linkedin.com/in/miteshmeswani
