Improving Throughput of Simultaneous Multithreading (SMT) Processors using Application Signatures and Thread Priorities

Mitesh R. Meswani
University of Texas at El Paso (UTEP)
11/20/2008
  
Simultaneous Multithreading (SMT)

[Figure: utilization of the execution units (FP, FX, LSU) over processor cycles 1 to 6, shown once for single-threaded execution and once for SMT execution with two hardware threads. Legend: Thread-X executing, Thread-Y executing, no thread executing. Annotations: "Thread X waits until resource is free, due to sharing" and "Thread X uses unused resource".]

•  SMT hardware contexts share most of the processor resources
•  Potential of 2x throughput with perfect resource sharing
•  Throughput gains limited by contention of shared resources
  
Research Question and Hypothesis

•  SMT-performance Tunables:
– Enable or disable SMT mode
– Prioritize one hardware thread over the other

•  Research Question: What are the optimal priority settings for best processor throughput?

•  Hypothesis: Use hints from resource usage in Single-threaded mode
  
Dissertation Contributions

1.  Showed that prioritization of threads improves throughput for nearly half the applications studied
2.  Defined and captured application “signatures” which characterize usage of critical resources
3.  Showed that only a small set of signatures are present in real world applications
4.  Developed a prediction methodology using signature microbenchmarks and showed that our predictions improve throughput over no prioritization (default)
  
Experimental Platform: Thread Priorities in IBM POWER5

•  Six out of eight priorities available to the operating system for normal mode of operation: 1, 2, 3, 4 (default), 5, and 6
•  Difference in hardware thread priorities controls decode cycle sharing
– Higher Priority thread gets more decode cycles
– Equal Priorities (default) gives one out of two decode cycles to each thread
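The effect of a priority difference on decode-cycle sharing can be sketched in a few lines. The slide states only that the higher-priority thread gets more decode cycles and that equal priorities give each thread one of every two. The specific rule used below, where the lower-priority thread receives one decode cycle out of every 2^(|X-Y|+1), comes from IBM's published POWER5 descriptions rather than from this talk, so treat it as an assumption. The function name `decode_share` is hypothetical, and the sketch covers priorities 2 through 6 only, since the lowest priorities are reported to behave as special cases.

```python
from fractions import Fraction


def decode_share(prio_x: int, prio_y: int) -> tuple[Fraction, Fraction]:
    """Return the assumed fraction of decode cycles given to threads X and Y."""
    for p in (prio_x, prio_y):
        if not 2 <= p <= 6:
            raise ValueError("this sketch only models priorities 2 through 6")
    # Assumed rule: the lower-priority thread gets 1 of every R decode cycles,
    # where R = 2 ** (|X - Y| + 1); the higher-priority thread gets the rest.
    r = 2 ** (abs(prio_x - prio_y) + 1)
    low, high = Fraction(1, r), Fraction(r - 1, r)
    # With equal priorities R = 2, so low == high == 1/2.
    return (high, low) if prio_x >= prio_y else (low, high)
```

Under this assumed rule, each extra step of priority difference halves the decode share of the lower-priority thread, which is why small priority changes can shift throughput noticeably.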
  
Signatures

1. Identify Significant Resources: Floating-point unit (FPU), Fixed-point unit (FXU), L2 unified cache, and L2 unified TLB
2. Capture utilization of resources using performance counters
3. Define utilization levels of resources in Single-Threaded mode, forming a signature
–  Ten utilization levels L1 to L10 per resource
–  Example: L1L2L3L9, L9L6L7L8, L2L3L10L6…
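A minimal sketch of how a signature string could be built from four utilization measurements. The slide does not say how the ten levels are bounded, so the equal-width 10% bins here are an assumption, as is the resource order (taken from the order in which the slide lists the resources). The names `RESOURCES`, `utilization_level`, and `signature` are hypothetical.

```python
RESOURCES = ("FPU", "FXU", "L2 cache", "L2 TLB")  # assumed signature order


def utilization_level(utilization_pct: float) -> int:
    """Map a utilization percentage in [0, 100] to a level from 1 to 10."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100 percent")
    # Assumed bins: [0, 10) -> 1, [10, 20) -> 2, ..., [90, 100] -> 10.
    return min(10, int(utilization_pct // 10) + 1)


def signature(utilization_pct: dict[str, float]) -> str:
    """Concatenate one level per resource, e.g. 'L1L2L9L5'."""
    return "".join(f"L{utilization_level(utilization_pct[r])}" for r in RESOURCES)
```

For example, `signature({"FPU": 5, "FXU": 12, "L2 cache": 88, "L2 TLB": 41})` yields a string in the same form as the examples on the slide.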
  
Work Flow

Step 1: Find Signatures of Real Applications
– Inputs: Serial Application, Performance Counter Settings
– Run Application in Single-Threaded Mode and Periodically Sample Counters
– Output: Signatures, stored in the Signature Data Base

Step 2: Create Signature Microbenchmarks for Frequently Appearing Signatures and Empirically Find Priority Predictions
– Input: Signature-microbenchmark Pair X, Y
– Run Signature-Microbenchmark Pair with Priorities i, j in SMT Mode
– Store CPI for all priorities for Pair X, Y
– Identify Best Case Priority for Pair X, Y
– Output: Predictions, stored in the Prediction Data Base

Step 3: Execute Application Pairs using Predicted Priorities
– Input: Application Pair A, B
– Read Signatures of A, B from the Signature Data Base
– Found Dominating Signatures?
– Yes: Read Priorities (Priority of A, Priority of B) from the Prediction Data Base and Run Pair A, B with Predicted Priorities in SMT Mode
– No: Run Pair A, B with Equal Priorities in SMT Mode
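Steps 2 and 3 of the work flow can be sketched as two small functions. The slides say that CPI is stored for every priority setting and that a best-case priority is identified, but not which metric decides "best"; the sketch assumes combined IPC (the sum of 1/CPI for both threads). Falling back to equal priorities when a signature pair has no stored prediction is also an assumption. All names (`best_priorities`, `choose_priorities`, `EQUAL_PRIORITIES`) are hypothetical.

```python
EQUAL_PRIORITIES = (4, 4)  # default priority 4 for both hardware threads


def best_priorities(cpi_by_priority):
    """Step 2: pick the priority pair (i, j) with the highest combined IPC.

    cpi_by_priority maps (i, j) -> (cpi_x, cpi_y) measured in SMT mode.
    Using summed IPC as the throughput metric is an assumption.
    """
    return max(
        cpi_by_priority,
        key=lambda ij: sum(1.0 / cpi for cpi in cpi_by_priority[ij]),
    )


def choose_priorities(sig_a, sig_b, predictions):
    """Step 3: look up predicted priorities, else use equal priorities.

    sig_a and sig_b are the dominating signatures of applications A and B,
    or None when an application has no dominating signature.
    predictions maps (sig_a, sig_b) -> (priority_a, priority_b).
    """
    if sig_a is None or sig_b is None:
        return EQUAL_PRIORITIES  # the "No" branch of the flow chart
    return predictions.get((sig_a, sig_b), EQUAL_PRIORITIES)
```

Keeping the prediction table keyed by signature pair rather than by application pair reflects the point of the method: priorities measured once on microbenchmarks are reused for any applications that share those signatures.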
  
Details of Step 1

•  Four groups of counters were measured
•  Each group measured in separate runs
•  Sampled in one second time intervals
•  Signature of an interval is composed from utilization for that interval from 4 runs

[Figure: timeline of sample numbers 0 to 21 for Run 1, Run 2, Run 3, and Run 4, with Interval 0 marked across the four runs.]
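Composing one signature per interval from four separate runs can be sketched as follows. The sketch assumes that each run yields the utilization level of exactly one resource per one-second interval and that sample number t lines up across runs; neither detail is spelled out on the slide. The function name `interval_signatures` is hypothetical.

```python
def interval_signatures(runs):
    """Combine per-interval levels from separate runs into signatures.

    runs is a list of four lists; runs[k][t] is the utilization level (1..10)
    of resource k during interval t, as measured in run k.
    """
    # zip stops at the shortest run, so an interval that is missing from
    # any run is dropped rather than given a partial signature.
    return ["".join(f"L{level}" for level in levels) for levels in zip(*runs)]
```

Because the runs are separate executions, this alignment by sample number only makes sense if the application behaves the same way from run to run.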
  
Different Signatures are Present in Real Applications
(SPEC CPU2006, NAS NPB SER, PETSc KSP/Matrix)

[Figure: Signature Histogram of Four SPEC CPU2006 and Two PETSc KSP Library Functions. X-axis: Applications (429.mcf, 416.gamess, 444.namd, 462.libquantum, cgs, gmres). Y-axis: % of Total Cycles, 0% to 100%. Signatures in the legend: L1L1L1L1, L3L1L1L1, L3L2L1L1, L2L1L1L1, L2L3L1L1, L2L2L1L1, L1L4L1L1, L1L1L9L5, L1L2L7L4, L1L1L7L4, L1L1L6L4, L1L2L6L3, L1L2L5L2, L1L3L1L1, L1L2L2L1, L1L2L3L1, L1L2L6L4, L1L2L5L4, L1L2L5L3, L1L2L4L3, L1L2L4L2, L1L2L3L2, L1L1L2L1, L1L2L1L1. Annotation: One Signature > 80% (dominant).]
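A signature histogram and the "dominant" test from the figure can be sketched as below. The figure reports the percentage of total cycles; the sketch instead counts sampled intervals, which assumes that equal-length intervals are a reasonable stand-in for cycles. The 80% threshold is the one marked on the slide; the function names are hypothetical.

```python
from collections import Counter

DOMINANT_THRESHOLD = 0.80  # the slide marks one signature > 80% as dominant


def signature_histogram(interval_sigs):
    """Return the fraction of sampled intervals spent in each signature."""
    counts = Counter(interval_sigs)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


def dominant_signature(interval_sigs):
    """Return the signature covering more than 80% of intervals, or None."""
    if not interval_sigs:
        return None
    sig, n = Counter(interval_sigs).most_common(1)[0]
    return sig if n / len(interval_sigs) > DOMINANT_THRESHOLD else None
```

An application for which `dominant_signature` returns `None` has no single signature to look up, which corresponds to the equal-priorities fallback in Step 3 of the work flow.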
  
Conclusions

1.  Showed that equal priorities (default) are not the best for nearly 47% of applications studied
2.  Only 16 Signatures are sufficient to represent 95.5% of execution time of 20 SPEC CPU2006 benchmarks, 9 NAS NPB3.2 Serial benchmarks, 119 PETSc KSP, and 180 PETSc Matrix libraries
3.  Priority predictions using signature benchmarks improve throughput over default settings for 87% of the 15 PETSc KSP coschedules.
  
Applications with Multiple Signatures

[Figure: two panels, each labeled "Distinct Transitions"; one shows "Long Phases", the other "Repeating Small Phases".]

Future Work and References

Future Work:
•  Identify applications with multiple signatures
•  Dynamic adaptation of priorities
•  Detecting signatures on the fly
•  Phase detection and Prediction for a truly adaptive system

References:
•  M. R. Meswani, P. J. Teller, and S. Arunangiri, “A Study of the Influence of the POWER5 Dynamic Resource Balancing Hardware on Optimal Hardware Thread Priorities,” to appear in the Proceedings of the 2008 Live Virtual Constructive Conference, Jan 2009, El Paso, TX.
•  M. R. Meswani and P. J. Teller, “Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000,” Proceedings of the 2nd International Workshop on Operating Systems Interference In High Performance Applications, in conjunction with the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT06), sponsored by ACM and IEEE, September 2006, Seattle, WA.
  
Acknowledgements

•  This work is supported by AHPCRC Grant W11NF-07-2-2007
•  Dr. Patricia J. Teller, Professor, UTEP (Advisor)
•  Amir Simon, IBM, for assistance with p550 machine
•  Email: mitesh.meswani@gmail.com
•  URL: www.linkedin.com/in/miteshmeswani
  
