Performance of Go on Multicore Systems

Huang Yipeng
19th November 2012
FYP Presentation
Motivation

• Multicore systems have become common
• But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO
• Because most programs are explicitly parallel
  – #Threads
  – #Cores
Motivation: Why Go?
Objective

• To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance, respectively)
  
Related Work

• Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011)
• A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011)
Related Work: Differences

[Diagram: parallelism of a shared-memory program, divided into Useful Work, Data Dependency, and Memory Contention]
  
• Shared Memory Programs: implicit parallelism (e.g. Go) vs. explicit parallelism (e.g. C & OpenMP)
• Processor Architecture: emerging platforms (e.g. ARM) vs. multicore platforms (e.g. Intel, AMD)
• Parallelism Performance: analytical models for low memory contention vs. high memory contention
Contributions

1. Insights about the parallelism performance of Go
2. Extend our analytical parallelism model for programs with lower memory contention
3. Automate performance prediction and model validation with scripts
  
Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
  
Process Methodology

[Diagram: a Go program undergoes baseline executions, producing parallelism traces from (1) hardware counters (perf stat 3.0) and (2) the run queue (proc reader); the traces feed the analytical models, which produce a parallelism prediction]
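The hardware-counter traces come from wrapping each baseline run in `perf stat`. As a minimal sketch of what the measurement scripts automate (this is not the author's actual tooling, and the benchmark binary `./matrixmul` is hypothetical), a Go driver might build and launch the counter collection like this:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// perfArgs builds the argument vector for `perf stat`, which records the
// hardware counters (cycles, instructions, cache misses) used as model inputs.
func perfArgs(events []string, command ...string) []string {
	args := []string{"stat", "-e", strings.Join(events, ",")}
	args = append(args, "--")
	return append(args, command...)
}

func main() {
	events := []string{"cycles", "instructions", "cache-misses"}
	args := perfArgs(events, "./matrixmul", "4992") // hypothetical benchmark binary
	fmt.Println("running: perf", strings.Join(args, " "))
	// Actually executing this requires perf(1) to be installed and the binary to exist.
	if out, err := exec.Command("perf", args...).CombinedOutput(); err != nil {
		fmt.Println("perf run failed:", err)
	} else {
		fmt.Print(string(out))
	}
}
```

`perf stat` prints its counter summary to stderr, which is why the sketch captures combined output; the run-queue samples would come from a separate reader polling /proc.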
  
Analytical Parallelism Model

Parallelism of a shared-memory program: m threads, n cores

[Diagram: execution divided into Useful Work, Data Dependency, and Memory Contention; number of threads m, exploited parallelism π′, contention M(n)]
  
Experimental Setup: Workloads
  
Experimental Setup: Machine

Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux kernel 3.0
  
Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
  
The Memory Contention Model

[Chart: SP (Class C); contention peaks at 9.7]

Definition: low contention problems have a contention ≤ 1.2

Observation: low contention problems exhibit a W-like pattern not captured by the model. Why does this occur?
  
Validation of Memory Cont. Model

[Charts: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)]
  
Modification of Memory Cont. Model

[Charts: Original Model: Matrix Mul vs. Revised Model: Matrix Mul]

Model revalidated...
1. For Matrix Multiplication (down from 50% error to 7%)
2. For other low contention programs
3. In Go and C
4. On Intel and ARM multicores
Outline

• Motivation
• Related Work
• Methodology
  – Approach
  – Validation
• Evaluation
• Conclusion
  
Performance analysis: Go vs C

1. How much poorer is Go compared to C? Why?
   – Runtime, speedup vs #Cores
2. Could Go outperform C?
   – Runtime vs problem size
   – Runtime vs #Threads
3. Predictability of actual performance
   – Modeled vs Measured
   – Contention vs #Cores
   – Problem size vs Exploited Parallelism / Data Dependency / Contention
  
Points of Comparison

[Quadrant: unoptimized vs. optimized; compiler optimization vs. programmer optimization]

Experiment 1: Matrix Multiplication (4992×4992)
No optimization flags (-N for Go), #threads = 24
→ Go is comparable with C

Experiment 2: Matrix Multiplication (4992×4992)
-O3 optimization for C, no flag for Go, #threads = 24
→ Go is marginally slower than C

Experiment 3: Transposed Matrix Multiplication (4992×4992)
-O3 optimization for C, no flag for Go, #threads = 24
→ Go is much worse than C
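The slides do not show the benchmark code itself. A minimal Go sketch of the two kernels (ordinary multiplication, and the transposed "programmer optimization" of Experiment 3, with rows split across goroutines) might look like the following; it illustrates the technique, not the author's benchmark, and on the 2012-era Go 1.0 toolchain one would also set runtime.GOMAXPROCS to use all cores.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// multiply computes C = A*B for n×n row-major matrices, splitting
// rows across `workers` goroutines. The inner loop walks a column of b,
// which is cache-unfriendly for large n.
func multiply(a, b []float64, n, workers int) []float64 {
	c := make([]float64, n*n)
	rows := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			for i := start; i < start+rows && i < n; i++ {
				for j := 0; j < n; j++ {
					sum := 0.0
					for k := 0; k < n; k++ {
						sum += a[i*n+k] * b[k*n+j] // column walk over b
					}
					c[i*n+j] = sum
				}
			}
		}(w * rows)
	}
	wg.Wait()
	return c
}

// multiplyTransposed first transposes b so the inner loop scans both
// operands sequentially — the programmer optimization of Experiment 3.
func multiplyTransposed(a, b []float64, n, workers int) []float64 {
	bt := make([]float64, n*n)
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			bt[j*n+i] = b[i*n+j]
		}
	}
	c := make([]float64, n*n)
	rows := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			for i := start; i < start+rows && i < n; i++ {
				for j := 0; j < n; j++ {
					sum := 0.0
					for k := 0; k < n; k++ {
						sum += a[i*n+k] * bt[j*n+k] // both rows scanned sequentially
					}
					c[i*n+j] = sum
				}
			}
		}(w * rows)
	}
	wg.Wait()
	return c
}

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{5, 6, 7, 8}
	fmt.Println(multiply(a, b, 2, runtime.NumCPU()))           // [19 22 43 50]
	fmt.Println(multiplyTransposed(a, b, 2, runtime.NumCPU())) // [19 22 43 50]
}
```

The transposition costs O(n²) but makes the O(n³) inner loop stream through memory, which is exactly the cache behavior Experiment 3 probes.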
  	
  
No Optimization: Runtime vs #Cores

[Charts: MatrixMul (#threads = 24, P size = 5K): effect of #cores on runtime; effect of #cores on times ratio]

Observations:
• Sequential: Go is 16% slower
• Parallel: Go is up to 5% faster
Reasons

Observations (in Go):
1. Instructions executed: 12% fewer
2. #Cycles: sequential 16% higher, parallel 5% fewer
3. Cache misses: sequential 27x worse, parallel similar

Conclusion:
• Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.
  
No Optimization: Parallelism (Speedup) vs #Cores

[Charts: MatrixMul (#threads = 24, P size = 5K): effect of #cores on speedup; effect of #cores on normalized speedup (against best sequential execution time)]

Observations:
• Go makes up for poor sequential performance with a higher speedup.
• Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x)
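The two speedup figures differ only in their baseline: own-base speedup uses a program's own sequential runtime, while normalized speedup uses the best sequential time across implementations. A small sketch (the runtimes are hypothetical, chosen only to mirror the 16%-slower-sequential observation):

```go
package main

import "fmt"

// speedup is the own-base speedup: a program's sequential runtime
// divided by its parallel runtime.
func speedup(seq, par float64) float64 { return seq / par }

// normalizedSpeedup instead divides the best sequential time across all
// implementations (here C's), putting Go and C on a common baseline.
func normalizedSpeedup(bestSeq, par float64) float64 { return bestSeq / par }

func main() {
	goSeq, goPar := 116.0, 10.0 // hypothetical runtimes in seconds
	cBestSeq := 100.0           // hypothetical: the C sequential run is fastest
	fmt.Printf("own-base speedup:   %.1f\n", speedup(goSeq, goPar))
	fmt.Printf("normalized speedup: %.1f\n", normalizedSpeedup(cBestSeq, goPar))
}
```

A slower sequential baseline inflates own-base speedup, which is why Go's higher speedup curve does not by itself imply a faster program.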
  
Both Optimizations: Runtime vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime; effect of #cores on times difference]

Observations:
• Sequential: Go is 400% slower
• Parallel: Go is 180-340% slower
Reasons

Observations (in Go):
1. Instructions executed: 5.2x as many
2. #Cycles: sequential 400% higher, parallel 180% higher
3. Cache misses: sequential 64% fewer, parallel 56% fewer

Conclusions:
• Go's optimization is not as mature as C's: sequential instructions reduced 1.3x vs 8x, cycles reduced 4x vs 18x
• Go has better cache management
Both Optimizations: Parallelism vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on speedup; effect of #cores on normalized speedup]

Observations:
• Go speedup is higher than C's on its own base, but significantly worse when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Compiler Optimization: Varying Problem Size

[Charts: MatrixMul -O3 (#threads = 24): effect of #cores on times difference, at P size = 5K and P size = 10K]

Observation:
• Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1

Conclusion:
• In general, Go is increasingly competitive as the problem size increases.
  
Both Optimizations: Varying Problem Size

[Chart: MatrixMul -O3 (#threads = 24): effect of problem size and #cores on times difference]

Observation:
• The times ratio decreases as the problem size increases on 1-20 cores.

Conclusion:
• There is a valley of performance at intermediate core numbers.
  
Both Optimizations: Runtime vs #threads

[Chart: MatrixMul (#cores = 24, problem size = 5K): effect of #threads on runtime]

Observation:
• Go's relative performance improves as the #threads increases.

Conclusions:
• The cost of goroutines in Go is extremely low.
• Go's performance may improve on problems with high data dependency.
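The low goroutine cost is easy to demonstrate directly. The sketch below (an illustration, not the slide's benchmark) fans one trivial task out to each of 100,000 goroutines: goroutines start with kilobyte-sized stacks and are multiplexed onto OS threads by the runtime, so this completes quickly, whereas one OS thread per task would be prohibitively expensive.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// squares computes i*i for i in [0, n) using one goroutine per element —
// deliberately absurd granularity, to show that goroutine creation is cheap.
func squares(n int) []int {
	out := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = i * i
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	const n = 100000
	start := time.Now()
	squares(n)
	fmt.Printf("spawned and joined %d goroutines in %v\n", n, time.Since(start))
}
```

This is why runtime barely degrades as #threads grows past #cores in the chart above: adding goroutines adds scheduling bookkeeping, not OS threads.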
  
Predictability of Actual Performance

• Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, and problem size
• Observations (in Go):
  – Model exhibits better accuracy
  – Memory contention does not fluctuate as #cores changes
  – Measurements consistent with assumptions as problem size changes
• Result: Go exhibits properties useful for prediction that C does not.
  
Predictability of Performance: Modeled vs Measured

Observations:
• Contention error: C (avg 15%, max 55%); Go (avg 3%, max 14%)
• Parallelism error: C (avg 18%, max 44%); Go (avg 6%, max 15%)
• Runtime error: C (avg 16%, max 47%); Go (avg 5%, max 13%)

Conclusion:
• Go has better predictability than C
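The error figures summarize the gap between modeled and measured values across runs. As a sketch of how such a summary might be computed by the validation scripts (the relative-error form |modeled − measured| / measured and the sample values are assumptions, not the author's actual script):

```go
package main

import (
	"fmt"
	"math"
)

// relErrors returns the average and maximum relative error (in percent)
// between modeled and measured values, assuming |modeled - measured| / measured.
func relErrors(modeled, measured []float64) (avg, max float64) {
	for i := range measured {
		e := math.Abs(modeled[i]-measured[i]) / measured[i] * 100
		avg += e
		if e > max {
			max = e
		}
	}
	return avg / float64(len(measured)), max
}

func main() {
	modeled := []float64{9.5, 18.0, 33.0}   // hypothetical predicted runtimes (s)
	measured := []float64{10.0, 18.0, 30.0} // hypothetical measured runtimes (s)
	avg, max := relErrors(modeled, measured)
	fmt.Printf("avg %.1f%%, max %.1f%%\n", avg, max)
}
```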
  
Predictability of Performance: Contention vs #Cores

[Chart: MatrixMul -O3 (#threads = 24, P = 17K): effect of #cores on contention factor]

Observations:
• In C, contention fluctuates (0-5.6)
• Not so much in Go (0-1)

Conclusions:
• Garbage Collection, Channel Util
• A contention factor can easily be bounded in Go to guarantee the performance of some other program.
  
Predictability of Performance: Modeling across problem sizes

• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
  
Predictability of Performance: Problem size vs Exploited Parallelism

[Charts: Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on exploited parallelism]

Observation (in Go):
• Exploited parallelism decreases only slightly as problem size increases
Predictability of Performance: Problem size vs Data Dependency

[Charts: Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on data dependency]

Observation (in Go):
• Data dependency decreases as expected as problem size increases
Predictability of Performance: Problem size vs Contention

[Charts: Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on memory contention]

Observation (in Go):
• Memory contention increases only slightly as problem size increases

Conclusion:
• Measurement inputs on small problems are more accurate in Go than in C
  
Conclusion

1. How does Go compare to C in a multicore environment?

Go's Actual Performance
– Comparable performance before, inferior performance after programmer optimization
– Consequence of different levels of optimization
– Performance margin decreases as the problem size increases on intermediate core numbers
– Cost of goroutines much lower than threads

Go's Predicted Performance
– Model exhibits better accuracy
– Memory contention does not fluctuate as #cores changes
– Measurements consistent with assumptions as problem size changes
  
Conclusion

2. Is the model extensible beyond C, traditional multicores, and high contention?
– Modified and validated for low contention problems
– Validated for the Go language
– Validated for ARM devices

3. Can we make the model easier to use?
– Formally defined validation criteria
– Wrote script to perform model validation
– Wrote script to perform performance prediction
– *Future Work* Front end for prediction
  
Compiler Optimization: Runtime vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime; effect of #cores on times difference]

Observations:
• Sequential: Go is 31% slower
• Parallel: Go is 0-28% slower
• On UMA, the times ratio decreases as #cores increases
Reasons

Observations (in Go):
1. Instructions executed: 4.5x as many
2. #Cycles: sequential 30% higher, parallel similar
3. Cache misses: sequential 10% higher, parallel 46% fewer
Compiler Optimization: Parallelism vs #Cores

[Charts: MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on exploited parallelism; effect of #cores on normalized speedup]

Observations:
• Go speedup is higher than C's on its own base, but lower when normalized.
• Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size?
Sequential Optimization

[Charts: no optimization; compiler optimization; compiler + programmer optimization]
  
Predictability of Performance: Modeling across problem sizes

• Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
• Observation: The performance profiles in Go are consistent with expectations as problem size changes
• Result: Measurement inputs on small problems are more accurate in Go than in C
  

Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 

Recently uploaded (20)

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 

Performance of Go on Multicore Systems

  • 1. Performance of Go on Multicore Systems. Huang Yipeng, 19th November 2012, FYP Presentation
  • 2. Motivation • Multicore systems have become common • But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO
  • 3. Motivation • Multicore systems have become common • But "dual, quad-cores are not useful all the time, they waste batteries..." - Stephen Elop, Nokia CEO • Because most programs are explicitly parallel – #Threads – #Cores
  • 5. Objective • To study the parallelism performance of Go, compared with C, using measurements and analytical models (to quantify actual and predicted performance, respectively)
  • 6. Related Work • Understanding the Off-chip Memory Contention of Parallel Programs in Multicore Systems (B.M. Tudor, Y.M. Teo, 2011) • A Practical Approach for Performance Analysis of Shared Memory Programs (B.M. Tudor, Y.M. Teo, 2011). Parallelism of a shared-memory program: Memory Contention, Useful Work, Data Dependency
  • 7. Related Work: Differences. Shared Memory Programs: Implicit Parallelism (e.g. Go) vs Explicit Parallelism (e.g. C & OpenMP). Processor Architecture: Emerging platforms (e.g. ARM) vs Multicore platforms (e.g. Intel, AMD). Parallelism Performance Analytical Models: Low Memory Contention vs High Memory Contention
  • 8. Contributions 1. Insights about the parallelism performance of Go 2. Extend our analytical parallelism model for programs with lower memory contention 3. Automate performance prediction and model validation with scripts
  • 9. Outline • Motivation • Related Work • Methodology – Approach – Validation • Evaluation • Conclusion
  • 10. Process Methodology (flow diagram): Go Program; Baseline Executions; Parallelism Traces from 1. Hardware Counters (Perf Stat 3.0) and 2. Run Queue (Proc Reader); Analytical Models; Parallelism Prediction
  • 11. Analytical Parallelism Model. Parallelism of a shared-memory program: m threads, n cores. Number of Threads: m; Exploited Parallelism: π′; Contention: M(n). Components: Memory Contention, Useful Work, Data Dependency
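The exact model is defined in the cited Tudor and Teo papers. Purely as an illustration of how the three components interact (this is a simplified Amdahl-style stand-in, not the thesis model; the function name and formula are assumptions), one can sketch:

```go
package main

import "fmt"

// predictedSpeedup is an illustrative stand-in, NOT the thesis model:
// a serial fraction f (data dependency) limits scaling in the usual
// Amdahl fashion, exploited parallelism is capped at min(m, n), and a
// contention factor >= 0 inflates the parallel part's runtime.
func predictedSpeedup(m, n int, f, contention float64) float64 {
	p := float64(m)
	if n < m {
		p = float64(n) // cannot exploit more parallelism than cores
	}
	parallelPart := (1 - f) * (1 + contention) / p
	return 1 / (f + parallelPart)
}

func main() {
	// No contention, no serial fraction: ideal speedup.
	fmt.Printf("ideal, 8 cores:     %.2f\n", predictedSpeedup(8, 8, 0, 0))
	// Contention of 1.0 halves the parallel part's throughput.
	fmt.Printf("contended, 8 cores: %.2f\n", predictedSpeedup(8, 8, 0, 1.0))
	// A 10% serial fraction caps speedup well below 8.
	fmt.Printf("f=0.1, 8 cores:     %.2f\n", predictedSpeedup(8, 8, 0.1, 0))
}
```

Even this crude form shows why a fluctuating contention factor (as measured later for C) makes runtime hard to predict, while a bounded one (as in Go) keeps predictions stable.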
  • 13. Experimental Setup: Machine. Non-Uniform Memory Access (24 cores): dual six-core Intel Xeon X5650 2.67 GHz, 2 hardware threads per core, 12 MB L3 cache, 16 GB RAM, running Linux Kernel 3.0
  • 14. Outline • Motivation • Related Work • Methodology – Approach – Validation • Evaluation • Conclusion
  • 15. The Memory Contention Model: SP (Class C) (chart; contention reaches 9.7)
  • 16. Validation of Memory Cont. Model. Definition: Low contention problems have a contention ≤ 1.2. Observation: Low contention problems exhibit a W-like pattern not captured by the model. Why does this occur? Benchmarks: Mandelbrot, Fannkuch-Redux, Spectral Norm, EP (Class C)
  • 17. Modification of Memory Cont. Model. Original Model vs Revised Model: Matrix Mul. Model revalidated... 1. For Matrix Multiplication (down from 50% error to 7%) 2. For other low contention programs 3. In Go and C 4. On Intel and ARM multicores
  • 18. Outline • Motivation • Related Work • Methodology – Approach – Validation • Evaluation • Conclusion
  • 19. Performance analysis: Go vs C. 1. How much poorer is Go compared to C? Why? – Runtime, speedup vs #Cores 2. Could Go outperform C? – Runtime vs Problem size – Runtime vs #Threads 3. Predictability of actual performance – Modeled vs Measured – Contention vs #Cores – Prob. size vs Exp. Parallelism / Data Dep. / Contention
  • 20. Points of Comparison (Unoptimized vs Optimized; Compiler Optimization vs Programmer Optimization). Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24. Go is comparable with C
  • 21. Points of Comparison. Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24. Go is comparable with C. Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24. Go is marginally worse than C
  • 22. Points of Comparison. Experiment 1: Matrix Multiplication (4992×4992), no optimization flags (-N for Go), #threads = 24. Experiment 2: Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24. Go is marginally slower than C. Experiment 3: Transposed Matrix Multiplication (4992×4992), -O3 optimization for C, no flag for Go, #threads = 24. Go is much worse than C
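The thesis code itself is not reproduced here, but the Experiment 3 kernel, matrix multiplication with the second operand transposed for cache locality, parallelized over goroutines, can be sketched as follows (function names and the strided row partition are assumptions for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// mulTransposed multiplies a (r x k) by b (k x c), with b supplied
// already transposed so both inner-loop accesses scan memory
// sequentially -- the cache-friendly variant of Experiment 3. Rows of
// the result are split across nWorkers goroutines.
func mulTransposed(a, bT [][]float64, nWorkers int) [][]float64 {
	r, k, c := len(a), len(a[0]), len(bT)
	res := make([][]float64, r)
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func(start int) {
			defer wg.Done()
			for i := start; i < r; i += nWorkers { // strided row partition
				row := make([]float64, c)
				for j := 0; j < c; j++ {
					var sum float64
					for x := 0; x < k; x++ {
						sum += a[i][x] * bT[j][x] // both operands read row-major
					}
					row[j] = sum
				}
				res[i] = row
			}
		}(w)
	}
	wg.Wait()
	return res
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	bT := [][]float64{{5, 7}, {6, 8}} // transpose of {{5, 6}, {7, 8}}
	fmt.Println(mulTransposed(a, bT, 2)) // [[19 22] [43 50]]
}
```

Transposing b is the "programmer optimization" of the comparison: it rewards C heavily once -O3 vectorizes the sequential inner loop, which is why Go falls furthest behind in this configuration.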
  • 23. No Optimization: Runtime vs #Cores. Observations: • Sequential: Go is 16% slower • Parallel: Go is up to 5% faster. MatrixMul (#threads = 24, P size = 5K): effect of #cores on runtime and on times ratio
  • 24. Reasons. Observations (in Go): 1. Instructions executed: 12% less 2. #Cycles: sequential (16% higher), parallel (5% less) 3. Cache Misses: sequential (27x worse), parallel (similar). Conclusions: • Go's poor sequential performance is caused by a heavy cache miss rate, likely a result of parallel overhead.
  • 25. No Optimization: Parallelism (Speedup) vs #Cores. Observations: • Go makes up for poor sequential performance with a higher speedup. • Normalized Go speedup is marginally better (up to 1.05x), except on 1/24 cores (0.86x/0.97x). MatrixMul (#threads = 24, P size = 5K): effect of #cores on speedup and on normalized speedup (against best sequential execution time)
  • 26. Both Optimizations: Runtime vs #Cores. Observations: • Sequential: Go is 400% slower • Parallel: Go is 180-340% slower. MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime and on times difference
  • 27. Reasons. Observations (in Go): 1. Instructions executed: 5.2x as many 2. #Cycles: sequential (400% higher), parallel (180% higher) 3. Cache Misses: sequential (64% less), parallel (56% less). Conclusions: • Go's optimization is not as mature as C's (sequential instructions reduced 1.3x vs 8x, cycles reduced 4x vs 18x) • Go has better cache management
  • 28. Both Optimizations: Parallelism vs #Cores. Observations: • Go's speedup is higher than C's on its own base, but significantly worse when normalized. • Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size? MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on speedup and on normalized speedup
  • 29. Compiler Optimization: Varying Problem Size. Observation: • Variance in the times ratio reduces from 1.0-1.3 to 1.0-1.1. Conclusion: • In general, Go is increasingly competitive as the problem size increases. MatrixMul -O3 (#threads = 24): effect of #cores on times difference at P size = 10K vs P size = 5K
  • 30. Both Optimizations: Varying Problem Size. MatrixMul -O3 (#threads = 24): effect of problem size and #cores on times difference. Observation: • The times ratio decreases as the problem size increases on 1-20 cores. Conclusion: • There is a valley of performance on intermediate core numbers.
  • 31. Both Optimizations: Runtime vs #Threads. Observation: • Go's relative performance improves as the #threads increases. Conclusions: • The cost of goroutines in Go is extremely low. • Go's performance may improve on problems with high data dependency. MatrixMul (#cores = 24, Problem size = 5K): effect of #threads on runtime
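The low goroutine cost noted above is easy to demonstrate: unlike OS threads, goroutines are multiplexed onto a small set of OS threads by the runtime, so heavily oversubscribing cores barely changes the result or the cost. A hypothetical sketch (not the thesis benchmark):

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares splits the work of summing 1^2..n^2 across m goroutines.
// m can comfortably exceed the core count: each goroutine handles a
// strided slice of the range and writes to its own slot, so no locking
// is needed inside the loop.
func sumSquares(n, m int) int64 {
	partial := make([]int64, m)
	var wg sync.WaitGroup
	for g := 0; g < m; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := g + 1; i <= n; i += m { // strided partition of 1..n
				partial[g] += int64(i) * int64(i)
			}
		}(g)
	}
	wg.Wait()
	var total int64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	// Same answer with 4 goroutines or 4096; spawning thousands of
	// goroutines adds little overhead compared to OS threads.
	fmt.Println(sumSquares(100000, 4))
	fmt.Println(sumSquares(100000, 4096))
}
```

This is why the experiments could hold #threads = 24 while sweeping #cores: creating more goroutines than cores costs Go little, whereas the equivalent pthread oversubscription in C is far more expensive.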
  • 32. Predictability of Actual Performance. • Objective: To determine how Go compares to C with regard to multicore predictability as we change the #cores, #threads, and problem size • Observations (in Go): – Model exhibits better accuracy – Memory contention does not fluctuate as #cores changes – Measurements are consistent with assumptions as problem size changes • Result: Go exhibits properties useful for prediction that C does not.
  • 33. Predictability of Performance: Modeled vs Measured. Observations: • Contention Error – C (Avg: 15%, Max: 55%) – Go (Avg: 3%, Max: 14%) • Parallelism Error – C (Avg: 18%, Max: 44%) – Go (Avg: 6%, Max: 15%) • Runtime Error – C (Avg: 16%, Max: 47%) – Go (Avg: 5%, Max: 13%). Conclusion: • Go has better predictability than C. MatrixMul -O3 (#threads = 24, P = 17K): effect of #cores on contention factor
  • 34. Predictability of Performance: Contention vs #Cores. Observations: • In C, contention fluctuates (0-5.6) • Not so much in Go (0-1). Conclusion: • Likely due to Garbage Collection and Channel Utilization • A contention factor can be easily bounded in Go to guarantee the performance of some other program. MatrixMul -O3 (#threads = 24, P = 17K): effect of #cores on contention factor
  • 35. Predictability of Performance: Modeling across problem sizes. • Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction?
  • 36. Predictability of Performance: Problem size vs Exploited Parallelism. Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on exploited parallelism. Observations (in Go): • Exploited parallelism only decreases slightly as problem size increases
  • 37. Predictability of Performance: Problem size vs Data Dependency. Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on data dependency. Observations (in Go): • Data dependency decreases as expected as problem size increases
  • 38. Predictability of Performance: Problem size vs Contention. Go and C MatrixMul (#threads = 24, P = 17K): effect of problem size on contention. Observations (in Go): • Memory contention only increases slightly as problem size increases. Conclusion: • Measurement inputs on small problems are more accurate in Go than in C
  • 39. Conclusion. 1. How does Go compare to C in a multicore environment? Go's Actual Performance – Comparable performance before, inferior performance after programmer optimization – Consequence of different levels of optimization – Performance margin decreases as the problem size increases on intermediate core numbers – Cost of goroutines much lower than threads. Go's Predicted Performance – Model exhibits better accuracy – Memory contention does not fluctuate as #cores changes – Measurements consistent with assumptions as problem size changes
  • 40. Conclusion. 2. Is the model extensible beyond C, traditional multicores, and high contention? – Modified / Validated for low contention problems – Validated for the Go language – Validated for ARM devices 3. Can we make the model easier to use? – Formally defined validation criteria – Wrote script to perform model validation – Wrote script to perform performance prediction – *Future Work* Front end for prediction
  • 41. Compiler Optimization: Runtime vs #Cores. Observations: • Sequential: Go is 31% slower • Parallel: Go is 0-28% slower • On UMA, the times ratio decreases as #cores increases. MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on runtime and on times difference
  • 42. Reasons. Observations (in Go): 1. Instructions executed: 4.5x as many 2. #Cycles: sequential (30% higher), parallel (similar) 3. Cache Misses: sequential (10% higher), parallel (46% less)
  • 43. Compiler Optimization: Parallelism vs #Cores. Observations: • Go's speedup is higher than C's on its own base, but lower when normalized. • Secondary Objective: Given that Go has a higher own-base speedup, could it beat C if we increase the problem size? MatrixMul -O3 (#threads = 24, P size = 5K): effect of #cores on exploited parallelism and on normalized speedup
  • 44. Sequential Optimization (chart): No optimization; Compiler optimization; Compiler + Programmer optimization
  • 45. Predictability of Performance: Modeling across problem sizes. • Objective: Can we perform measurements on smaller problem sizes to reduce the runtime of parallelism prediction? • Observation: The performance profiles in Go are consistent with expectations as problem size changes • Result: Measurement inputs on small problems are more accurate in Go than in C