SlideShare a Scribd company logo
Directive‐based Auto‐tuning 
for the Finite Difference Method 
on the Xeon Phi
Takahiro Katagiri, Satoshi Ohshima,
Masaharu Matsumoto
(Information Technology Center, The University of Tokyo)
1
iWAPT2015, Hyderabad International Convention Centre, Hyderabad, INDIA
Session 2, 13:30 ‐14:00
May 29th, 2015 
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Background
• High‐Thread Parallelism (HTP)
– Multi‐core and many‐core processors are 
pervasive. 
• Multicore CPUs: 16‐32 cores, 32‐64 Threads with 
Hyper Threading (HT) or Simultaneous Multithreading 
(SMT)
• Many Core CPU: Xeon Phi – 60 cores, 240 Threads 
with HT.
– Utilizing parallelism with full‐threads is 
important.
4
 Performance Portability (PP)
◦ Keeping high performance in multiple computer environments.
 Not only multiple CPUs, but also multiple compilers.
 Run‐time information, such as loop length and 
number of threads, is important.
◦ Auto‐tuning (AT) is one of candidates technologies to establish 
PP in multiple computer environments. 
ppOpen‐HPC Project
• Middleware for HPC and Its AT Technology
– Supported by JST, CREST, from FY2011 to FY2015.
– PI: Professor Kengo Nakajima (U. Tokyo)
• ppOpen‐HPC 
– An open source infrastructure for reliable simulation codes on 
post‐peta (pp) scale parallel computers.
– Consists of various types of libraries, which covers 
5 kinds of discretization methods for scientific computations. 
• ppOpen‐AT 
– An auto‐tuning language for ppOpen‐HPC codes 
– Using knowledge of previous project: ABCLibScript Project.
– Auto‐tuning language based on directives for AT.
5
Software Architecture of 
ppOpen‐HPC
6
FVM DEMFDMFEM
Many‐core 
CPUs
GPUs
Multi‐core
CPUs
MG
COMM
Auto‐Tuning Facility
Code Generation for Optimization Candidates
Search for the best candidate
Automatic Execution for the optimization
ppOpen‐APPL
ppOpen‐MATH
BEM
ppOpen‐AT
User Program
GRAPH VIS MP
STATIC DYNAMIC
ppOpen‐SYS FT
Optimize 
memory 
accesses 
Design Philosophy of ppOpen‐AT 
• A Directive‐base AT Language
– Multiple regions can be specified.
– Low cost description with directives.
– Do not prevent original code execution.
• AT Framework Without Script Languages and 
OS Daemons (Static Code Generation)
– Simple: Minimum software stack is required. 
– Useful: Our targets are supercomputers in operation, 
or just after development.
– Low Overhead: We DO NOT use code generator with 
run‐time feedback.
– Reasons: 
• Since supercomputer centers do not accept any OS kernel 
modification, and any daemons in login or compute nodes. 
• Need to reduce additional loads of the nodes.
• Environmental issues, such as batch job script, etc. 7
ppOpen‐AT System
ppOpen‐APPL /*
ppOpen‐AT
Directives
User 
KnowledgeLibrary 
Developer
① Before 
Release‐time
Candidate
1
Candidate
2
Candidate
3
Candidate
nppOpen‐AT
Auto‐Tuner
ppOpen‐APPL / *
Automatic
Code
Generation②
:Target 
Computers
Execution Time④
Library User
③
Library Call
Selection
⑤
⑥
Auto‐tuned
Kernel
Execution
Run‐
time
This user 
benefited 
from AT.
Scenario of AT for ppOpen‐APPL/FDM
9
Execution with optimized
kernels without AT process.
Library User
Set AT parameter,
and execute the library
(OAT_AT_EXEC=1)
■Execute auto-tuner:
With fixed loop lengths
(by specifying problem size and number of
MPI processes and OpenMP threads)
Measurement of timings for target kernels.
Store the best candidate information.
Set AT parameters, and
execute the library
(OAT_AT_EXEC=0)
Store the fastest kernel
information
Using the fastest kernel without AT 
(except for varying problem size and 
number of MPI processes and OpenMP 
threads.)
Specify problem size,
number of MPI processes
and OpenMP threads.
AT Timings of ppOpen‐AT (FIBER Framework)
OAT_ATexec()
…
do i=1, MAX_ITER
Target_kernel_k()
…
Target_kernel_m()
…
enddo
Read the best parameter
Is this first call?
Yes
Read the best parameter
Is this first call?
Yes
One time execution (except 
for varying problem size and 
number of MPI processes )
AT for Before Execute‐time
Execute Target_Kernel_k() with varying 
parameters
Execute Target_Kernel_m() with varying 
parameters
parameter 
Store the best 
parameter 
…
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Target Application
• Seism_3D: 
Simulation for seismic wave analysis.
• Developed by Professor Furumura 
at the University of Tokyo.
–The code is re‐constructed as 
ppOpen‐APPL/FDM.
• Finite Differential Method (FDM) 
• 3D simulation
–3D arrays are allocated.
• Data type: Single Precision (real*4) 12
Flow Diagram of 
ppOpen‐APPL/FDM
13
),,,(
}],,,)
2
1
({},,)
2
1
({[
1
),,(
2/
1
zyxqp
zyxmxzyxmxc
x
zyx
dx
d
pqpq
M
m
m
pq



 


 Space difference by FDM.
),,(,
12
1
2
1
zyxptf
zyx
uu n
p
n
zp
n
yp
n
xpn
p
n
p 

















 


 Explicit time expansion by central 
difference.
Initialization
Velocity Derivative (def_vel)
Velocity Update (update_vel)
Stress Derivative (def_stress)
Stress Update (update_stress)
Stop Iteration?
NO
YES
End
Velocity PML condition (update_vel_sponge)
Velocity Passing (MPI) (passing_vel)
Stress PML condition (update_stress_sponge)
Stress Passing (MPI) (passing_stress)
Target Loop Characteristics
• Triple‐nested loops
!$omp parallel do 
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
<Codes from FDM>
end do
end do
end do
!$omp end parallel do
OpenMP directive to 
the outer loop (Z‐axis)
Loop lengths are varied 
according to problem size, 
the number of 
MPI processes and 
OpenMP threads. 
The codes can be separable 
by loop split.
What is separable codes?
Variable definitions and references are separated.
There is a flow‐dependency, but 
no data dependency between each other.
d1 = …
d2 = …
…
dk = …
…
… = … d1 …
… = … d2 …
…
… = … dk …
Variable 
definitions
Variable 
references
d1 = …
d2 = …
… = … d1 …
… = … d2 …
…
…
dk = …
…
… = … dk …
Split to
Two Parts
ppOpen‐AT Directives
: Loop Split & Fusion with data‐flow dependence  
16
!oat$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
RL  = LAM (I,J,K);   RM  = RIG (I,J,K);   RM2 = RM + RM
RLTHETA  = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef  region  start 
QG  = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef  region  end
SXX (I,J,K) = ( SXX (I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
SYY (I,J,K) = ( SYY (I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
SZZ (I,J,K) = ( SZZ (I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint  (K, J, I)
STMP1 = 1.0/RIG(I,J,K);  STMP2 = 1.0/RIG(I+1,J,K);  STMP4 = 1.0/RIG(I,J,K+1)
STMP3 = STMP1 + STMP2
RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert
SXY (I,J,K) = ( SXY (I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
SXZ (I,J,K) = ( SXZ (I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
SYZ (I,J,K) = ( SYZ (I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
END DO;  END DO;  END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
Re‐calculation is defined.
Using the re‐calculation 
is defined.
Loop Split Point
Specify Loop Split and Loop Fusion
Re‐ordering of Statements
:  increase a chance to optimize 
register allocation by compiler
17
!OAT$ RotationOrder sub region start
Sentence i
Sentence ii
!OAT$ RotationOrder sub region end
!OAT$ RotationOrder sub region start
Sentence 1
Sentence 2
!OAT$ RotationOrder sub region end
Sentence i
Sentence 1
Sentence ii
Sentence 2
Generated Code
The Kernel 1 (update_stress)
• m_stress.f90(ppohFDM_update_stress)
!OAT$ call OAT_BPset("NZ01")
!OAT$ install LoopFusionSplit region start
!OAT$ name ppohFDMupdate_stress
!OAT$ debug (pp)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,
DXVZDZVX1,DYVZDZV1)
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
RL1   = LAM (I,J,K)
!OAT$ SplitPointCopyDef sub region start
RM1   = RIG (I,J,K)
!OAT$ SplitPointCopyDef sub region end
RM2   = RM1 + RM1;  RLRM2 = RL1+RM2; 
DXVX1 = DXVX(I,J,K);  DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
D3V3  = DXVX1 + DYVY1 + DZVZ1
SXX (I,J,K) = SXX (I,J,K) + (RLRM2*(D3V3)‐RM2*(DZVZ1+DYVY1) ) * DT
SYY (I,J,K) = SYY (I,J,K)  + (RLRM2*(D3V3)‐RM2*(DXVX1+DZVZ1) ) * DT
SZZ (I,J,K) = SZZ (I,J,K)  + (RLRM2*(D3V3)‐RM2*(DXVX1+DYVY1) ) * DT
Measured B/F=3.2
The Kernel 1 (update_stress)
• m_stress.f90(ppohFDM_update_stress)
!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
end do
end do
end do
!$omp end parallel do
!OAT$ install LoopFusionSplit region end
Automatic Generated Codes for 
the kernel 1
ppohFDM_update_stress
 #1 [Baseline]: Original 3-nested Loop
 #2 [Split]: Loop Splitting with K-loop
(Separated, two 3-nested loops)
 #3 [Split]: Loop Splitting with J-loop
 #4 [Split]: Loop Splitting with I-loop
 #5 [Split&Fusion]: Loop Fusion to #1 for K and J-loops
(2-nested loop)
 #6 [Split&Fusion]: Loop Fusion to #2 for K and J-Loops
(2-nested loop)
 #7 [Fusion]: Loop Fusion to #1
(loop collapse)
 #8 [Split&Fusion]: Loop Fusion to #2
(loop collapse, two one-nest loop)
The Kernel 2 (update_vel)
• m_velocity.f90(ppohFDM_update_vel)
!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_vel
!OAT$ debug (pp)
!$omp parallel do private(k,j,i,ROX,ROY,ROZ)
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
! Effective Density
!OAT$ RotationOrder sub region start
ROX = 2.0_PN/( DEN(I,J,K) + DEN(I+1,J,K) )
ROY = 2.0_PN/( DEN(I,J,K) + DEN(I,J+1,K) )
ROZ = 2.0_PN/( DEN(I,J,K) + DEN(I,J,K+1) )
!OAT$ RotationOrder sub region end
!OAT$ RotationOrder sub region start
VX(I,J,K) = VX(I,J,K) + ( DXSXX(I,J,K)+DYSXY(I,J,K)+DZSXZ(I,J,K) )*ROX*DT
VY(I,J,K) = VY(I,J,K) + ( DXSXY(I,J,K)+DYSYY(I,J,K)+DZSYZ(I,J,K) )*ROY*DT
VZ(I,J,K) = VZ(I,J,K) + ( DXSXZ(I,J,K)+DYSYZ(I,J,K)+DZSZZ(I,J,K) )*ROZ*DT
!OAT$ RotationOrder sub region end
end do;  end do;  end do
!$omp end parallel do
!OAT$ install LoopFusion region end
Measured B/F=1.7
Automatic Generated Codes for 
the kernel 2
ppohFDM_update_vel
• #1 [Baseline]: Original 3‐nested Loop.
• #2 [Fusion]:  Loop Fusion for K and J‐Loops.
(2‐nested loop)
• #3 [Fusion]: Loop Split for K, J, and I‐Loops.
(Loop Collapse)
• #4 [Fusion&Re‐order]:  
Re‐ordering of sentences to #1.   
• #5 [Fusion&Re‐order]:  
Re‐ordering of sentences to #2. 
• #6 [Fusion&Re‐order]:  
Re‐ordering of sentences to #3. 
The Kernel 3 (update_stress_sponge)
• m_stress.f90(ppohFDM_update_stress_sponge)
!OAT$ call OAT_BPset("NZ01")
!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_sponge
!OAT$ debug (pp)
!$omp parallel do private(k,gg_z,i,gg_y,gg_yz,i,gg_x,gg_xyz)
do k = NZ00, NZ01
gg_z = gz(k)
do j = NY00, NY01
gg_y = gy(j); gg_yz = gg_y * gg_z; 
do i = NX00, NX01
gg_x = gx(i);  gg_xyz = gg_x * gg_yz; 
SXX(I,J,K)   = SXX(I,J,K) * gg_xyz;  SYY(I,J,K)   = SYY(I,J,K) * gg_xyz
SZZ(I,J,K)   = SZZ(I,J,K) * gg_xyz;   SXY(I,J,K)   = SXY(I,J,K) * gg_xyz
SXZ(I,J,K)   = SXZ(I,J,K) * gg_xyz;   SYZ(I,J,K)   = SYZ(I,J,K) * gg_xyz
end do
end do
end do
!$omp end parallel do
!OAT$ install LoopFusion region end
Measured B/F=6.8
Automatic Generated Codes for
the kernel 3
ppohFDM_update_sponge
• #1 [Baseline]: Original 3‐nested loop
• #2 [Fusion]:  Loop Fusion for K and J‐loops
(2‐nested loop)
• #3 [Fusion]: Loop Split for K, J, and I‐loops
(Loop Collapse)
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
An Example of Seism_3D Simulation 
 West part earthquake in Tottori prefecture in Japan 
at year 2000. ([1], pp.14)
 The region of 820km x 410km x 128 km is discretized with 0.4km.
 NX x NY x NZ = 2050 x 1025 x 320 ≒ 6.4 : 3.2 : 1.
[1] T. Furumura, “Large‐scale Parallel FDM Simulation for Seismic Waves and Strong Shaking”, Supercomputing News, 
Information Technology Center, The University of Tokyo, Vol.11, Special Edition 1, 2009. In Japanese.
Figure : Seismic wave translations in west part earthquake in Tottori prefecture in Japan. 
(a) Measured waves; (b) Simulation results; (Reference : [1] in pp.13)  
AT Candidates in This Experiment
1. Kernel update_stress
– 8 Kinds of Candidates with Loop fusion and Loop Split.
2. Kernel update_vel
– 6 Kinds of Candidates with Loop Fusion and Re‐ordering of 
Statements.
 3 Kinds of Candidates with Loop Fusion.
3. Kernel update_stress_sponge
4. Kernel update_vel_sponge
5. Kernel ppohFDM_pdiffx3_p4
6. Kernel ppohFDM_pdiffx3_m4
7. Kernel ppohFDM_pdiffy3_p4
8. Kernel ppohFDM_pdiffy3_m4
9. Kernel ppohFDM_pdiffz3_p4
10. Kernel ppohFDM_pdiffz3_m4 27
3 Kinds of Candidates with Loop Fusion for 
Data Packing and Data Unpacking.
11. Kernel ppohFDM_ps_pack
12. Kernel ppohFDM_ps_unpack
13. Kernel ppohFDM_pv_pack
14. Kernel ppohFDM_pv_unpack
28
AT Candidates in This Experiment
(Cont’d)
ppohFDM_ps_pack (ppOpen‐APPL/FDM)
• m_stress.f90(ppohFDM_ps_pack)
!OAT$ call OAT_BPset("NZ01")
!OAT$ install LoopFusion region start
!OAT$ name ppohFDMupdate_sponge
!OAT$ debug (pp)
!$omp parallel do private(k,j,iptr,i)
do k=1, NZP
iptr = (k‐1)*3*NYP*NL2
do j=1, NYP
do i=1, NL2
i1_sbuff(iptr+1) = SXX(NXP‐NL2+i,j,k)
i1_sbuff(iptr+2) = SXY(NXP‐NL2+i,j,k)
i1_sbuff(iptr+3) = SXZ(NXP‐NL2+i,j,k)
i2_sbuff(iptr+1) = SXX(i,j,k)
i2_sbuff(iptr+2) = SXY(i,j,k)
i2_sbuff(iptr+3) = SXZ(i,j,k)
iptr = iptr + 3
end do
end do
end do
!$omp end parallel do
ppohFDM_ps_pack (ppOpen‐APPL/FDM)
• m_stress.f90(ppohFDM_ps_pack)
!$omp parallel do private(k,j,iptr,i)
do k=1, NZP
iptr = (k‐1)*3*NL2*NXP
do j=1, NL2
do i=1, NXP
j1_sbuff(iptr+1) = SXY(i,NYP‐NL2+j,k);   j1_sbuff(iptr+2) = SYY(i,NYP‐NL2+j,k)
j1_sbuff(iptr+3) = SYZ(i,NYP‐NL2+j,k)
j2_sbuff(iptr+1) = SXY(i,j,k);   j2_sbuff(iptr+2) = SYY(i,j,k); j2_sbuff(iptr+3) = SYZ(i,j,k)
iptr = iptr + 3
end do;   end do;   end do
!$omp end parallel do
!$omp parallel do private(k,j,iptr,i)
do k=1, NL2
iptr = (k‐1)*3*NYP*NXP
do j=1, NYP
do i=1, NXP
k1_sbuff(iptr+1) = SXZ(i,j,NZP‐NL2+k);  k1_sbuff(iptr+2) = SYZ(i,j,NZP‐NL2+k)
k1_sbuff(iptr+3) = SZZ(i,j,NZP‐NL2+k)
k2_sbuff(iptr+1) = SXZ(i,j,k);   k2_sbuff(iptr+2) = SYZ(i,j,k); k2_sbuff(iptr+3) = SZZ(i,j,k)
iptr = iptr + 3
end do;  end do;  end do
!$omp end parallel do
!OAT$ install LoopFusion region end
Machine Environment 
(8 nodes of the Xeon Phi)
 The Intel Xeon Phi 
 Xeon Phi 5110P (1.053 GHz), 60 cores
 Memory Amount:8 GB (GDDR5)
 Theoretical Peak Performance:1.01 TFLOPS
 One board per node of the Xeon phi cluster
 InfiniBand FDR x 2 Ports 
 Mellanox Connect‐IB
 PCI‐E Gen3 x16
 56Gbps x 2
 Theoretical Peak bandwidth 13.6GB/s
 Full‐Bisection
 Intel MPI
 Based on MPICH2, MVAPICH2
 4.1 Update 3 (build 048)
 Compiler:Intel Fortran version 14.0.0.080 Build 20130728
 Compiler Options:
‐ipo20 ‐O3 ‐warn all ‐openmp ‐mcmodel=medium ‐shared‐intel –mmic
‐align array64byte
 KMP_AFFINITY=granularity=fine, balanced (Uniform Distribution of threads 
between sockets)
Execution Details
• ppOpen‐APPL/FDM ver.0.2
• ppOpen‐AT ver.0.2
• The number of time step: 2000 steps
• The number of nodes: 8 node
• Native Mode Execution
• Target Problem Size 
(Almost maximum size with 8 GB/node)
– NX * NY * NZ = 1536 x  768 x 240 / 8 Node
– NX * NY * NZ = 768 * 384 * 120 / node
(!= per MPI Process)
• The number of iterations for kernels 
to do auto‐tuning: 100
Execution Details of Hybrid 
MPI/OpenMP
• Target MPI Processes and OMP Threads on the Xeon Phi
– The Xeon Phi with 4 HT (Hyper Threading) 
– PX TY: X MPI Processes and Y Threads per process
– P8T240 : Minimum Hybrid MPI/OpenMP execution for 
ppOpen‐APPL/FDM, since it needs minimum 8 MPI Processes.
– P16T120
– P32T60
– P64T30
– P128T15  
– P240T8
– P480T4
– Less than P960T2 cause an MPI error in this environment.
#
0
#
1
#
2
#
3
#
4
#
5
#
6
#
7
#
8
#
9
#
1
0
#
1
1
#
1
2
#
1
3
#
1
4
#
1
5
P2T8
#
0
#
1
#
2
#
3
#
4
#
5
#
6
#
7
#
8
#
9
#
1
0
#
1
1
#
1
2
#
1
3
#
1
4
#
1
5
P4T4 Target of cores for 
one MPI Process
Loop Length per thread (Z‐axis)
(8 Nodes, 1536x768x240 / 8 Node )
0.5 1 2 2
4
6
12
Loop length per thread
MPI processes for Z‐axis are different 
between each MPI/OpenMP execution
due to software restriction.
PERFORMANCE ON 
WHOLE TIME
0
200
400
600
800
1000
1200
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Others
IO
passing_stress
passing_velocity
update_vel_spo
nge
update_vel
update_stress_s
ponge
update_stress
def_stress
def_vel
Whole Time (without AT)[Seconds]
Comp.
Time
Comm.
Time
NX*NY*NZ = 1536x768x240/ 8 Node
Hybrid MPI/OpenMP
Phi : 8 Nodes (1920 Threads)
0
200
400
600
800
1000
1200
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Others
IO
passing_stress
passing_velocity
update_vel_spon
ge
update_vel
update_stress_s
ponge
update_stress
def_stress
def_vel
51.5% Speedups 
Whole Time(with AT)[Seconds]
Comp.
Time
Comm.
Time
12.1% Speedups 
3.3% Speedups 
5.9% Speedups 
4.8% Speedups 
51.7% Speedups 
40.0% Speedups 
NX*NY*NZ = 1536x768x240/ 8 Node
Hybrid MPI/OpenMP
Phi : 8 Nodes (1920 Threads)
Maximum Speedups by AT
(Xeon Phi, 8 Nodes)
558
200
171
30 20 51
Speedup [%]
Kinds of Kernels
Speedup  =  
max ( Execution time of original code / Execution time with AT )
for all combination of Hybrid MPI/OpenMP Executions (PXTY) 
NX*NY*NZ = 1536x768x240/ 8 Node
BREAK DOWN OF TIMINGS 
2.07 
1.08  1.48  1.30 
5.58 
4.65 
5.10 
0
2
4
6
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Speedups
AT Effect (update_stress)
(Accumulated time for 2000 steps )
113.41 
41.05  47.26  42.21 
218.49  217.49 
231.38 
54.81  38.12  32.02  32.40  39.16  46.76  45.37 
0
50
100
150
200
250
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Without AT With AT
[Seconds]
Best SW:6 Best SW:5 Best SW:5 Best SW:4 Best SW:6 Best SW:5 Best SW:6
5.6x
AT Effect for Pack and Unpack Optimization
(Whole time)
1,106 
731 
623 
493  521  590  503 
1,053 
721 
495  449  423  397  431 
0
500
1000
1500
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Without data pack and unpack AT
With data pack and unpack AT
[Seconds]
1.05  1.01 
1.26 
1.10  1.23 
1.49 
1.17 
0
0.5
1
1.5
2
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Speedups
1.5x
AT Effect for Re‐ordering of Statements
(update_vel, 100 times)
4.00 
2.09 
2.04 
2.58 
3.09  3.49  3.06 
4.20 
2.06 
2.04 
2.60 
3.38  3.82  3.24 
2.27 
2.08 
2.04 
2.65 
2.27 
2.69 
2.03 
0
1
2
3
4
5
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Baseline
Re‐orderling
Re‐ordering + loop fusion (nested two)
[Seconds]
1.76 
1.01  1.00  0.97 
1.36  1.30 
1.51 
0
0.5
1
1.5
2
P8T240 P16T120 P32T60 P64T30 P128T15 P240T8 P480T4
Speedups by re‐ordering and loop fusion
1.8x
Only re‐ordering is not so effective.
But applying loop fusion at the same time 
is effective. 
COMPARISON WITH 
DIFFERENT CPU ARCHITECTURES
AT Effect (without AT)
1
1.05 
1.03 
0.97
0.98
0.99
1
1.01
1.02
1.03
1.04
1.05
1.06
Speedup to Phi
Xeon Phi Ivy Bridge_HT1 FX10
P240T8 P80T1 P128T1
Fast
8 Nodes
Whole Time
NX*NY*NZ 
= 1536x768x240
/ 8 Node
Hybrid MPI/OpenMP
Phi : 8 Nodes (1920 Threads)
Ivy : 8 Nodes (80 Threads)
FX10 : 8 Nodes (128 Threads)
AT Effect (with AT)
1
0.78 
0.68 
0
0.2
0.4
0.6
0.8
1
1.2
Speedup to Phi
Xeon Phi Ivy Bridge_HT1 FX10
P240T8 P80T2 P128T1
Fast
Xeon Phi 
is the fastest if we apply AT
8 NodesWhole Time
NX*NY*NZ 
= 1536x768x240
/ 8 Node
AUTO‐TUNING TIME AND
THE BEST IMPLEMENTATION
Auto‐tuning Time
553
303 320
0
100
200
300
400
500
600
Xeon Phi Ivy Bridge FX10
AT Time [seconds]
update_stress
10.9%
8.3% 10.1%
AT overhead to total time 
(2000 time steps)  426
189 185
0
100
200
300
400
500
Xeon Phi Ivy Bridge FX10
AT Time [seconds]
update_vel
8.4%
5.2% 5.8%
AT overhead to total time 
(2000 time steps) 
なぜ、カーネル数が違うか不明
(バージョンが違う?)
Parameter Distribution (update_stress)
8 Nodes
0
1
2
3
4
5
6
7
8
9
0 1 2 3 4 5 6 7 8
Phi Ivy Bridge_HT1 FX10
MPI Parallelism
Parameter Value
NX*NY*NZ = 1536x768x240/ 8 Node
Parameter Distribution (update_vel)
8 Nodes
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7 8
Phi Ivy Bridge_HT1 FX10
MPI Parallelism
Parameter Value
NX*NY*NZ = 1536x768x240/ 8 Node
Parameter Distribution (update_stress_sponge)
8 Nodes
0
1
2
3
4
0 1 2 3 4 5 6 7 8
Phi Ivy Bridge_HT1 FX10
MPI Parallelism
Parameter Value
NX*NY*NZ = 1536x768x240/ 8 Node
The Fastest Code (update_stress)
 Xeon Phi (P240T8)
#5 [Split&Fusion]: Loop 
Fusion to #1 for K and J‐loops 
(2‐nested loop)
!$omp parallel do private 
(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V
3,DXVYDYVX1,DXVZDZVX1,DYVZDZV1)
DO k_j = 1 , (NZ01‐NZ00+1)*(NY01‐NY00+1)
k = (k_j‐1)/(NY01‐NY00+1) + NZ00;  
j = mod((k_j‐1),(NY01‐NY00+1)) + NY00; 
DO i = NX00, NX01
RL1 = LAM (I,J,K);  RM1 = RIG (I,J,K);  
RM2 = RM1 + RM1; RLRM2 = RL1+RM2;
DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); 
DZVZ1 = DZVZ(I,J,K); 
D3V3  = DXVX1 + DYVY1 + DZVZ1;
SXX (I,J,K) = SXX (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DZVZ1+DYVY1) ) * DT
SYY (I,J,K) = SYY (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DXVX1+DZVZ1) ) * DT
SZZ (I,J,K) = SZZ (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DXVX1+DYVY1) ) * DT
DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K); 
DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K); 
DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
END DO
END DO
!$omp end parallel do
 Ivy Bridge(P80T2)  FX10 (P128T1)
#1 [Baseline]: Original Loop#4 [Split]: Loop Splitting 
with I‐loop 
!$omp parallel do private 
(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V
3,DXVYDYVX1,DXVZDZVX1,DYVZDZV1)
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
RL1   = LAM (I,J,K);  RM1   = RIG (I,J,K);
RM2   = RM1 + RM1; RLRM2 = RL1+RM2
DXVX1 = DXVX(I,J,K);  DYVY1 = DYVY(I,J,K)
DZVZ1 = DZVZ(I,J,K)
D3V3  = DXVX1 + DYVY1 + DZVZ1
SXX (I,J,K) = SXX (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DZVZ1+DYVY1) ) * DT
SYY (I,J,K) = SYY (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DXVX1+DZVZ1) ) * DT
SZZ (I,J,K) = SZZ (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DXVX1+DYVY1) ) * DT
end do
do i = NX00, NX01
RM1   = RIG (I,J,K)
DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
end do
end do
end do
!$omp end parallel do
!$omp parallel do private 
(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V
3,DXVYDYVX1,DXVZDZVX1,DYVZDZV1)
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
RL1   = LAM (I,J,K);  RL1   = LAM (I,J,K)
RM1   = RIG (I,J,K);  RM2   = RM1 + RM1
RLRM2 = RL1+RM2; 
DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K)
DZVZ1 = DZVZ(I,J,K)
D3V3  = DXVX1 + DYVY1 + DZVZ1
SXX (I,J,K) = SXX (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DZVZ1+DYVY1) ) * DT
SYY (I,J,K) = SYY (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DXVX1+DZVZ1) ) * DT
SZZ (I,J,K) = SZZ (I,J,K) 
+ (RLRM2*(D3V3)‐RM2*(DXVX1+DYVY1) ) * DT
DXVYDYVX1 = DXVY(I,J,K)+DYVX(I,J,K)
DXVZDZVX1 = DXVZ(I,J,K)+DZVX(I,J,K)
DYVZDZVY1 = DYVZ(I,J,K)+DZVY(I,J,K)
SXY (I,J,K) = SXY (I,J,K) + RM1 * DXVYDYVX1 * DT
SXZ (I,J,K) = SXZ (I,J,K) + RM1 * DXVZDZVX1 * DT
SYZ (I,J,K) = SYZ (I,J,K) + RM1 * DYVZDZVY1 * DT
end do
end do
end do
!$omp end parallel do
The Fastest Code (update_vel)
 Xeon Phi (P240T8)
#5 [Fusion&Re‐order]:  
Re‐ordering of sentences to 
#2. 
!$omp parallel do private (i,j,k,ROX,ROY,ROZ)
DO k_j = 1 , (NZ01‐NZ00+1)*(NY01‐NY00+1)
k = (k_j‐1)/(NY01‐NY00+1) + NZ00
j = mod((k_j‐1),(NY01‐NY00+1)) + NY00
do i = NX00, NX01
ROX = 2.0_PN/( DEN(I,J,K) + DEN(I+1,J,K) )
VX(I,J,K) = VX(I,J,K) &
+ ( DXSXX(I,J,K)+DYSXY(I,J,K)+DZSXZ(I,J,K) )*ROX*DT
ROY = 2.0_PN/( DEN(I,J,K) + DEN(I,J+1,K) )
VY(I,J,K) = VY(I,J,K) &
+ ( DXSXY(I,J,K)+DYSYY(I,J,K)+DZSYZ(I,J,K) )*ROY*DT
ROZ = 2.0_PN/( DEN(I,J,K) + DEN(I,J,K+1) )
VZ(I,J,K) = VZ(I,J,K) &
+ ( DXSXZ(I,J,K)+DYSYZ(I,J,K)+DZSZZ(I,J,K) )*ROZ*DT
end do
end do
!$omp end parallel do
 Ivy Bridge(P80T2)  FX10 (P128T1)
#1 [Baseline]: Original Loop#5 [Fusion&Re‐order]:  
Re‐ordering of sentences to #2. 
!$omp parallel do private (i,j,k,ROX,ROY,ROZ)
do k = NZ00, NZ01
do j = NY00, NY01
do i = NX00, NX01
ROX = 2.0_PN/( DEN(I,J,K) + DEN(I+1,J,K) )
ROY = 2.0_PN/( DEN(I,J,K) + DEN(I,J+1,K) )
ROZ = 2.0_PN/( DEN(I,J,K) + DEN(I,J,K+1) )
VX(I,J,K) = VX(I,J,K)  + 
( DXSXX(I,J,K)+DYSXY(I,J,K)+DZSXZ(I,J,K) )*ROX*DT
VY(I,J,K) = VY(I,J,K) + 
( DXSXY(I,J,K)+DYSYY(I,J,K)+DZSYZ(I,J,K) )*ROY*DT
VZ(I,J,K) = VZ(I,J,K) + 
( DXSXZ(I,J,K)+DYSYZ(I,J,K)+DZSZZ(I,J,K) )*ROZ*DT
end do
end do
end do
!$omp end parallel do
RELATED WORK
Originality (AT Languages)
AT Language 
/ Items
#
1
#
2
#
3
#
4
#
5
#
6
#
7
ppOpen‐AT OAT Directives ✔ ✔ ✔ ✔ None
Vendor Compilers Out of Target Limited ‐
Transformation 
Recipes 
Recipe
Descriptions
✔ ✔ ChiLL
POET Xform Description ✔ ✔ POET translator, ROSE
X language 
Xlang Pragmas ✔ ✔ X Translation,
‘C and tcc
SPL SPL Expressions ✔ ✔ ✔ A Script Language
ADAPT
ADAPT 
Language
✔ ✔ Polaris
Compiler 
Infrastructure, 
Remote Procedure 
Call (RPC)
Atune‐IL
Atune Pragmas ✔ A Monitoring 
Daemon
PEPPHER
PEPPHER Pragmas
(interface)
✔ ✔ ✔ PEPPHER task graph
and run-time
Xevolver
Directive Extension
(Recipe Descriptions)
(✔) (✔) (✔) ROSE,
XSLT Translator
#1: Method for supporting multi-computer environments. #2: Obtaining loop length in run-time.
#3: Loop split with increase of computations, and loop fusion to the split loop.
#4: Re-ordering of inner-loop sentences. #5: Algorithm selection.
#6: Code generation with execution feedback. #7: Software requirement.
(Users need to define rules. )
Outline
1. Background and ppOpen‐AT Functions
2. Code Optimization Strategy
3. Performance Evaluation
4. Conclusion
Conclusion Remarks
• Loop transformation with loop fusion and 
loop split is a key technology for FDM codes 
to obtain high performance on many‐core 
architecture.
• AT with static code generation (with dynamic 
code selection) is a key technology for AT with 
minimum software stack for supercomputers 
in operation. 
To obtain free code (MIT Licensing):
http://ppopenhpc.cc.u‐tokyo.ac.jp/
Future Work
• Improving Search Algorithm
– We use a brute‐force search in current implementation.
• But it is enough by applying knowledge of application.
– We have implemented a new search algorithm.
• d‐Spline based search algorithm (a performance model) ,
collaborated with Prof. Tanaka (Kogakuin U.)
• Code Selection Between Different Architectures
– Selection of codes between vector machine 
implementation and scalar machine implementation.
– Whole code selection on main routine is needed.
• Off‐loading Implementation Selection 
(for the Xeon Phi)
– If problem size is too small to do off‐loading, 
the target execution is performed on CPU automatically.
Thank you for your attention!
Questions?
http://ppopenhpc.cc.u‐tokyo.ac.jp/
References 
(Related to authors)
1. H. Kuroda, T. Katagiri, M. Kudoh, Y. Kanada, “ILIB_GMRES: An auto‐tuning 
parallel iterative solver for linear equations,” SC2001, 2001. (Poster.)
2. T. Katagiri, K. Kise, H. Honda, T. Yuba, “ABCLib_DRSSED: A parallel 
eigensolver with an auto‐tuning facility,” Parallel Computing, Vol. 32, Issue 
3, pp. 231–250, 2006.
3. T. Sakurai, T. Katagiri, K. Naono, H. Kuroda, K. Nakajima, M. Igai, S. Ohshima, 
S. Itoh, “Evaluation of auto‐tuning function on OpenATLib,” IPSJ SIG 
Technical Reports, Vol. 2011‐HPC‐130, No. 43, pp. 1–6, 2011. (in Japanese)
4. T. Katagiri, K. Kise, H. Honda, T. Yuba, “FIBER: A general framework for auto‐
tuning software,” The Fifth International Symposium on High Performance 
Computing (ISHPC‐V), Springer LNCS 2858, pp. 146–159, 2003.
5. T. Katagiri, S. Ito, S. Ohshima, “Early experiences for adaptation of auto‐
tuning by ppOpen‐AT to an explicit method,” Special Session: Auto‐Tuning 
for Multicore and GPU (ATMG) (In Conjunction with the IEEE MCSoC‐13), 
Proceedings of MCSoC‐13, 2013. 
6. T. Katagiri, S. Ohshima, M. Matsumoto, “Auto‐tuning of computation kernels 
from an FDM Code with ppOpen‐AT,” Special Session: Auto‐Tuning for 
Multicore and GPU (ATMG) (In Conjunction with the IEEE MCSoC‐14), 
Proceedings of MCSoC‐14, 2014. 

More Related Content

Viewers also liked

Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Takahiro Katagiri
 
ppOpen-AT : Yet Another Directive-base AT Language
ppOpen-AT : Yet Another Directive-base AT LanguageppOpen-AT : Yet Another Directive-base AT Language
ppOpen-AT : Yet Another Directive-base AT Language
Takahiro Katagiri
 
ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開
ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開
ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開
Takahiro Katagiri
 
Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...
Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...
Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...
Takahiro Katagiri
 
自動チューニングとビックデータ:機械学習の適用の可能性
自動チューニングとビックデータ:機械学習の適用の可能性自動チューニングとビックデータ:機械学習の適用の可能性
自動チューニングとビックデータ:機械学習の適用の可能性
Takahiro Katagiri
 
ソフトウェア自動チューニング研究紹介
ソフトウェア自動チューニング研究紹介ソフトウェア自動チューニング研究紹介
ソフトウェア自動チューニング研究紹介
Takahiro Katagiri
 

Viewers also liked (6)

Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
 
ppOpen-AT : Yet Another Directive-base AT Language
ppOpen-AT : Yet Another Directive-base AT LanguageppOpen-AT : Yet Another Directive-base AT Language
ppOpen-AT : Yet Another Directive-base AT Language
 
ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開
ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開
ppOpen-HPCコードを自動チューニングする言語ppOpen-ATの現状と新展開
 
Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...
Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...
Towards Auto‐tuning for the Finite Difference Method in Era of 200+ Thread Pa...
 
自動チューニングとビックデータ:機械学習の適用の可能性
自動チューニングとビックデータ:機械学習の適用の可能性自動チューニングとビックデータ:機械学習の適用の可能性
自動チューニングとビックデータ:機械学習の適用の可能性
 
ソフトウェア自動チューニング研究紹介
ソフトウェア自動チューニング研究紹介ソフトウェア自動チューニング研究紹介
ソフトウェア自動チューニング研究紹介
 

Similar to iWAPT2015_katagiri

Hyper Threading technology
Hyper Threading technologyHyper Threading technology
Hyper Threading technology
Karunakar Singh Thakur
 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tune
XiaoweiJiang7
 
Hyper threading
Hyper threadingHyper threading
Hyper threading
Anmol Purohit
 
C3 w3
C3 w3C3 w3
hpcpp.pptx
hpcpp.pptxhpcpp.pptx
hpcpp.pptx
pradhyumnpurohit1
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
Pavithra S
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
Karunamoorthy B
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
Ganesan Narayanasamy
 
KSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor designKSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor design
ssuser7dcef0
 
Hyper Threading Technology
Hyper Threading TechnologyHyper Threading Technology
Hyper Threading Technologynayakslideshare
 
A Parallel Computing-a Paradigm to achieve High Performance
A Parallel Computing-a Paradigm to achieve High PerformanceA Parallel Computing-a Paradigm to achieve High Performance
A Parallel Computing-a Paradigm to achieve High Performance
AM Publications
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CanSecWest
 
QuantumChemistry500
QuantumChemistry500QuantumChemistry500
QuantumChemistry500
Maho Nakata
 
Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technology
Nikhil Venugopal
 
TULIPP at the 10th Intelligent Imaging Event
TULIPP at the 10th Intelligent Imaging EventTULIPP at the 10th Intelligent Imaging Event
TULIPP at the 10th Intelligent Imaging Event
Sundance Multiprocessor Technology Ltd.
 
Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors ppt
Siddhartha Anand
 
Computer Architecture and Organization
Computer Architecture and OrganizationComputer Architecture and Organization
Computer Architecture and Organization
ssuserdfc773
 

Similar to iWAPT2015_katagiri (20)

Hyper Threading technology
Hyper Threading technologyHyper Threading technology
Hyper Threading technology
 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tune
 
Hyper threading
Hyper threadingHyper threading
Hyper threading
 
Mpmc
MpmcMpmc
Mpmc
 
C3 w3
C3 w3C3 w3
C3 w3
 
hpcpp.pptx
hpcpp.pptxhpcpp.pptx
hpcpp.pptx
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
KSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor designKSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor design
 
Hyper Threading Technology
Hyper Threading TechnologyHyper Threading Technology
Hyper Threading Technology
 
A Parallel Computing-a Paradigm to achieve High Performance
A Parallel Computing-a Paradigm to achieve High PerformanceA Parallel Computing-a Paradigm to achieve High Performance
A Parallel Computing-a Paradigm to achieve High Performance
 
20120140505010
2012014050501020120140505010
20120140505010
 
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
CSW2017Richard Johnson_harnessing intel processor trace on windows for vulner...
 
QuantumChemistry500
QuantumChemistry500QuantumChemistry500
QuantumChemistry500
 
Hyper threading technology
Hyper threading technologyHyper threading technology
Hyper threading technology
 
TULIPP at the 10th Intelligent Imaging Event
TULIPP at the 10th Intelligent Imaging EventTULIPP at the 10th Intelligent Imaging Event
TULIPP at the 10th Intelligent Imaging Event
 
Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors ppt
 
Computer Architecture and Organization
Computer Architecture and OrganizationComputer Architecture and Organization
Computer Architecture and Organization
 

Recently uploaded

Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
drwaing
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
SyedAbiiAzazi1
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
manasideore6
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
Online aptitude test management system project report.pdf
Online aptitude test management system project report.pdfOnline aptitude test management system project report.pdf
Online aptitude test management system project report.pdf
Kamal Acharya
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
aqil azizi
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
dxobcob
 

Recently uploaded (20)

Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
Online aptitude test management system project report.pdf
Online aptitude test management system project report.pdfOnline aptitude test management system project report.pdf
Online aptitude test management system project report.pdf
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
一比一原版(Otago毕业证)奥塔哥大学毕业证成绩单如何办理
 

iWAPT2015_katagiri