Molecular models, threads and you

1,168 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,168
On SlideShare
0
From Embeds
0
Number of Embeds
18
Actions
Shares
0
Downloads
29
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Molecular models, threads and you

  1. 1. Molecular Models, Threads and You Optimizing the TINKER classical molecular dynamics code while maintaining code readability Jiahao Chen Martínez Group Dept. Chemistry, CATMS, MRL and Beckman CS 498 MG presentation: 2007-12-07
  2. 2. Molecular models/force fields Typical energy function E = covalent bond effects + noncovalent interactions
  3. 3. Molecular models/force fields Typical energy function E= kb (rb − req,b )2+ κa (θa − θeq,a )2 + lnd cos (nπ) d∈dihedrals n a∈angles b∈bonds bond stretch angle torsion dihedrals + - 12 6 qi qj σij σij + + − ij rij rij rij i<j∈atoms i<j∈atoms electrostatics dispersion computation cost = O(N2)
  4. 4. Problem description • The state of the system is given by the position and momentum of every atom (of mass mi) (x1 , p1 , x2 , p2 , · · · , xN , pN ) ∈ R 3×2×N • Solve the system∂p partial differential equations of ∂x p ∂E i i i = =− , i = 1, · · · , N , ∂t mi ∂t ∂xi • with user-specified initial conditions (e.g. with constant temperature and pressure) • Subject to (user-specified) constraints, e.g. fixed bond angles
  5. 5. Many parallel and serial implementations Global Package name Threads MPI Arrays NAMD CHARM++ GROMACS ✓ ✓ TINKER AMBER partly ✓ ✓ CHARMM ✓ LAMMPS ✓ NWChem ✓ ✓
  6. 6. Things I tried • Compiler flags optimization • Cache miss reduction • Lookup tables • Parallelization with OpenMP
  7. 7. Compiler flag optimization flags gfortran 4.1.2 ifort 10.0.023 - - -O0 29.95(2) s 36.30(2) s 32.59(4) s -Os 29.92(3) s +0.77(3) % +10.22(2) % 32.12(3) s -O1 30.22(1) s -0.90(4) % +11.51(1) % -O2 29.66(3) s +0.96(1) % 30.30(2) s +16.54(2) % 30.83(2) s -O3 29.84(2) s +0.38(2) % +15.06(2) % +20.22(1)%2 CE search 28.77(2) s +3.62(3) %1 28.96(2) s 1. FFLAGS =”-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks -fdelete-null-pointer- checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize -fkeep-static-consts -fmerge-constants -fno- defer-pop -fno-guess-branch-probability -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -foptimize- register-move -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns -fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops” 2. FFLAGS =”-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias -fno-omit-frame-pointer -fkeep-static- consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep -funroll-loops -complex-limited-range”
  8. 8. Algorithm and time profile N=6 for each time step gfortran 4.1.2 >98% Initialize Remove Move one model and unphysical Flush I/O End time step parameters motions O(N) O(N2) Update Calculate Update Calculate & record Enforce Enforce state potential energy state kinetic energy and temp. & temp. & by t/2 and forces by t/2 properties pressure pressure >59% <31% O(N) O(N ) 2 Calculate Calculate Calculate Calculate Calculate Add up all ... bond angle dihedral dispersion charge compo- interactions interactions interactions interactions interactions nents 9% 12% 8% 37% 26%
  9. 9. An unexpected cost for each time step N=6 Q: WhyRemove15% is >98% Initialize Move one model and unphysical Flush I/O End of total execution time step parameters motions O(N ) Text time spent adding Calculate & record O(N) 2 Update Calculate Update Enforce Enforce numbers!? state potential energy state kinetic energy and temp. & temp. & by t/2 and forces by t/2 properties pressure pressure >59% <31% O(N) O(N ) 2 Add up all Calculate Calculate Calculate Calculate Calculate ... compo- bond angle dihedral dispersion charge nents interactions interactions interactions interactions interactions 9% 12% 8% 37% 26%
  10. 10. A: many L2 cache misses c zero out each of the first derivative components 7 do i = 1, n do j = 1, 3 42 deb(j,i) = 0.0d0 22 other ... end do terms end do ... c sum up to get the total energy and first derivatives energy = eb + ... do i = 1, n do j = 1, 3 desum(j,i) = deb(j,i) + ... 22 other 19 terms 2 derivs(j,i) = desum(j,i) end do end do 70 of 91 cache misses per time step (n = 6) shown
  11. 11. A simple solution c zero out each of the first derivative components 7 do i = 1, n do j = 1, 3 26 42 deb(j,i) = 0.0d0 ... end do end do ... c sum up to get the total energy and first derivatives energy = eb + ... do i = 1, n do j = 1, 3 6 temp = deb(j,i) + ... 1 19 desum(j,i) = temp 12 derivs(j,i) = temp end do end do reduced cache misses from 92 to 41 per time step
  12. 12. Speedup from reducing L2 cache misses flags gfortran 4.1.2 ifort 10.0.023 original 29.95(2) s 28.96(2) s with scalar 27.43(3) s 28.95(1) s replacement speedup +8.44(1) % +0.03(2) % ifort already called with scalar replacement flag
  13. 13. Lookup tables (LUTs) • Calculations of sqrt() and exp() take up 23.8% of execution time • Idea: pre-compute values of sqrt() and exp() in an array and recall them from memory when needed • Caution: LUT should not displace too much data from L2 cache
  14. 14. sqrt() with LUT direct LUT LUT with linear interpolation
  15. 15. exp() with LUT LUT with first-order Taylor direct LUT series refinement* e =e + (x − x0 )e + O (x − x0 ) x x0 x0 2
  16. 16. Choice of implementation desired table expected function refinement precision size speedup (doubl sqrt() 10 -4 10,764 none +118% es) exp() 10-8 6,836 Taylor +151% LUT aligned to 128-bits L2 cache = 4 MB = 512K doubles
  17. 17. Speedup from LUT use flags gfortran 4.1.2 ifort 10.0.023 original 29.95(2) s 28.96(2) s with lookup tables 26.89(1) s 25.87(2) s speedup +10.23(2) % +7.22(3) %
  18. 18. Summary of serial improvements Improvement gfortran 4.1.2 ifort 10.0.023 Best compiler flags +3.62(3) % +20.22(1) % L2 cache miss +8.44(2) % +0.03(1) % reduction Lookup tables +10.23(1) % +7.22(2) % 23.91(3) s 26.86(2) s Total +20.17(4) % +26.00(2) %
  19. 19. Parallelization targets for each time step N=6 >98% Initialize Remove Move one model and unphysical Flush I/O End time step parameters motions Text O(N) O(N2) Update Calculate Update Calculate & record Enforce Enforce state potential energy state kinetic energy and temp. & temp. & by t/2 and forces by t/2 properties pressure pressure >59% <31% O(N) O(N ) 2 Add up all Calculate Calculate Calculate Calculate Calculate ... compo- bond angle dihedral dispersion charge nents interactions interactions interactions interactions interactions 9% 12% 8% 37% 26%
  20. 20. Parallelization strategy Calculate potential energy omp sections and forces 100% omp section 50% omp section 50% Add up all Calculate Calculate Calculate Calculate Calculate ... compo- charge angle dihedral dispersion bond nents interactions interactions interactions interactions interactions 50% 16% 2% 12% 11% omp parallel do omp parallel do omp parallel do omp parallel do omp parallel do
  21. 21. Parallelization results gfortran 4.1.2 35 N=6 N=1000 Ideal 30 Execution time/s 25 20 15 10 # cores 5 0.5 1 1.5 2 2.5 3 3.5 4 4.5
  22. 22. Summary • Free software can sometimes be better than non-free software • L2 cache misses can significantly degrade performance • Lookup tables are an effective tradeoff between speed and memory vs. precision • Simple OpenMP parallelization is effective for small numbers of processors

×