Intel® Xeon® Phi Coprocessor High Performance Programming

  1. Intel® Xeon® Phi Coprocessor High Performance Programming: Parallelizing a Simple Image Blurring Algorithm
     Brian Gesiak, April 16th, 2014
     Research Student, The University of Tokyo (@modocache)
  2. Today
     • Image blurring with a 9-point stencil algorithm
     • Comparing performance
       • Intel® Xeon® Dual Processor
       • Intel® Xeon® Phi Coprocessor
     • Iteratively improving performance
       • Worst: completely serial
       • Better: adding loop vectorization
       • Best: supporting multiple threads
     • Further optimizations
       • Padding arrays for improved cache performance
       • Read-less writes, i.e. streaming stores
       • Using huge memory pages
  3. Stencil Algorithms: A 9-Point Stencil on a 2D Matrix
  5. Stencil Algorithms: A 9-Point Stencil on a 2D Matrix

     typedef double real;

     typedef struct {
         real center;    // weight applied to the pixel itself
         real next;      // weight applied to the N, S, E, W neighbors
         real diagonal;  // weight applied to the NW, NE, SW, SE neighbors
     } weight_t;
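     To make the role of each weight concrete, here is a minimal sketch, not from the slides,
     of how one interior pixel could be blurred with these weights; the row-major indexing and
     the helper name blur_pixel are assumptions for illustration.

     /* Hypothetical helper: weighted 9-point average of the interior pixel at (x, y).
      * Assumes row-major storage with one `real` per pixel and
      * 1 <= x < width - 1, 1 <= y < height - 1 (the border is left untouched). */
     static real blur_pixel(const real *fin, int width, int x, int y, weight_t w)
     {
         int center = y * width + x;
         int north  = center - width;
         int south  = center + width;

         return w.center   *  fin[center]
              + w.next     * (fin[north]    + fin[south]
                            + fin[center-1] + fin[center+1])
              + w.diagonal * (fin[north-1]  + fin[north+1]
                            + fin[south-1]  + fin[south+1]);
     }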
  9. Image Blurring: Applying a 9-Point Stencil to a Bitmap
  12. Image Blurring: The Halo Effect (the loops skip the one-pixel border of the image)
  13. Sample Application
      • Apply a 9-point stencil to a 5,900 x 10,000 px image
      • Apply the stencil 1,000 times
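      A rough back-of-the-envelope check (my arithmetic, not from the slides): each interior
      pixel update costs 9 multiplies and 8 adds, i.e. 17 floating-point operations, and one
      pass touches roughly 5,898 x 9,998 ≈ 59 million interior pixels, so 1,000 passes amount
      to about 17 × 5.9 × 10^10 ≈ 1.0 × 10^12 floating-point operations. This is consistent
      with the wall-time and MegaFLOPS figures reported in the comparisons below.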
  16. Comparing Processors: Xeon® Dual Processor vs. Xeon® Phi Coprocessor

      Processor                      | Clock Frequency | Number of Cores | Memory Size/Type | Peak DP/SP FLOPs          | Peak Memory Bandwidth
      Intel® Xeon® Dual Processor    | 2.6 GHz         | 16 (8 x 2 CPUs) | 63 GB / DDR3     | 345.6 / 691.2 GigaFLOP/s  | 85.3 GB/s
      Intel® Xeon® Phi Coprocessor   | 1.091 GHz       | 61              | 8 GB / GDDR5     | 1.065 / 2.130 TeraFLOP/s  | 352 GB/s
  19. 1st Comparison: Serial Execution
  20. 1st Comparison: Serial Execution

      void stencil_9pt(real *fin, real *fout, int width, int height,
                       weight_t weight, int count) {
        for (int i = 0; i < count; ++i) {
          for (int y = 1; y < height - 1; ++y) {
            // ...calculate center, east, northwest, etc.
            for (int x = 1; x < width - 1; ++x) {
              fout[center] = weight.diagonal * fin[northwest] +
                             weight.next * fin[west] +
                             // ...add weighted, adjacent pixels
                             weight.center * fin[center];
              // ...increment locations
              ++center; ++north; ++northeast;
            }
          }
          // Swap buffers for next iteration
          real *ftmp = fin;
          fin = fout;
          fout = ftmp;
        }
      }

      The compiler reports an assumed vector dependency on the inner loop, so it is not vectorized.
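      The slides elide the index setup; a plausible reconstruction for this unpadded,
      row-major layout (my assumption, mirroring the padded version shown later) is:

      // Assumed reconstruction of the elided "...calculate center, east, northwest, etc."
      // step, inside the y loop, for an unpadded row of `width` reals.
      int center    = y * width + 1;   // first interior pixel of row y
      int north     = center - width;
      int south     = center + width;
      int east      = center + 1;
      int west      = center - 1;
      int northwest = north - 1;
      int northeast = north + 1;
      int southwest = south - 1;
      int southeast = south + 1;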
  30. 1st Comparison: Serial Execution Results

      Processor                      | Elapsed Wall Time               | MegaFLOPS
      Intel® Xeon® Dual Processor    | 244.178 seconds (4 minutes)     | 4,107.658
      Intel® Xeon® Phi Coprocessor   | 2,838.342 seconds (47.3 minutes)| 353.375

      Build for the host:        $ icc -openmp -O3 stencil.c -o stencil
      Build for the coprocessor: $ icc -openmp -mmic -O3 stencil.c -o stencil_phi

      Running serially, the Dual Processor is about 11 times faster than the Phi.
  34. 2nd Comparison: Vectorization
      Ignoring Assumed Vector Dependencies

      for (int i = 0; i < count; ++i) {
        for (int y = 1; y < height - 1; ++y) {
          // ...calculate center, east, northwest, etc.
          #pragma ivdep
          for (int x = 1; x < width - 1; ++x) {
            fout[center] = weight.diagonal * fin[northwest] +
                           weight.next * fin[west] +
                           // ...add weighted, adjacent pixels
                           weight.center * fin[center];
            // ...increment locations
            ++center; ++north; ++northeast;
          }
        }
        // ...
      }
  36. ivdep: Tells the compiler to ignore assumed dependencies
      • In our program, the compiler cannot determine whether the two pointers refer to the
        same block of memory, so it assumes they do.
      • The ivdep pragma negates this assumption.
      • Proven dependencies may not be ignored.
      Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
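      As an aside not covered in the slides, the same assumed dependency can often be removed
      without a pragma by promising the compiler that the buffers never alias, e.g. with C99
      restrict. A sketch of the signature only, under that assumption (stencil_9pt_pass is a
      hypothetical single-pass variant, not the author's function):

      /* A single stencil pass whose restrict-qualified parameters tell the compiler that
       * the input and output buffers never alias, so the inner loop can vectorize without
       * #pragma ivdep. The caller would invoke it `count` times, swapping buffers between
       * passes. */
      static void stencil_9pt_pass(const real * restrict fin,
                                   real * restrict fout,
                                   int width, int height, weight_t weight);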
  40. 2nd Comparison: Vectorization Results

      Processor                      | Elapsed Wall Time              | MegaFLOPS
      Intel® Xeon® Dual Processor    | 186.585 seconds (3.1 minutes)  | 5,375.572
      Intel® Xeon® Phi Coprocessor   | 623.302 seconds (10.3 minutes) | 1,609.171

      $ icc -openmp -O3 stencil.c -o stencil            (1.3 times faster than serial)
      $ icc -openmp -mmic -O3 stencil.c -o stencil_phi  (4.5 times faster than serial)

      The Dual Processor is now only 4 times faster than the Phi.
  44. 3rd Comparison: Multithreading
      Work Division Using Parallel For Loops
  45. 3rd Comparison: Multithreading
      Work Division Using Parallel For Loops

      for (int i = 0; i < count; ++i) {
        #pragma omp parallel for
        for (int y = 1; y < height - 1; ++y) {
          // ...calculate center, east, northwest, etc.
          #pragma ivdep
          for (int x = 1; x < width - 1; ++x) {
            fout[center] = weight.diagonal * fin[northwest] +
                           weight.next * fin[west] +
                           // ...add weighted, adjacent pixels
                           weight.center * fin[center];
            // ...increment locations
            ++center; ++north; ++northeast;
          }
        }
        // ...
      }

      The parallel for pragma divides the rows of each pass among the OpenMP threads; the
      passes themselves remain sequential, since each one consumes the previous pass's output.
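      The thread counts compared below are controlled by the OpenMP runtime; a minimal sketch,
      not from the slides, of fixing the count programmatically (the helper name is mine):

      #include <omp.h>

      /* Pick the OpenMP thread count before the stencil's parallel region runs.
       * Exporting OMP_NUM_THREADS (and, on the coprocessor, an affinity setting
       * such as KMP_AFFINITY) achieves the same effect without recompiling. */
      static void set_thread_count(int threads)
      {
          omp_set_num_threads(threads);   /* e.g. 122 = 2 threads per core on 61 cores */
      }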
  47. 3rd Comparison: Multithreading Results

      Processor                      | Elapsed Wall Time (seconds) | MegaFLOPS
      Xeon® Dual Proc., 16 Threads   | 43.862                      | 22,867.185
      Xeon® Dual Proc., 32 Threads   | 46.247                      | 21,688.103
      Xeon® Phi, 61 Threads          | 11.366                      | 88,246.452
      Xeon® Phi, 122 Threads         | 8.772                       | 114,338.399
      Xeon® Phi, 183 Threads         | 10.546                      | 94,946.364
      Xeon® Phi, 244 Threads         | 12.696                      | 78,999.440

      Multithreading speeds up the Dual Processor about 4x over its vectorized version and the
      Phi about 71x; at 122 threads the Phi is now 5 times faster than the Dual Processor.
  52. Further Optimizations
      1. Padded arrays
      2. Streaming stores
      3. Huge memory pages
  56. Optimization 1: Padded Arrays (Optimizing Cache Access)
      • We can add extra, unused data to the end of each row
      • Doing so aligns heavily used memory addresses for efficient cache line access
  59. Optimization 1: Padded Arrays

      The original allocation:

      static const size_t kPaddingSize = 64;

      int main(int argc, const char **argv) {
        int height = 10000;
        int width = 5900;
        int count = 1000;

        size_t size = sizeof(real) * width * height;
        real *fin = (real *)malloc(size);
        real *fout = (real *)malloc(size);

        weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
        stencil_9pt(fin, fout, width, height, weight, count);

        // ...save results

        free(fin);
        free(fout);
        return 0;
      }

      The padded version replaces the highlighted lines (overlays as shown on the slides):

      ((5900 * sizeof(real) + 63) / 64) * (64 / sizeof(real));
      sizeof(real) * width * kPaddingSize * height;
      (real *)_mm_malloc(size, kPaddingSize);
      (real *)_mm_malloc(size, kPaddingSize);
      _mm_free(fin); _mm_free(fout);
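      For concreteness (my arithmetic, assuming real is an 8-byte double), the padding
      expression rounds a row of 5,900 reals up to a whole number of 64-byte cache lines:

      ((5900 * 8 + 63) / 64) * (64 / 8) = 738 * 8 = 5904

      so each padded row holds 5,904 elements (47,232 bytes, exactly 738 cache lines) instead
      of 5,900 elements (47,200 bytes).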
  70. Optimization 1: Padded Arrays
      Accommodating for Padding

      #pragma omp parallel for
      for (int y = 1; y < height - 1; ++y) {
        // ...calculate center, east, northwest, etc.
        int center = 1 + y * kPaddingSize + 1;
        int north = center - kPaddingSize;
        int south = center + kPaddingSize;
        int east = center + 1;
        int west = center - 1;
        int northwest = north - 1;
        int northeast = north + 1;
        int southwest = south - 1;
        int southeast = south + 1;

        #pragma ivdep
        // ...
      }
  73. Optimization 1: Padded Arrays Results

      Processor                | Elapsed Wall Time (seconds) | MegaFLOPS
      Xeon® Phi, 61 Threads    | 11.644                      | 86,138.371
      Xeon® Phi, 122 Threads   | 8.973                       | 111,774.803
      Xeon® Phi, 183 Threads   | 10.326                      | 97,132.546
      Xeon® Phi, 244 Threads   | 11.469                      | 87,452.707
  76. Optimization 2: Streaming Stores (Read-less Writes)
      • By default, Xeon® Phi processors read the value at an address before writing to that
        address.
      • When calculating the weighted average for a pixel, our program never reads the old
        value at the output address before overwriting it, so enabling streaming stores should
        result in better performance.
  79. Optimization 2: Streaming Stores
      Read-less Writes with #pragma vector nontemporal

      for (int i = 0; i < count; ++i) {
        for (int y = 1; y < height - 1; ++y) {
          // ...calculate center, east, northwest, etc.
          #pragma ivdep
          #pragma vector nontemporal
          for (int x = 1; x < width - 1; ++x) {
            fout[center] = weight.diagonal * fin[northwest] +
                           weight.next * fin[west] +
                           // ...add weighted, adjacent pixels
                           weight.center * fin[center];
            // ...increment locations
            ++center; ++north; ++northeast;
          }
        }
        // ...
      }
  81. Optimization 2: Streaming Stores Results

      Processor                | Elapsed Wall Time (seconds) | MegaFLOPS
      Xeon® Phi, 61 Threads    | 13.588                      | 73,978.915
      Xeon® Phi, 122 Threads   | 8.491                       | 111,774.803
      Xeon® Phi, 183 Threads   | 8.663                       | 115,773.405
      Xeon® Phi, 244 Threads   | 9.507                       | 105,498.781
  84. Optimization 3: Huge Memory Pages
      • Memory pages map virtual memory used by our program to physical memory
      • Mappings are stored in a translation look-aside buffer (TLB)
      • Mappings are traversed in a "page table walk"
      • malloc and _mm_malloc use 4 KB memory pages by default
      • By increasing the size of each memory page, traversal time may be reduced
  90. Optimization 3: Huge Memory Pages

      size_t size = sizeof(real) * width * kPaddingSize * height;
      real *fin = (real *)_mm_malloc(size, kPaddingSize);

      becomes

      size_t size = sizeof(real) * width * kPaddingSize * height;
      real *fin = (real *)mmap(0, size, PROT_READ | PROT_WRITE,
                               MAP_ANON | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
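      A practical note not on the slides: MAP_HUGETLB only succeeds when the system has huge
      pages reserved, so a sketch like the following, with a fallback to _mm_malloc and a
      matching release path, may be safer in practice. The helper names are my own.

      #include <malloc.h>      /* _mm_malloc / _mm_free with icc */
      #include <sys/mman.h>    /* mmap / munmap */

      /* Hypothetical helpers: try a huge-page mapping first, fall back to an
       * aligned heap allocation if the kernel has no huge pages available. */
      static real *alloc_field(size_t size, int *used_huge_pages)
      {
          void *p = mmap(0, size, PROT_READ | PROT_WRITE,
                         MAP_ANON | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
          if (p != MAP_FAILED) {
              *used_huge_pages = 1;
              return (real *)p;
          }
          *used_huge_pages = 0;
          return (real *)_mm_malloc(size, 64);   /* 64-byte (cache line) alignment */
      }

      static void free_field(real *p, size_t size, int used_huge_pages)
      {
          if (used_huge_pages)
              munmap(p, size);   /* mmap'd memory is released with munmap, not free */
          else
              _mm_free(p);
      }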
  93. Optimization 3: Huge Memory Pages Results

      Processor                | Elapsed Wall Time (seconds) | MegaFLOPS
      Xeon® Phi, 61 Threads    | 14.486                      | 69,239.365
      Xeon® Phi, 122 Threads   | 8.226                       | 121,924.389
      Xeon® Phi, 183 Threads   | 8.749                       | 114,636.799
      Xeon® Phi, 244 Threads   | 9.466                       | 105,955.358
  96. Takeaways
      • The key to achieving high performance is to use loop vectorization and multiple threads
      • Completely serial programs run faster on standard processors
      • Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
      • Other optimizations may be used to tweak performance:
        • Data padding
        • Streaming stores
        • Huge memory pages
  97. Sources and Additional Resources
      • Today's slides: http://modocache.io/xeon-phi-high-performance
      • Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders): http://www.amazon.com/dp/0124104142
      • Intel documentation
        • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
        • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm