Transcript of "Intel® Xeon® Phi Coprocessor High Performance Programming"

Intel® Xeon® Phi Coprocessor High Performance Programming
Parallelizing a Simple Image Blurring Algorithm
Brian Gesiak, April 16th, 2014
Research Student, The University of Tokyo (@modocache)
Today
• Image blurring with a 9-point stencil algorithm
• Comparing performance
  • Intel® Xeon® Dual Processor
  • Intel® Xeon® Phi Coprocessor
• Iteratively improving performance
  • Worst: Completely serial
  • Better: Adding loop vectorization
  • Best: Supporting multiple threads
• Further optimizations
  • Padding arrays for improved cache performance
  • Read-less writes, i.e. streaming stores
  • Using huge memory pages
Stencil Algorithms
A 9-Point Stencil on a 2D Matrix

    typedef double real;

    typedef struct {
        real center;
        real next;
        real diagonal;
    } weight_t;

weight.center weights the center pixel, weight.next the four edge-adjacent neighbors, and weight.diagonal the four diagonal neighbors.
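The deck's later code elides the full weighted sum behind a comment. As a sketch of how the three weights combine for one interior pixel of a row-major image (the function and index names here are mine, not from the deck):

    // Minimal sketch: blur one interior pixel (x, y) of a width-by-height
    // image stored row-major in fin, writing the result to fout.
    // Assumes 1 <= x < width-1 and 1 <= y < height-1.
    static inline void blur_pixel(const real *fin, real *fout,
                                  int width, int x, int y, weight_t w) {
        int c = y * width + x;                 // center
        fout[c] = w.diagonal * (fin[c - width - 1] + fin[c - width + 1] +
                                fin[c + width - 1] + fin[c + width + 1]) +
                  w.next     * (fin[c - width] + fin[c + width] +
                                fin[c - 1]     + fin[c + 1]) +
                  w.center   *  fin[c];
    }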
Image Blurring
Applying a 9-Point Stencil to a Bitmap
(The slides show the source bitmap and the blurred result, including the resulting "Halo Effect".)
Sample Application
• Apply a 9-point stencil to a 5,900 x 10,000 px image
• Apply the stencil 1,000 times
Comparing Processors
Xeon® Dual Processor vs. Xeon® Phi Coprocessor

Processor                       Clock Frequency   Number of Cores   Memory Size/Type   Peak DP/SP FLOPs             Peak Memory Bandwidth
Intel® Xeon® Dual Processor     2.6 GHz           16 (8 x 2 CPUs)   63 GB / DDR3       345.6 / 691.2 GigaFLOP/s     85.3 GB/s
Intel® Xeon® Phi Coprocessor    1.091 GHz         61                8 GB / GDDR5       1.065 / 2.130 TeraFLOP/s     352 GB/s
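As a quick sanity check on the coprocessor's peak number above (this arithmetic is mine, not from the deck, and assumes 16 double-precision FLOPs per core per cycle from the 512-bit fused multiply-add vector units):

    #include <stdio.h>

    int main(void) {
        // Peak DP FLOP/s = cores * clock (GHz) * DP FLOPs per core per cycle.
        // 8-wide 512-bit DP vectors with FMA give 16 DP FLOPs/cycle/core
        // (assumption stated above).
        double phi_peak_gflops = 61 * 1.091 * 16;
        printf("Xeon Phi peak DP: %.1f GigaFLOP/s (~1.065 TeraFLOP/s)\n",
               phi_peak_gflops);
        return 0;
    }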
1st Comparison: Serial Execution
    void stencil_9pt(real *fin, real *fout, int width, int height,
                     weight_t weight, int count) {
      for (int i = 0; i < count; ++i) {
        for (int y = 1; y < height - 1; ++y) {
          // ...calculate center, east, northwest, etc.
          for (int x = 1; x < width - 1; ++x) {
            fout[center] = weight.diagonal * fin[northwest] +
                           weight.next * fin[west] +
                           // ...add weighted, adjacent pixels
                           weight.center * fin[center];
            // ...increment locations
            ++center; ++north; ++northeast;
          }
        }

        // Swap buffers for next iteration
        real *ftmp = fin;
        fin = fout;
        fout = ftmp;
      }
    }

(The deck annotates this code with "Assumed vector dependency"; see the ivdep discussion below.)
1st Comparison: Serial Execution Results

Processor                       Elapsed Wall Time                  MegaFLOPS
Intel® Xeon® Dual Processor     244.178 seconds (4 minutes)        4,107.658
Intel® Xeon® Phi Coprocessor    2,838.342 seconds (47.3 minutes)   353.375

Compiled with:
    $ icc -openmp -O3 stencil.c -o stencil
    $ icc -openmp -mmic -O3 stencil.c -o stencil_phi

Dual is 11 times faster than Phi.
2nd Comparison: Vectorization
Ignoring Assumed Vector Dependencies

    for (int i = 0; i < count; ++i) {
      for (int y = 1; y < height - 1; ++y) {
        // ...calculate center, east, northwest, etc.
        #pragma ivdep
        for (int x = 1; x < width - 1; ++x) {
          fout[center] = weight.diagonal * fin[northwest] +
                         weight.next * fin[west] +
                         // ...add weighted, adjacent pixels
                         weight.center * fin[center];
          // ...increment locations
          ++center; ++north; ++northeast;
        }
      }
      // ...
    }
ivdep
Tells the compiler to ignore assumed dependencies.
• In our program, the compiler cannot determine whether the two pointers refer to the same block of memory, so the compiler assumes they do.
• The ivdep pragma negates this assumption.
• Proven dependencies may not be ignored.
Source: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
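A related approach not covered in the deck (my addition): C99 restrict qualifiers make the no-aliasing promise part of the code itself, so the compiler can vectorize without a per-loop pragma. A minimal sketch of a single blur pass, with the caller responsible for the count iterations and the buffer swap exactly as stencil_9pt does:

    // Sketch only: restrict promises the compiler that in and out never
    // alias. Passing overlapping buffers would be undefined behavior,
    // just as a wrong #pragma ivdep can produce wrong results.
    static void blur_pass(const real * restrict in, real * restrict out,
                          int width, int height, weight_t w) {
        for (int y = 1; y < height - 1; ++y) {
            for (int x = 1; x < width - 1; ++x) {
                int c = y * width + x;
                out[c] = w.diagonal * (in[c - width - 1] + in[c - width + 1] +
                                       in[c + width - 1] + in[c + width + 1]) +
                         w.next     * (in[c - width] + in[c + width] +
                                       in[c - 1]     + in[c + 1]) +
                         w.center   *  in[c];
            }
        }
    }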
2nd Comparison: Vectorization Results

Processor                       Elapsed Wall Time                 MegaFLOPS
Intel® Xeon® Dual Processor     186.585 seconds (3.1 minutes)     5,375.572
Intel® Xeon® Phi Coprocessor    623.302 seconds (10.3 minutes)    1,609.171

    $ icc -openmp -O3 stencil.c -o stencil            (Dual: 1.3 times faster than its serial run)
    $ icc -openmp -mmic -O3 stencil.c -o stencil_phi  (Phi: 4.5 times faster than its serial run)

Dual is now only 4 times faster than Phi.
3rd Comparison: Multithreading
Work Division Using Parallel For Loops
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
      for (int y = 1; y < height - 1; ++y) {
        // ...calculate center, east, northwest, etc.
        #pragma ivdep
        for (int x = 1; x < width - 1; ++x) {
          fout[center] = weight.diagonal * fin[northwest] +
                         weight.next * fin[west] +
                         // ...add weighted, adjacent pixels
                         weight.center * fin[center];
          // ...increment locations
          ++center; ++north; ++northeast;
        }
      }
      // ...
    }
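Two notes of mine, not from the deck. First, the count loop carries a dependency (each pass reads the previous pass's output and the buffers are swapped), so the parallel region really belongs on the row loop, which is where the padded-arrays version later in the deck places it. Second, the thread counts in the results that follow can be chosen without recompiling, for example:

    // Minimal sketch: choosing the OpenMP thread count at run time.
    // (The specific counts, 16/32 on the host and 61/122/183/244 on the
    //  coprocessor, match the result tables below.)
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        // Either export OMP_NUM_THREADS=122 before running, or:
        omp_set_num_threads(122);
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }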
3rd Comparison: Multithreading Results

Processor                        Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Dual Proc., 16 Threads     43.862                        22,867.185
Xeon® Dual Proc., 32 Threads     46.247                        21,688.103
Xeon® Phi, 61 Threads            11.366                        88,246.452
Xeon® Phi, 122 Threads           8.772                         114,338.399
Xeon® Phi, 183 Threads           10.546                        94,946.364
Xeon® Phi, 244 Threads           12.696                        78,999.44

Compared with the vectorized runs, the Dual Processor is about 4x faster and the Phi about 71x faster.
Phi is now 5 times faster than the Dual Processor.
Further Optimizations
1. Padded arrays
2. Streaming stores
3. Huge memory pages
Optimization 1: Padded Arrays
Optimizing Cache Access
• We can add extra, unused data to the end of each row
• Doing so aligns heavily used memory addresses for efficient cache line access
Original allocation in main:

    static const size_t kPaddingSize = 64;

    int main(int argc, const char **argv) {
      int height = 10000;
      int width = 5900;
      int count = 1000;

      size_t size = sizeof(real) * width * height;
      real *fin = (real *)malloc(size);
      real *fout = (real *)malloc(size);

      weight_t weight = { .center = 0.99, .next = 0.00125, .diagonal = 0.00125 };
      stencil_9pt(fin, fout, width, height, weight, count);

      // ...save results

      free(fin);
      free(fout);
      return 0;
    }

Padded, aligned replacements shown on the build slides (each fragment replaces the corresponding expression above):

    ((5900*sizeof(real)+63)/64)*(64/sizeof(real));
    sizeof(real) * width * kPaddingSize * height;
    (real *)_mm_malloc(size, kPaddingSize);
    (real *)_mm_malloc(size, kPaddingSize);
    _mm_free(fin);
    _mm_free(fout);
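A note and sketch of mine, not from the deck: with real = double, the padded-width expression works out to ((5900*8+63)/64)*(64/8) = 738*8 = 5904 elements per row, i.e. each row is rounded up to a whole number of 64-byte cache lines. Read that way, it is the padded row width (not the 64-byte alignment constant itself) that scales the allocation size and the row stride. A minimal sketch, with padded_width as my own name:

    #include <xmmintrin.h>   // _mm_malloc, _mm_free
    #include <stdlib.h>

    typedef double real;
    static const size_t kPaddingSize = 64;   // cache-line size in bytes

    int main(void) {
        int height = 10000;
        int width  = 5900;

        // Round each row up to a multiple of 64 bytes: 5900 -> 5904 doubles.
        int padded_width = (int)(((width * sizeof(real) + kPaddingSize - 1)
                                  / kPaddingSize) * (kPaddingSize / sizeof(real)));

        size_t size = sizeof(real) * (size_t)padded_width * height;
        real *fin  = (real *)_mm_malloc(size, kPaddingSize);  // 64-byte aligned
        real *fout = (real *)_mm_malloc(size, kPaddingSize);

        // ...run the stencil using padded_width as the row stride...

        _mm_free(fin);
        _mm_free(fout);
        return 0;
    }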
Accommodating for Padding

    #pragma omp parallel for
    for (int y = 1; y < height - 1; ++y) {

      // ...calculate center, east, northwest, etc.
      int center = 1 + y * kPaddingSize + 1;
      int north = center - kPaddingSize;
      int south = center + kPaddingSize;
      int east = center + 1;
      int west = center - 1;
      int northwest = north - 1;
      int northeast = north + 1;
      int southwest = south - 1;
      int southeast = south + 1;

      #pragma ivdep
      // ...
    }
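As with the allocation, these offsets read most naturally when the constant used as the row stride is the padded row width in elements (5904 in the sketch above) rather than the 64-byte alignment value. A sketch of that reading, with my own names:

    // Sketch only (my reading of the slide): moving one row up or down
    // means moving one full padded row through memory.
    int stride    = padded_width;        // hypothetical name from the earlier sketch
    int center    = y * stride + 1;      // first interior pixel of row y
    int north     = center - stride;
    int south     = center + stride;
    int east      = center + 1;
    int west      = center - 1;
    int northwest = north - 1;
    int northeast = north + 1;
    int southwest = south - 1;
    int southeast = south + 1;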
Optimization 1: Padded Arrays Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     11.644                        86,138.371
Xeon® Phi, 122 Threads    8.973                         111,774.803
Xeon® Phi, 183 Threads    10.326                        97,132.546
Xeon® Phi, 244 Threads    11.469                        87,452.707
Optimization 2: Streaming Stores
Read-less Writes
• By default, Xeon® Phi processors read the value at an address before writing to that address.
• When calculating the weighted average for a pixel in our program, we do not use the original value of that pixel. Therefore, enabling streaming stores should result in better performance.
Read-less Writes with Vector Nontemporal

    for (int i = 0; i < count; ++i) {
      for (int y = 1; y < height - 1; ++y) {
        // ...calculate center, east, northwest, etc.
        #pragma ivdep
        #pragma vector nontemporal
        for (int x = 1; x < width - 1; ++x) {
          fout[center] = weight.diagonal * fin[northwest] +
                         weight.next * fin[west] +
                         // ...add weighted, adjacent pixels
                         weight.center * fin[center];
          // ...increment locations
          ++center; ++north; ++northeast;
        }
      }
      // ...
    }
Optimization 2: Streaming Stores Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     13.588                        73,978.915
Xeon® Phi, 122 Threads    8.491                         111,774.803
Xeon® Phi, 183 Threads    8.663                         115,773.405
Xeon® Phi, 244 Threads    9.507                         105,498.781
Optimization 3: Huge Memory Pages
• Memory pages map virtual memory used by our program to physical memory
• Mappings are stored in a translation look-aside buffer (TLB)
• Mappings are traversed in a "page table walk"
• malloc and _mm_malloc use 4KB memory pages by default
• By increasing the size of each memory page, traversal time may be reduced
    size_t size = sizeof(real) * width * kPaddingSize * height;
    real *fin = (real *)_mm_malloc(size, kPaddingSize);

becomes:

    real *fin = (real *)mmap(0, size, PROT_READ|PROT_WRITE,
                             MAP_ANON|MAP_PRIVATE|MAP_HUGETLB, -1, 0);
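A fuller sketch of mine (not from the deck) with the error handling and cleanup that mmap-based allocation needs; it assumes the kernel has huge pages reserved, otherwise the MAP_HUGETLB mapping fails:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    typedef double real;

    static real *alloc_huge(size_t size) {
        void *p = mmap(0, size, PROT_READ | PROT_WRITE,
                       MAP_ANON | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   // e.g. no huge pages configured
            return NULL;
        }
        return (real *)p;
    }

    int main(void) {
        // 5904 is the padded row width from the earlier (hypothetical) sketch.
        size_t size = sizeof(real) * 5904 * 10000;
        real *fin = alloc_huge(size);
        if (!fin) return EXIT_FAILURE;

        // ...run the stencil...

        munmap(fin, size);   // mmap'd memory is released with munmap, not free
        return 0;
    }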
Optimization 3: Huge Memory Pages Results

Processor                 Elapsed Wall Time (seconds)   MegaFLOPS
Xeon® Phi, 61 Threads     14.486                        69,239.365
Xeon® Phi, 122 Threads    8.226                         121,924.389
Xeon® Phi, 183 Threads    8.749                         114,636.799
Xeon® Phi, 244 Threads    9.466                         105,955.358
Takeaways
• The key to achieving high performance is to use loop vectorization and multiple threads
• Completely serial programs run faster on standard processors
• Only properly designed programs achieve peak performance on an Intel® Xeon® Phi Coprocessor
• Other optimizations may be used to tweak performance
  • Data padding
  • Streaming stores
  • Huge memory pages
Sources and Additional Resources
• Today's slides
  • http://modocache.io/xeon-phi-high-performance
• Intel® Xeon® Phi Coprocessor High Performance Programming (James Jeffers, James Reinders)
  • http://www.amazon.com/dp/0124104142
• Intel Documentation
  • ivdep: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-B25ABCC2-BE6F-4599-AEDF-2434F4676E1B.htm
  • vector: https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_vector.htm