Connected Component Labeling on Intel Xeon Phi Coprocessors – Parallelization and Vectorization

ZIB uses Xeon Phi to implement its Connected Component Labeling strategy #ISC13 #HPC


  1. Florian Wende, Zuse-Institute Berlin
     Connected Component Labeling on Xeon Phi: Parallelization & Vectorization
  2. Connected Component Labeling
     Suppose we are given the following image . . .
     wende@zib.de, Connected Component Labeling on Xeon Phi, ISC13, Leipzig
  3. Connected Component Labeling
     . . . and we are to assign unique labels to different connected regions!
  4. Connected Component Labeling
     . . . In parallel?
     • Computer Vision: detect connected regions in images
     • Computational Physics: cluster algorithms for the Ising model
     • Percolation Theory
     How to achieve the labeling? . . .
  5. Connected Component Labeling - Strategy
     1. Labeling algorithm
     2. Parallelization
        a. Parallel implementation on CPU
        b. Run the CPU code on the Xeon Phi
        c. Adapt the code for the Xeon Phi
     3. Vectorization (SIMD)
        d. Leave it to the compiler (auto-vectorization)
        e. SIMD intrinsic functions
     Xeon Phi: 512-bit SIMD unit for 16 x 32-bit words
  6. Connected Component Labeling - Algorithm
     • Breadth/depth-first search algorithm, multi-pass algorithms
     • Hoshen-Kopelman algorithm
     • Cluster self-labeling algorithm by Coddington and Baillie
        1. Assign a unique label to each pixel of the image
        2. For each pixel, consider its adjacent connected pixels in positive 1-, 2-, . . . direction and set the respective labels to the minimum value each
        3. If for all pixels the minimum operation is the identity function: Finished! Otherwise: continue with step 2
     CPU: Hoshen-Kopelman
     Xeon Phi: Hoshen-Kopelman vs. cluster self-labeling
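The three self-labeling steps on slide 6 can be sketched in scalar C++ as follows. This is a minimal illustration, not the presenter's code: the function name `selfLabel`, the row-major image layout, and the choice of "equal color" as the connectivity criterion in two directions are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: cluster self-labeling on a row-major image.
// Assumption: two pixels are connected if they are direct neighbors
// in the 1-direction (x) or 2-direction (y) and have the same color.
std::vector<int> selfLabel(const std::vector<int>& color,
                           std::size_t width, std::size_t height) {
    assert(color.size() == width * height);

    // Step 1: assign a unique label (the array index) to each pixel.
    std::vector<int> label(color.size());
    for (std::size_t i = 0; i < label.size(); ++i) label[i] = static_cast<int>(i);

    // Step 3: repeat until the minimum operation changes nothing.
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t y = 0; y < height; ++y) {
            for (std::size_t x = 0; x < width; ++x) {
                const std::size_t i = y * width + x;
                // Step 2: neighbors in positive 1- and 2-direction.
                const std::size_t neighbor[2] = {i + 1, i + width};
                const bool inside[2] = {x + 1 < width, y + 1 < height};
                for (int d = 0; d < 2; ++d) {
                    const std::size_t j = neighbor[d];
                    if (inside[d] && color[i] == color[j] && label[i] != label[j]) {
                        label[i] = label[j] = std::min(label[i], label[j]);
                        changed = true;
                    }
                }
            }
        }
    }
    return label;
}
```

On termination every connected region carries the smallest array index it contains, so labels are unique per region by construction.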
  7. Connected Comp. Labeling - Parallelization
     Partition the image into equal-sized sub-images, and label them independently using multiple threads
  8. Connected Comp. Labeling - Parallelization
     • Unique labels across different sub-images
     • Connected regions that extend over multiple sub-images are merged after the labeling using atomic primitives
     (Figure: the image partitioned into sub-images assigned to Thread 0 through Thread 7.)
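A rough sketch of the partitioning idea on slides 7 and 8, under stated assumptions: the image is split into horizontal strips (the slide's figure may use a different tiling), each strip is labeled by one OpenMP loop iteration, and labels start as global array indices so they stay unique across sub-images. The name `labelStrips` is hypothetical. Without OpenMP enabled, the pragma is ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: label each horizontal strip independently.
// Labels are global array indices, so no two strips share a label.
std::vector<int> labelStrips(const std::vector<int>& color, std::size_t width,
                             std::size_t height, int strips) {
    assert(color.size() == width * height && strips > 0);
    std::vector<int> label(color.size());
    for (std::size_t i = 0; i < label.size(); ++i) label[i] = static_cast<int>(i);
    const std::size_t n = static_cast<std::size_t>(strips);

    #pragma omp parallel for
    for (int s = 0; s < strips; ++s) {
        const std::size_t k = static_cast<std::size_t>(s);
        const std::size_t yBegin = height * k / n;
        const std::size_t yEnd = height * (k + 1) / n;
        bool changed = true;
        while (changed) {  // self-labeling restricted to rows [yBegin, yEnd)
            changed = false;
            for (std::size_t y = yBegin; y < yEnd; ++y) {
                for (std::size_t x = 0; x < width; ++x) {
                    const std::size_t i = y * width + x;
                    const std::size_t neighbor[2] = {i + 1, i + width};
                    // The 2-direction neighbor must stay inside this strip.
                    const bool inside[2] = {x + 1 < width, y + 1 < yEnd};
                    for (int d = 0; d < 2; ++d) {
                        const std::size_t j = neighbor[d];
                        if (inside[d] && color[i] == color[j] && label[i] != label[j]) {
                            label[i] = label[j] = std::min(label[i], label[j]);
                            changed = true;
                        }
                    }
                }
            }
        }
    }
    return label;
}
```

Each iteration writes only to its own rows of `label`, so no synchronization is needed during this phase; regions that cross a strip boundary still carry different labels afterwards and have to be merged in a separate step.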
  9. Connected Comp. Labeling - Vectorization
     Example: self-labeling within the sub-image of thread 2
     • Process multiple data simultaneously using SIMD instructions
  10. Step 1: Initialize labeling (array index)
  11. Step 2: Load row[0] into reg0, and create a mask for adjacent entries in positive 1-direction: 1 if equal-colored, 0 otherwise
  12. Step 3: Overlap each element in reg0 with its adjacent element in positive 1-direction, and write the result to reg1
  13. Step 4: Determine the pairwise minimum of the entries in reg0 and reg1 using the mask, and write the result to reg1
  14. Step 5: Write back entries in reg1 to row[0] using the mask
  15. Step 6: Shift all elements in reg1 one position in positive 1-direction, shifting in the 0-th element, and write the result to reg1
  16. Step 7: Shift all bits in mask one position up, and write the pairwise minimum of the entries in row[0] and reg1 to row[0] using the shifted mask
  17. Step 8: Did labels change?
  18. Connected Comp. Labeling - Vectorization
     Result of the operations up to now . . .
     Set adjacent connected elements in row[0] to the pairwise minimum value each
     (Figure: labels before and after, with the 1-direction and 2-direction marked.)
     Repeat the procedure for the 2-direction.
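Steps 2 to 8 of the vectorized walkthrough for the 1-direction can be emulated lane by lane in plain C++. The sketch below deliberately uses ordinary loops over 16 lanes instead of real `_mm512_` intrinsics, so it runs on any machine; the names `kLanes`, `Reg`, and `sweepRow` are hypothetical, and the exact register operations in the original code may differ.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;            // 512-bit register / 32-bit words
using Reg = std::array<int, kLanes>;  // stand-in for one SIMD register

// One 1-direction sweep over a 16-element row. Returns true if labels changed.
bool sweepRow(const Reg& color, Reg& row) {
    // Step 2: load the row and build the mask (bit i set if lanes i and
    // i+1 are equal-colored).
    const Reg reg0 = row;
    std::uint16_t mask = 0;
    for (int i = 0; i + 1 < kLanes; ++i)
        if (color[i] == color[i + 1]) mask |= static_cast<std::uint16_t>(1u << i);

    // Step 3: overlap each element with its neighbor in positive 1-direction.
    Reg reg1 = reg0;
    for (int i = 0; i + 1 < kLanes; ++i) reg1[i] = reg0[i + 1];

    // Step 4: masked pairwise minimum of reg0 and reg1, result in reg1.
    for (int i = 0; i < kLanes; ++i)
        if ((mask >> i) & 1) reg1[i] = std::min(reg0[i], reg1[i]);

    // Step 5: masked write-back of reg1 to the row.
    for (int i = 0; i < kLanes; ++i)
        if ((mask >> i) & 1) row[i] = reg1[i];

    // Step 6: shift reg1 one lane in positive 1-direction (lane 0 is kept).
    for (int i = kLanes - 1; i > 0; --i) reg1[i] = reg1[i - 1];

    // Step 7: shift the mask up by one bit, then masked minimum into the row.
    const std::uint16_t shifted = static_cast<std::uint16_t>(mask << 1);
    for (int i = 0; i < kLanes; ++i)
        if ((shifted >> i) & 1) row[i] = std::min(row[i], reg1[i]);

    // Step 8: did labels change?
    return row != reg0;
}
```

After one sweep, both elements of every equal-colored adjacent pair hold at most their pairwise minimum. Calling `sweepRow` repeatedly until it returns `false` propagates the smallest label through each run of equal-colored elements.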
  19. Connected Comp. Labeling - Vectorization
     Repeat the procedure for all other rows as long as labels change . . .
     (Figure: labels before and after.)
     Now: merge labels across different sub-images using atomics!
     Finished!
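The slides do not show how the atomic merge is implemented. One common way to do it, shown here purely as an assumption, is to treat the label array as a forest of parent links and to hook the larger root onto the smaller one with a compare-and-swap. The names `AtomicLabels`, `findRoot`, `mergeLabels`, and `mergeBoundary` are hypothetical, and only a single horizontal boundary is handled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using AtomicLabels = std::vector<std::atomic<int>>;

// Follow label links until a pixel that labels itself (a root) is reached.
int findRoot(const AtomicLabels& label, int i) {
    for (int p = label[i].load(); p != i; p = label[i].load()) i = p;
    return i;
}

// Merge the regions of pixels a and b: link the larger root to the smaller.
void mergeLabels(AtomicLabels& label, int a, int b) {
    for (;;) {
        a = findRoot(label, a);
        b = findRoot(label, b);
        if (a == b) return;
        if (a > b) std::swap(a, b);
        int expected = b;  // succeeds only if b is still a root
        if (label[b].compare_exchange_strong(expected, a)) return;
    }
}

// Assumption: `local` holds per-sub-image labels where each label is the
// index of a root pixel; boundaryRow is the first row of the lower sub-image.
std::vector<int> mergeBoundary(const std::vector<int>& color,
                               const std::vector<int>& local,
                               std::size_t width, std::size_t boundaryRow) {
    assert(color.size() == local.size());
    assert(boundaryRow >= 1 && boundaryRow * width < color.size());
    AtomicLabels label(local.size());
    for (std::size_t i = 0; i < local.size(); ++i) label[i].store(local[i]);

    #pragma omp parallel for
    for (int x = 0; x < static_cast<int>(width); ++x) {
        const int below = static_cast<int>(boundaryRow * width) + x;
        const int above = below - static_cast<int>(width);
        if (color[above] == color[below]) mergeLabels(label, above, below);
    }

    // Flatten: every pixel takes the label of its root.
    std::vector<int> merged(local.size());
    for (std::size_t i = 0; i < merged.size(); ++i)
        merged[i] = findRoot(label, static_cast<int>(i));
    return merged;
}
```

Because links always point from a larger index to a smaller one, no cycles can form, and a failed compare-and-swap simply means another thread merged first, so the loop retries with fresh roots.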
  20. Connected Comp. Labeling - Benchmark
     CPU: Xeon E5-2670, 8 cores + 2-way Hyper-Threading @ 2.6 GHz
     • Hoshen-Kopelman algorithm + atomics for label merging
     • Vectorization was left to the compiler: there are no masked SIMD intrinsics!
     Xeon Phi: 60 cores + 4-way Hyper-Threading @ 1.1 GHz
     • Hoshen-Kopelman vs. cluster self-labeling + atomics for label merging
     • Vectorization by means of _mm512_[mask]_XXX() intrinsics
     Parallelization by means of OpenMP: #pragma omp parallel {...}
     Programming effort: approx. 2-3 days for the CPU code (incl. optimization), less than 1 day for the Xeon Phi code (based on CPU code)
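For comparison with self-labeling, here is a compact sketch in the spirit of the Hoshen-Kopelman algorithm named on the slide: a single raster scan that unites each pixel with its already-visited left and upper neighbors, followed by a pass that resolves every pixel to its representative. It is a simplified variant, not the benchmarked code: connectivity is assumed to mean "equal color", the initial label is the pixel index rather than a running counter, and the names `findRoot` and `hoshenKopelman` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Follow parent links to the representative label (with path halving).
int findRoot(std::vector<int>& parent, int i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// Hypothetical helper: raster-scan labeling with a union-find label table.
std::vector<int> hoshenKopelman(const std::vector<int>& color,
                                std::size_t width, std::size_t height) {
    assert(color.size() == width * height);
    std::vector<int> parent(color.size());
    std::iota(parent.begin(), parent.end(), 0);
    const int w = static_cast<int>(width);

    // Pass 1: unite each pixel with equal-colored left and upper neighbors.
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const int i = static_cast<int>(y * width + x);
            const int neighbor[2] = {i - 1, i - w};
            const bool inside[2] = {x > 0, y > 0};
            for (int d = 0; d < 2; ++d) {
                if (!inside[d] || color[i] != color[neighbor[d]]) continue;
                const int a = findRoot(parent, i);
                const int b = findRoot(parent, neighbor[d]);
                if (a < b) parent[b] = a; else parent[a] = b;  // smaller index wins
            }
        }
    }

    // Pass 2: replace every entry by its representative label.
    for (std::size_t i = 0; i < parent.size(); ++i)
        parent[i] = findRoot(parent, static_cast<int>(i));
    return parent;
}
```

Unlike self-labeling, this needs only two passes over the image regardless of region shape, but the data-dependent pointer chasing in `findRoot` is harder to express with SIMD instructions.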
  21. Connected Comp. Labeling - Benchmark
     CPU: Intel Xeon E5-2670, 8 cores + 2-way Hyper-Threading @ 2.6 GHz
     Xeon Phi: 60 cores + 4-way Hyper-Threading @ 1.1 GHz
     Application: Swendsen-Wang cluster algorithm for the 2D Ising model
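In the Swendsen-Wang algorithm, the connectivity that the labeling step works on comes from randomly activated bonds: neighboring spins of equal sign are joined with probability p = 1 - exp(-2 * beta * J) for the ferromagnetic Ising model. The sketch below shows only this bond-drawing step; the type `Bonds`, the function `drawBonds`, and the use of open rather than periodic boundaries are assumptions for illustration and are not taken from the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container for the activated bonds of a 2D lattice.
struct Bonds {
    std::vector<char> right;  // bond between (x, y) and (x + 1, y)
    std::vector<char> down;   // bond between (x, y) and (x, y + 1)
};

// Draw Swendsen-Wang bonds for spins of value +1 / -1 stored row-major.
// Assumptions: open boundaries, coupling J and inverse temperature beta
// with beta * J >= 0.
Bonds drawBonds(const std::vector<int>& spin, std::size_t width, std::size_t height,
                double beta, double J, std::mt19937& rng) {
    assert(spin.size() == width * height && beta * J >= 0.0);
    std::bernoulli_distribution activate(1.0 - std::exp(-2.0 * beta * J));
    Bonds bonds{std::vector<char>(spin.size(), 0), std::vector<char>(spin.size(), 0)};
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t i = y * width + x;
            // Only equal spins can be joined, and then only with probability p.
            if (x + 1 < width && spin[i] == spin[i + 1]) bonds.right[i] = activate(rng);
            if (y + 1 < height && spin[i] == spin[i + width]) bonds.down[i] = activate(rng);
        }
    }
    return bonds;
}
```

Connected component labeling then runs on the lattice with "joined by an active bond" as the connectivity criterion, and each resulting cluster is flipped as a whole with probability 1/2.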
  23. Acknowledgement
     Work partially funded by BMBF Grant No. 01IH11004G
     Dr. Thomas Steinke, Zuse-Institute Berlin (ZIB)
     Dr. Michael Klemm, Intel GmbH, Germany
  24. References
     [1] C. F. Baillie and P. D. Coddington. Cluster Identification Algorithms for Spin Models – Sequential and Parallel, 1991.
     [2] J. Hoshen and R. Kopelman. Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm. Phys. Rev. B, 14:3438–3445, 1976.
     [3] R. H. Swendsen and J.-S. Wang. Nonuniversal Critical Dynamics in Monte Carlo Simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.
     [4] Intel Corp. Intel Xeon Phi Coprocessor 5110P, Product Brief, 2012.
