Connected Component Labeling on Intel Xeon Phi Coprocessors – Parallelization and Vectorization
 

ZIB use Xeon Phi to implement their Connected Component Labeling strategy #ISC13 #HPC

    Presentation Transcript

    • Florian Wende, Zuse-Institute Berlin
      Connected Component Labeling on Xeon Phi: Parallelization & Vectorization
    • Connected Component Labeling
      Suppose we are given the following image . . .
      (Slide footer: wende@zib.de, Connected Component Labeling on Xeon Phi, ISC13, Leipzig)
    • Connected Component Labeling
      . . . and we are to assign unique labels to different connected regions! . . . In parallel?
      • Computer Vision: detect connected regions in images
      • Computational Physics: cluster algorithms for the Ising model
      • Percolation Theory
      How to achieve the labeling? . . .
    • Connected Component Labeling - Strategy
      1. Labeling algorithm
      2. Parallelization
         a. Parallel implementation on CPU
         b. Run the CPU code on the Xeon Phi
         c. Adapt the code for the Xeon Phi
      3. Vectorization (SIMD)
         d. Leave it to the compiler (auto-vectorization)
         e. SIMD intrinsic functions
      Xeon Phi: 512-bit SIMD unit for 16 x 32-bit words
    • Connected Component Labeling - Algorithm
      • Breadth/depth-first search algorithm, multi-pass algorithms
      • Hoshen-Kopelman algorithm
      • Cluster self-labeling algorithm by Coddington and Baillie:
        1. Assign a unique label to each pixel of the image
        2. For each pixel, consider its adjacent connected pixels in the positive 1-, 2-, . . . direction and set the respective labels to the pairwise minimum value
        3. If the minimum operation is the identity function for all pixels: Finished! Otherwise: continue with step 2
      CPU: Hoshen-Kopelman
      Xeon Phi: Hoshen-Kopelman vs. cluster self-labeling
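The three self-labeling steps on this slide can be sketched sequentially as follows. This is a minimal illustration, not the presenter's code: it assumes a 2D image stored as a list of lists, treats "connected" as "equal pixel value" in the positive 1- and 2-direction, and uses the hypothetical name `self_label`.

```python
def self_label(image):
    """Label connected regions of equal-valued pixels by iterated minimum."""
    rows, cols = len(image), len(image[0])
    # Step 1: every pixel starts with a unique label (its flattened index).
    labels = [[r * cols + c for c in range(cols)] for r in range(rows)]
    changed = True
    while changed:                      # Step 3: repeat until nothing changes
        changed = False
        for r in range(rows):
            for c in range(cols):
                # Step 2: neighbours in positive 1-direction and 2-direction.
                for dr, dc in ((0, 1), (1, 0)):
                    nr, nc = r + dr, c + dc
                    if nr < rows and nc < cols and image[nr][nc] == image[r][c]:
                        m = min(labels[r][c], labels[nr][nc])
                        if labels[r][c] != m or labels[nr][nc] != m:
                            labels[r][c] = labels[nr][nc] = m
                            changed = True
    return labels


print(self_label([[1, 1, 0],
                  [0, 1, 0],
                  [1, 0, 0]]))
# → [[0, 0, 2], [3, 0, 2], [6, 2, 2]]
```

Each region ends up carrying the smallest initial index among its pixels, which is why the labels stay unique across regions without any extra bookkeeping.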
    • Connected Comp. Labeling - Parallelization
      Partition the image into equal-sized sub-images, and label them independently using multiple threads
      • Unique labels across different sub-images
      • Connected regions that extend over multiple sub-images are merged after the labeling using atomic primitives
      [Figure: image partitioned into eight sub-images, one each for Thread 0 to Thread 7]
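A rough sketch of the partition-then-merge idea, under several assumptions that are not in the slides: the image is cut into horizontal strips rather than the 2D tiles of the figure, Python threads stand in for OpenMP threads, and a sequential union-find pass stands in for the atomic merge primitives. The names `label_strip`, `find` and `label_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def label_strip(image, r0, r1):
    """Self-label rows r0..r1-1 only; global pixel indices keep labels unique."""
    cols = len(image[0])
    labels = {(r, c): r * cols + c for r in range(r0, r1) for c in range(cols)}
    changed = True
    while changed:
        changed = False
        for (r, c) in labels:
            for nb in ((r, c + 1), (r + 1, c)):
                # Neighbours outside this strip are ignored at this stage.
                if nb in labels and image[nb[0]][nb[1]] == image[r][c]:
                    m = min(labels[(r, c)], labels[nb])
                    if labels[(r, c)] != m or labels[nb] != m:
                        labels[(r, c)] = labels[nb] = m
                        changed = True
    return labels


def find(parent, x):
    """Follow parent links to the representative label (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def label_parallel(image, n_strips):
    rows, cols = len(image), len(image[0])
    assert 1 <= n_strips <= rows      # assumption: no empty strips
    bounds = [(i * rows // n_strips, (i + 1) * rows // n_strips)
              for i in range(n_strips)]
    # Label every strip independently, one worker thread per strip.
    with ThreadPoolExecutor(max_workers=n_strips) as pool:
        parts = list(pool.map(lambda b: label_strip(image, b[0], b[1]), bounds))
    labels = {}
    for part in parts:
        labels.update(part)
    # Merge regions that cross a strip boundary (sequential stand-in for atomics).
    parent = {lab: lab for lab in set(labels.values())}
    for _, r1 in bounds[:-1]:
        for c in range(cols):
            if image[r1 - 1][c] == image[r1][c]:
                a = find(parent, labels[(r1 - 1, c)])
                b = find(parent, labels[(r1, c)])
                parent[max(a, b)] = min(a, b)   # smaller label wins
    return [[find(parent, labels[(r, c)]) for c in range(cols)]
            for r in range(rows)]
```

Because each strip starts from global pixel indices, labels are unique across sub-images by construction, and only pixels on strip boundaries need to be inspected during the merge.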
    • Connected Comp. Labeling - Vectorization
      Process multiple data simultaneously using SIMD instructions
      Example: Self-labeling within sub-image of thread 2
      1. Initialize labeling (array index)
      2. Load row[0] into reg0, and create mask for adjacent entries in positive 1-direction: 1 if equal-colored, 0 otherwise
      3. Overlap each element in reg0 with its adjacent element in positive 1-direction, and write the result to reg1
    • Connected Comp. Labeling - Vectorization
      4. Determine the pairwise minimum of the entries in reg0 and reg1 using the mask, and write the result to reg1
      5. Write back entries in reg1 to row[0] using the mask
      6. Shift all elements in reg1 one position in positive 1-direction, shifting in the 0-th element, and write the result to reg1
      7. Shift all bits in mask one position up, and write the pairwise minimum of the entries in row[0] and reg1 to row[0] using the shifted mask
      8. Did labels change?
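Steps 1 to 8 can be traced with a scalar emulation of a 16-lane register. This is only a sketch of the data flow, written with plain Python lists instead of `_mm512` intrinsics. It assumes the row is exactly one register wide, so the last lane has no neighbour in the positive 1-direction and its mask bit is 0. The name `simd_pass` is hypothetical.

```python
LANES = 16  # one 512-bit register holds 16 x 32-bit words


def simd_pass(colors, row):
    """Apply steps 2-8 once to a 16-element label row; return True if it changed."""
    before = list(row)
    # Step 2: load the row and build the mask (1 if lane i and i+1 are equal-colored).
    reg0 = list(row)
    mask = [i + 1 < LANES and colors[i] == colors[i + 1] for i in range(LANES)]
    # Step 3: reg1[i] holds the neighbour of lane i in positive 1-direction.
    reg1 = [reg0[min(i + 1, LANES - 1)] for i in range(LANES)]
    # Step 4: masked pairwise minimum of reg0 and reg1, result in reg1.
    reg1 = [min(a, b) if m else b for a, b, m in zip(reg0, reg1, mask)]
    # Step 5: masked store of reg1 back to the row.
    for i in range(LANES):
        if mask[i]:
            row[i] = reg1[i]
    # Step 6: shift reg1 one lane in positive 1-direction, keeping the 0-th element.
    reg1 = [reg1[0]] + reg1[:-1]
    # Step 7: shift the mask up by one bit and store the masked minimum.
    mask_up = [False] + mask[:-1]
    for i in range(LANES):
        if mask_up[i]:
            row[i] = min(row[i], reg1[i])
    # Step 8: did labels change?
    return row != before


colors = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
labels = list(range(LANES))            # step 1: label = array index
while simd_pass(colors, labels):       # repeat as long as labels change
    pass
print(labels)
# → [0, 0, 0, 3, 3, 5, 6, 6, 6, 6, 10, 10, 12, 13, 13, 13]
```

Each pass only moves a minimum by one lane, so a run of k equal-colored pixels needs on the order of k passes to settle; the mask keeps lanes of different color from exchanging labels.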
    • Connected Comp. Labeling - Vectorization
      Result of the operations up to now . . .
      Set adjacent connected elements in row[0] to the pairwise minimum value each
      [Figure: row[0] before and after, with the 1-direction and 2-direction indicated]
      Repeat the procedure for the 2-direction.
    • Connected Comp. Labeling - Vectorization
      Repeat the procedure for all other rows as long as labels change . . .
      [Figure: sub-image before and after]
      Now: Merge labels across different sub-images using atomics! Finished!
    • Connected Comp. Labeling - Benchmark
      CPU: Xeon E5-2670, 8 cores + 2-way Hyper-Threading @ 2.6 GHz
      • Hoshen-Kopelman algorithm + atomics for label merging
      • Vectorization was left to the compiler: there are no masked SIMD intrinsics!
      Xeon Phi: 60 cores + 4-way Hyper-Threading @ 1.1 GHz
      • Hoshen-Kopelman vs. cluster self-labeling + atomics for label merging
      • Vectorization by means of _mm512_[mask]_XXX() intrinsics
      Parallelization by means of OpenMP: #pragma omp parallel {...}
      Programming effort: approx. 2-3 days for the CPU code (incl. optimization), less than 1 day for the Xeon Phi code (based on CPU code)
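For readers unfamiliar with the Hoshen-Kopelman algorithm named above, here is a textbook-style sketch, not the benchmarked code. It assumes the same "equal pixel value" notion of connectivity as the self-labeling example, scans the image once in raster order, and resolves label equivalences with a small union-find forest. The name `hoshen_kopelman` is hypothetical.

```python
def hoshen_kopelman(image):
    """Raster-scan labeling with union-find over provisional labels."""
    rows, cols = len(image), len(image[0])
    parent = []                          # parent[i] == i marks a root label

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    labels = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = c > 0 and image[r][c - 1] == image[r][c]
            up = r > 0 and image[r - 1][c] == image[r][c]
            if not left and not up:
                parent.append(len(parent))      # open a new cluster
                labels[r][c] = len(parent) - 1
            elif left and up:
                a, b = find(labels[r][c - 1]), find(labels[r - 1][c])
                parent[max(a, b)] = min(a, b)   # record the equivalence
                labels[r][c] = min(a, b)
            else:
                labels[r][c] = labels[r][c - 1] if left else labels[r - 1][c]
    # Second pass: replace every provisional label by its representative.
    return [[find(lab) for lab in row] for row in labels]


print(hoshen_kopelman([[1, 1, 0],
                       [0, 1, 0],
                       [1, 0, 0]]))
# → [[0, 0, 1], [2, 0, 1], [3, 1, 1]]
```

Unlike self-labeling, this needs only two sweeps over the image regardless of cluster size, but the union-find lookups are data-dependent, which fits the slide's remark that vectorization of the CPU code was left to the compiler.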
    • Connected Comp. Labeling - Benchmark
      CPU: Intel Xeon E5-2670, 8 cores + 2-way Hyper-Threading @ 2.6 GHz
      Xeon Phi: 60 cores + 4-way Hyper-Threading @ 1.1 GHz
      Application: Swendsen-Wang cluster algorithm for the 2D Ising model
    • Acknowledgement
      Work partially funded by BMBF Grant No. 01IH11004G
      Dr. Thomas Steinke, Zuse-Institute Berlin (ZIB)
      Dr. Michael Klemm, Intel GmbH, Germany
    • References
      [1] C. F. Baillie and P. D. Coddington. Cluster Identification Algorithms for Spin Models – Sequential and Parallel, 1991.
      [2] J. Hoshen and R. Kopelman. Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm. Phys. Rev. B, 14:3438–3445, 1976.
      [3] R. H. Swendsen and J.-S. Wang. Nonuniversal Critical Dynamics in Monte Carlo Simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.
      [4] Intel Corp. Intel Xeon Phi Coprocessor 5110P, Product Brief, 2012.