Getting Performance from Xeon Phi Easily

This deck reports lessons learned from porting two applications, CalcuNetW and GammaMaps, to the Intel Xeon Phi coprocessor. CalcuNetW computes measurements on complex networks and relies on the MKL libraries; GammaMaps computes gamma maps, a figure of merit for comparing dose distributions in radiation therapy, and is parallelised with OpenMP. With minimal modifications, limited to pragmas, both applications were run on the Xeon Phi in native and offload modes. In these tests one Xeon Phi performed roughly on par with one Xeon E5-2680, and I/O on the Xeon Phi was slow. Improving on this requires additional optimization work.

Can You Get Performance from Xeon Phi Easily?
Lessons Learned from Two Real Cases

Objective
• Assess how much work is needed to use the Intel Xeon Phi.
• Make only minimal modifications, using pragmas alone.
• Two applications:
– CalcuNetW: tests the MKL libraries.
– GammaMaps: tests pragmas.
• Two modes:
– Native: compiled to execute only on the Xeon Phi.
– Offload: uses the host plus the Xeon Phi.

CalcuNetW: Calculate Measurements in Complex Networks
• Complex networks consist of sets of nodes (vertices) joined together in pairs by links (edges).
• For each network, the application calculates:
– Subgraph Centrality (SC): characterizes the participation of each node in all subgraphs of the network.
– SC odd: counts only closed walks of odd length.
– SC even: counts only closed walks of even length.
– Bipartivity: the proportion of even closed walks to the total number of closed walks in the network.
– Network Communicability for Connected Nodes, C(p,q): measures how well two nodes in the network communicate.
– Network Communicability, C(G): the mean of all the C(p,q).
Mouriño J.C., Estrada E., Gomez A., “CalcuNetw: Calculate Measurements in Complex Networks”, Informe Técnico CESGA-2005-003.
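The measures listed above can be illustrated on a tiny graph. The following is a minimal sketch, assuming the usual matrix-function definitions (SC(i) as the i-th diagonal entry of exp(A), the even and odd parts taken from cosh(A) and sinh(A), and C(p,q) as entry (p,q) of exp(A), where A is the adjacency matrix). It sums the power series directly in pure Python rather than calling MKL as CalcuNetW does. The function names, the truncation at 40 terms, and the choice to average C(p,q) over distinct pairs p < q are assumptions made for illustration.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_series(adj, terms=40):
    """Return (even, odd), the even and odd parts of sum_k A^k / k!.

    even approximates cosh(A), odd approximates sinh(A); even + odd
    approximates exp(A). The cutoff `terms` is an illustrative assumption
    that is adequate only for small graphs.
    """
    n = len(adj)
    even = [[0.0] * n for _ in range(n)]
    odd = [[0.0] * n for _ in range(n)]
    # term holds A^k / k!, starting from the identity (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(terms):
        target = even if k % 2 == 0 else odd
        for i in range(n):
            for j in range(n):
                target[i][j] += term[i][j]
        # Next term: A^(k+1) / (k+1)!
        term = [[v / (k + 1) for v in row] for row in mat_mul(term, adj)]
    return even, odd


def network_measures(adj):
    """Compute the measures for an undirected graph with at least two nodes."""
    n = len(adj)
    even, odd = walk_series(adj)
    sc_even = [even[i][i] for i in range(n)]
    sc_odd = [odd[i][i] for i in range(n)]
    sc = [e + o for e, o in zip(sc_even, sc_odd)]
    comm = [[even[p][q] + odd[p][q] for q in range(n)] for p in range(n)]
    # Assumption: C(G) averages C(p,q) over distinct pairs p < q.
    pairs = [(p, q) for p in range(n) for q in range(p + 1, n)]
    mean_comm = sum(comm[p][q] for p, q in pairs) / len(pairs)
    return {
        "sc": sc,
        "sc_odd": sc_odd,
        "sc_even": sc_even,
        "bipartivity": sum(sc_even) / sum(sc),
        "communicability": comm,
        "mean_communicability": mean_comm,
    }


# Two small test graphs: a triangle (not bipartite) and a 3-node path (bipartite).
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

if __name__ == "__main__":
    for name, graph in (("triangle", triangle), ("path", path)):
        print(name, "bipartivity:", network_measures(graph)["bipartivity"])
```

A bipartite graph has no closed walks of odd length, so its bipartivity under these definitions is 1, while the triangle scores lower. For networks of realistic size the dense matrix work would be delegated to an optimized library, which is the part of CalcuNetW that exercises MKL.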

GammaMaps: A figure-of-merit in Radiation Therapy
[Figure: two 3D grids with X, Y and Z axes; label: dose in voxel i,j,k.]

GammaMaps: A figure-of-merit in Radiation Therapy
Workflow: Read Doses → Initialise and normalise → Compute Gamma → Store Gamma
• Application written in Fortran 90
• Parallelised using OpenMP
• Geometric algorithm*
• 512 x 512 x 128 = 33,554,432 voxels
• Auto-vectorization
• Pragmas for offload
* T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, “Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation,” Medical Physics, vol. 35, no. 3, p. 879, 2008.
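To make the "Compute Gamma" step concrete, here is a minimal sketch of a gamma map on a small voxel grid. It is not the interpolation-free geometric algorithm of Ju et al. that GammaMaps uses; it is a plain brute-force search over neighbouring voxels, based on the standard gamma definition (the minimum, over evaluated points, of the square root of the squared distance divided by the distance tolerance squared plus the squared dose difference divided by the dose tolerance squared). The function name `gamma_map`, its parameters, the absolute dose tolerance and the search radius are all assumptions made for illustration.

```python
import math
from itertools import product


def gamma_map(ref, evl, spacing, dose_tol, dist_tol, radius=1):
    """Brute-force gamma map for two dose grids indexed as [z][y][x].

    ref, evl  -- reference and evaluated doses (nested lists, same shape)
    spacing   -- voxel size (dz, dy, dx) in the same unit as dist_tol
    dose_tol  -- dose tolerance, in absolute dose units (assumption)
    dist_tol  -- distance-to-agreement tolerance
    radius    -- search window half-width in voxels (assumption)
    """
    nz, ny, nx = len(ref), len(ref[0]), len(ref[0][0])
    sz, sy, sx = spacing

    # Precompute the distance term for every offset in the search window.
    offsets = []
    for dk, dj, di in product(range(-radius, radius + 1), repeat=3):
        dist2 = (dk * sz) ** 2 + (dj * sy) ** 2 + (di * sx) ** 2
        offsets.append((dk, dj, di, dist2 / dist_tol ** 2))

    gamma = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for k, j, i in product(range(nz), range(ny), range(nx)):
        best = math.inf
        for dk, dj, di, dist_term in offsets:
            kk, jj, ii = k + dk, j + dj, i + di
            if not (0 <= kk < nz and 0 <= jj < ny and 0 <= ii < nx):
                continue  # neighbour falls outside the grid
            dose_diff = evl[kk][jj][ii] - ref[k][j][i]
            best = min(best, dist_term + (dose_diff / dose_tol) ** 2)
        gamma[k][j][i] = math.sqrt(best)
    return gamma


def uniform_grid(value, shape=(3, 3, 3)):
    """Helper for the examples: a grid filled with a single dose value."""
    nz, ny, nx = shape
    return [[[value] * nx for _ in range(ny)] for _ in range(nz)]


if __name__ == "__main__":
    reference = uniform_grid(1.0)
    evaluated = uniform_grid(1.03)
    result = gamma_map(reference, evaluated, (1.0, 1.0, 1.0), 0.03, 3.0)
    print(result[1][1][1])
```

Each voxel is computed independently of the others, which is why the outer loop over voxels parallelises naturally with OpenMP in the Fortran application. At 512 x 512 x 128 voxels a pure Python loop like this one would be far too slow; it is meant only to show the quantity being computed.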

Platform

Host
  CPU model:         Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
  Nr. of cores:      16
  Memory:            32788 MB
  Operating system:  Linux 2.6.32-279.el6.x86_64
  Compiler version:  2013U2

Intel Xeon Phi
  Model:             Beta0 Engineering Sample
  Nr. of cores:      61 at 1.09 GHz
  Memory:            7936 MB
  Operating system:  MPSS Gold U1
  Compiler version:  2013U2
  GDDR technology:   GDDR5
  GDDR frequency:    2750000 kHz

• Remote access to Intel systems
• Feb. 2013

Intel Xeon Phi Affinity Policies
[Figure: four panels, each showing four cores (C1 to C4) with four hardware threads (HT1 to HT4) per core and the placement of eight threads, numbered 0 to 7. Thread labels shown in each panel:]
COMPACT - FINE: 0 1 2 3 4 5 6 7
SCATTER - FINE: 0 4 1 5 2 6 3 7
BALANCED - FINE: 0 1 2 3 4 5 6 7
BALANCED - CORE: {0,1} {2,3} {4,5} {6,7}
• Type
– Compact
– Scatter
– Balanced
• Granularity
– Fine or Thread
– Core
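The three affinity types can be illustrated with a small placement model. This is a simplified sketch, not the OpenMP runtime's actual algorithm: it assumes that compact fills every hardware thread of one core before moving to the next, that scatter deals threads round-robin across cores, and that balanced spreads threads evenly while keeping consecutive thread numbers on the same core. The function name `place_threads` and its parameters are hypothetical; the defaults mirror the four-core, four-hardware-thread layout drawn in the figure.

```python
def place_threads(policy, n_threads, n_cores=4, hw_threads=4):
    """Return a list with, for each core, the thread numbers placed on it.

    Simplified model of the compact, scatter and balanced affinity types.
    Granularity (fine versus core) is not modelled: the result only says
    which core a thread lands on, not which hardware thread within it.
    """
    if n_threads > n_cores * hw_threads:
        raise ValueError("more threads than hardware threads")

    cores = [[] for _ in range(n_cores)]
    if policy == "compact":
        # Fill all hardware threads of a core before using the next core.
        for t in range(n_threads):
            cores[t // hw_threads].append(t)
    elif policy == "scatter":
        # Deal threads round-robin over the cores.
        for t in range(n_threads):
            cores[t % n_cores].append(t)
    elif policy == "balanced":
        # Spread evenly, keeping neighbouring thread numbers together.
        base, extra = divmod(n_threads, n_cores)
        first = 0
        for c in range(n_cores):
            count = base + (1 if c < extra else 0)
            cores[c] = list(range(first, first + count))
            first += count
    else:
        raise ValueError("unknown policy: " + policy)
    return cores


if __name__ == "__main__":
    for policy in ("compact", "scatter", "balanced"):
        print(policy, place_threads(policy, 8))
```

With eight threads, this model leaves two cores idle under compact, puts threads 0 and 4 on the first core under scatter, and pairs threads 0 and 1 on the first core under balanced, which matches the thread labels in the figure. Under core granularity, as in the BALANCED - CORE panel, each pair is bound to its core as a set rather than to individual hardware threads.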

Host
[Figure: elapsed time (s), 0 to 1400, versus number of threads, 0 to 20. Series: Host, local-compact-core, local-compact-fine, local-scatter-fine, local-scatter-core.]

Conclusions
• Using the MKL library is easy and requires no changes to the code.
• Simple pragmas in the code allow the Xeon Phi to be used quickly.
• The Xeon Phi shows I/O performance issues.
• 1 Xeon Phi ~ 1 Xeon E5-2680.
• Improving performance requires additional work.

Acknowledgements
The authors would like to thank Intel for providing access to the Intel Xeon Phi coprocessor.

Questions
The team: Andrés Gómez, José Carlos Mouriño, Carmen Cotelo, Aurelio Rodríguez

Slide outline
1. Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases
2. Objective
3. CalcuNetW: Calculate Measurements in Complex Networks
4. CalcuNetW
5. GammaMaps: A figure-of-merit in Radiation Therapy
6. GammaMaps: A figure-of-merit in Radiation Therapy
7. Results of Experiments
8. Platform
9. Intel Xeon Phi Affinity Policies
10. Results for CalcuNetW
11. CalcuNetW
12. CalcuNetW
13. CalcuNetW
14. Results for GammaMaps
15. GammaMaps
16. Host
17. GammaMaps
18. Xeon Phi poor I/O
19. Conclusions
20. Acknowledgements
21. Questions
