Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach
Reading Circle in Taura Laboratory
B4 Takuya Fukuoka
November 29, 2018
1
About this paper
Authors: Vladimir Marjanovic, Josep M. Perez, Eduard Ayguade, Jesus Labarta, Mateo Valero
They are from the Barcelona Supercomputing Center.
This paper was published in ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing.
I will introduce the paper using the version uploaded here.
Number of citations: 78 (as of November 2018)
This paper is cited by MPI+ULT, HCMPI (Habanero-C + MPI), and MPIQ (MPI + Qthreads).
2
Introduction (1)
This paper focuses on the hybrid use of MPI and task parallelism (SMPSs).
There are two kinds of MPI functions:
Blocking functions (MPI_Send, MPI_Recv)
Non-blocking functions (MPI_Isend, MPI_Irecv)
The advantage of non-blocking functions:
Communication/computation overlap
The disadvantages of non-blocking functions:
Code complexity
Lower programming productivity
3
Introduction (2)
MPI/SMPSs approach
Enables communication/computation overlap with simple annotations (less code complexity)
Tolerates low network bandwidth and preemptions
A restart mechanism is adopted in MPI/SMPSs
Evaluated with HPL (the LINPACK benchmark)
4
Topics
SMPSs
Hybrid MPI/SMPSs
HPL benchmark
Performance results
My thoughts
5
SMPSs (1)
SMP superscalar programming model
Extension of the standard C/Fortran programming languages
Source-to-source compiler
Uses pragmas/directives to declare functions that are potential tasks
#pragma css task input(n) output(result)
void factorial(unsigned int n, unsigned int *result)
{
    *result = 1;
    for (; n > 1; n--)
        *result = *result * n;
}

int main(void)
{
    unsigned int n = 5;   /* example value, not from the slide */
    unsigned int result;
    /* executed in one piece as a task */
    factorial(n, &result);
    return 0;
}
6
SMPSs (2)
#pragma css task [clause-list]
{ function-header | function-definition }
There are four kinds of clauses:
input(data-reference-list)
output(data-reference-list)
inout(data-reference-list)
highpriority
The first three clauses (input, output, inout) are used to determine the dependencies between tasks.
The highpriority clause is used to specify priority when scheduling tasks.
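How the three data clauses turn into dependencies can be sketched in plain C. This is a minimal illustration, not the SMPSs runtime: the names `Dir`, `Dep`, and `classify` are hypothetical, and the sketch assumes two tasks that name the same data, with the earlier task given first.

```c
#include <assert.h>

/* Hypothetical names for illustration; not part of the SMPSs API. */
typedef enum { DIR_INPUT, DIR_OUTPUT, DIR_INOUT } Dir;
typedef enum { DEP_NONE, DEP_TRUE, DEP_ANTI, DEP_OUTPUT } Dep;

static int reads(Dir d)  { return d == DIR_INPUT  || d == DIR_INOUT; }
static int writes(Dir d) { return d == DIR_OUTPUT || d == DIR_INOUT; }

/* Classify the dependence of a later task on an earlier task when both
 * name the same data in their clauses. */
Dep classify(Dir earlier, Dir later)
{
    if (writes(earlier) && reads(later))  return DEP_TRUE;   /* read after write  */
    if (writes(earlier) && writes(later)) return DEP_OUTPUT; /* write after write */
    if (reads(earlier)  && writes(later)) return DEP_ANTI;   /* write after read  */
    return DEP_NONE;                                         /* read after read   */
}
```

Only `DEP_TRUE` reflects a real flow of data; `DEP_ANTI` and `DEP_OUTPUT` are false dependences caused by reusing the same storage.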
7
SMPSs (3)
#pragma css barrier
#pragma css wait on <list of variables>
The SMPSs programming model lets the programmer add synchronization points.
With the first pragma, execution stops until all tasks have finished.
With the second pragma, execution stops until the variables in <list of variables> have been completely updated.
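A small sketch of how the wait pragma might be used. The task functions `square` and `add` and the helper `sum_of_squares` are hypothetical, the clause spelling `wait on(total)` is an assumption based on the form shown on the slide, and any program-level initialisation the runtime needs is omitted. A regular C compiler ignores the unknown pragmas and runs the calls sequentially, which gives the same result.

```c
#include <assert.h>

/* Hypothetical tasks for illustration. */
#pragma css task input(x) output(y)
void square(int x, int *y) { *y = x * x; }

#pragma css task input(a, b) output(sum)
void add(int *a, int *b, int *sum) { *sum = *a + *b; }

int sum_of_squares(int p, int q)
{
    int sp, sq, total;
    square(p, &sp);          /* independent of the next call */
    square(q, &sq);
    add(&sp, &sq, &total);   /* depends on both square tasks */
    /* Stop here until `total` has been completely updated. */
    #pragma css wait on(total)
    return total;
}
```

Replacing the wait pragma with `#pragma css barrier` would also be safe here, at the cost of waiting for every outstanding task rather than only those producing `total`.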
8
SMPSs internals (1)
The compiler replaces calls to functions declared as tasks with calls to the css_addTask function,
which specifies the function to be executed and its arguments.
The css_addTask function builds a task dependence graph,
considering the memory address, size, and direction of each parameter.
Once a task finishes, the global ready queue is updated:
all tasks with no pending dependences are inserted into the global ready queue.
There are two global ready queues: one for high-priority tasks and one for low-priority tasks.
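The ready-queue update described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual runtime: the `Task` and `Queue` structures, the fixed `MAX_TASKS` capacity, and the function `on_task_finished` are all hypothetical.

```c
#include <assert.h>

#define MAX_TASKS 8

/* Hypothetical task record; the runtime's real structures are not shown on the slide. */
typedef struct {
    int pending;            /* number of unfinished predecessors */
    int high_priority;      /* nonzero if declared with highpriority */
    int nsucc;              /* number of dependent tasks */
    int succ[MAX_TASKS];    /* indices of dependent tasks */
} Task;

typedef struct { int items[MAX_TASKS]; int count; } Queue;

static void push(Queue *q, int id)
{
    assert(q->count < MAX_TASKS);
    q->items[q->count++] = id;
}

/* Called when task `done` finishes: every successor whose last pending
 * dependence was `done` moves to the matching global ready queue. */
void on_task_finished(Task *tasks, int done, Queue *high, Queue *low)
{
    for (int i = 0; i < tasks[done].nsucc; i++) {
        int s = tasks[done].succ[i];
        if (--tasks[s].pending == 0)
            push(tasks[s].high_priority ? high : low, s);
    }
}
```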
9
SMPSs internals (2)
Each worker thread has its own ready queue and also looks for ready tasks in the global ready queues.
There is a work-stealing mechanism.
From the SMP Superscalar (SMPSs) User's Manual
10
SMPSs internals (3)
To avoid false dependences, a renaming mechanism is implemented (like register renaming in superscalar processors).
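The idea behind renaming can be sketched in C. This is an illustrative model only: the `Datum` record and `storage_for_writer` are hypothetical, and freeing the old version once its readers finish is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for one piece of data. */
typedef struct {
    void  *current;          /* storage holding the latest version */
    size_t size;             /* size of the data in bytes */
    int    pending_readers;  /* tasks still reading `current` */
} Datum;

/* Return the storage a new writer task should use. If earlier tasks are
 * still reading the current version, the writer gets fresh storage so it
 * need not wait for them (this removes the write-after-read dependence). */
void *storage_for_writer(Datum *d)
{
    if (d->pending_readers > 0) {
        void *fresh = malloc(d->size);
        if (fresh == NULL)
            return d->current;   /* no memory: the caller must keep the dependence */
        d->current = fresh;      /* later tasks see the new version */
        d->pending_readers = 0;  /* nobody reads the new version yet */
    }
    return d->current;
}
```

As with register renaming, the cost is extra storage: each renamed version occupies memory until its last reader has finished.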
11
Hybrid MPI/SMPSs (1)
MPI functions are encapsulated as tasks using directives
A blocking function is split into a non-blocking call and a wait call
#pragma css task output(buf, req)
void recv(<type> buf[count], MPI_Request *req) {
    MPI_Irecv(buf, ..., req);
}
12
Hybrid MPI/SMPSs (2)
#pragma css restart
To avoid deadlock, the restart pragma is introduced.
It aborts the execution of the current task and puts it back in the ready queue.
#pragma css task input(req)
void wait(MPI_Request *req) {
    int go;
    MPI_Test(req, &go, ...);
    if (go == 0) {
        /* not complete yet: abort this task and requeue it */
        #pragma css restart
    }
    MPI_Wait(req, ...);
}
void application_receive() {
    recv();
    wait();
}
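The effect of the restart mechanism can be simulated without MPI. In this sketch everything is a stand-in: `FakeRequest` models a request that completes after a given number of polls, `test_request` plays the role of MPI_Test, and `run_worker` is a hypothetical single worker draining a ready queue. It is meant to show why a worker never blocks inside a wait task, not how the runtime is implemented.

```c
#include <assert.h>

#define QCAP 16

/* Hypothetical stand-in for an MPI_Request: completes after `polls_left` polls. */
typedef struct { int polls_left; } FakeRequest;

typedef struct { FakeRequest *req; } WaitTask;

typedef struct { WaitTask items[QCAP]; int head, count; } ReadyQueue;

static void enqueue(ReadyQueue *q, WaitTask t)
{
    assert(q->count < QCAP);
    q->items[(q->head + q->count) % QCAP] = t;
    q->count++;
}

static WaitTask dequeue(ReadyQueue *q)
{
    assert(q->count > 0);
    WaitTask t = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return t;
}

/* Non-blocking completion check, playing the role of MPI_Test. */
static int test_request(FakeRequest *r)
{
    if (r->polls_left > 0) { r->polls_left--; return 0; }
    return 1;
}

/* Run wait tasks until the queue drains; returns the number of restarts. */
int run_worker(ReadyQueue *q)
{
    int restarts = 0;
    while (q->count > 0) {
        WaitTask t = dequeue(q);
        if (!test_request(t.req)) {
            enqueue(q, t);   /* "restart": abort the task and put it back */
            restarts++;
        }
        /* otherwise the task completes without having blocked the worker */
    }
    return restarts;
}
```

Because an incomplete wait task goes back to the queue, the worker is free to pick up other ready tasks in between, which is what avoids the deadlock mentioned on the slide.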
13
HPL benchmark (1)
The most widely used benchmark for measuring the floating-point execution rate of a computer
The kernel solves a system of linear equations
It implements LU decomposition with partial pivoting
It introduces a look-ahead technique to overlap computation and communication
14
HPL benchmark (2)
This is the flow with multiple processes.
A barrier is used to keep one MPI call at a time in each process.
15
Performance results (1)
Basic comparison
In spite of the abstraction overhead, MPI/SMPSs shows better performance than pure MPI:
The use of non-blocking MPI calls
The efficient use of memory by adopting the look-ahead technique
As the matrix size increases, the performance rate increases:
The influence of communication overhead becomes smaller
16
Performance results (2)
Tolerance to low bandwidth
For each message of size S, an additional message of size f*S is transferred between two dummy buffers at the sender and receiver
The normalized effective bandwidth is 1/(1+f)
The pure MPI version is much more sensitive
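The 1/(1+f) factor follows from a short calculation. As an assumption for illustration, suppose the extra message shares the same link of raw bandwidth B as the original message; the symbols B, t, and B_eff are introduced here and do not appear on the slide.

```latex
t = \frac{S + fS}{B} = \frac{(1+f)\,S}{B},
\qquad
B_{\mathrm{eff}} = \frac{S}{t} = \frac{B}{1+f},
\qquad
\frac{B_{\mathrm{eff}}}{B} = \frac{1}{1+f}.
```

For example, f = 1 doubles the traffic per useful message and halves the normalized effective bandwidth to 0.5.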
17
Performance results (3)
Tolerance to OS noise
An additional thread is generated per process that iterates over a loop alternating sleeping and computing phases.
The duration of each periodic computing phase is fixed at 0.5 s
The duration of each periodic sleeping phase is varied
The pure MPI version is much more sensitive
18
Future work
Implement MPI collective operations by combining the hybrid MPI/SMPSs programming model with a non-blocking MPI collective operations library
19
My Thoughts (1)
It is unfair to compare hybrid MPI/SMPSs only with pure MPI. I think a non-blocking MPI version, with its code complexity, also has to be measured.
As far as I know, it is impossible to nest tasks in the SMPSs model, and this restricts programmability.
The hybrid MPI/SMPSs model cannot deal with invocation of MPI calls from multiple threads. The programmer has to serialize the MPI calls using barriers. In this respect, this paper is not useful for my research.
20
My Thoughts (2)
The measurement of tolerance to low bandwidth and OS noise is interesting to me. I can adopt this idea in my research.
21
