SlideShare a Scribd company logo
1 of 18
Programming Multiple Devices Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes This lecture describes the different ways to work with multiple devices in OpenCL (i.e., within a single context and using multiple contexts), and the tradeoffs associated with each approach The lecture concludes with a quick discussion of heterogeneous load-balancing issues when working with multiple devices 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Approaches to Multiple Devices Single context, multiple devices Standard way to work with multiple devices in OpenCL Multiple contexts, multiple devices Computing on a cluster, multiple systems, etc. Considerations for CPU-GPU heterogeneous computing 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices Nomenclature: “clEnqueue*” is used to describe any of the clEnqueue commands (i.e., those that interact with a device) E.g. clEnqueueNDRangeKernel(), clEnqueueReadImage() “clEnqueueRead*” and “clEnqueueWrite*” are used to describe reading/writing to either buffers or images E.g. clEnqueueReadBuffer(), clEnqueueWriteImage() 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices Associating specific devices with a context is done by passing a list of the desired devices to clCreateContext() The call clCreateContextFromType() takes a device type (or combination of types) as a parameter and creates a context with all devices of that type: 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices When multiple devices are part of the same context, most OpenCL objects are shared Memory objects, programs, kernels, etc. One command queue must exist per device and is supplied in OpenCL when the target GPU needs to be specified Any clEnqueue* function takes a command queue as an argument Context 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices While memory objects are common to a context, they must be explicitly written to a device before being used  Whether or not the same object can be valid on multiple devices is vendor specific OpenCL does not assume that data can be transferred directly between devices, so commands only exists to move from a host to device, or device to host Copying from one device to another requires an intermediate transfer to the host Context 2) Object now valid on host 1) clEnqueueRead*(cq0, ...) copies object to host 3) clEnqueueWrite*(cq1, ...) copies object to device 1 0) Object starts on device 0 4) Object ends up on device 1 TWO PCIe DATA TRANSFERS ARE REQUIRED 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices The behavior of a memory object written to multiple devices is vendor-specific OpenCL does not define if a copy of the object is made or whether the object remains valid once written to a device We can imagine that a CPU would operate on a memory object in-place, while a GPU would make a copy (so the original would still be valid until it is explicitly written over) Fusion GPUs from AMD could potentially operate on data in-place as well Currently AMD/NVIDIA implementations allow an object to be copied to multiple devices (even if the object will be written to) When data is read back, separate host pointers must be supplied or one set of results will be clobbered Context When writing data to a GPU, a copy is made, so multiple writes are valid  clEnqueueWrite*(cq0, ...) clEnqueueWrite*(cq1, ...) 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices Just like writing a multi-threaded CPU program, we have two choices for designing multi-GPU programs Redundantly copy all data and index using global offsets Split the data into subsets and index into the subset GPU 1 GPU 0 A A Threads 0 1 2 3 4 5 6 7 GPU 1 GPU 0 A0 A1 Threads 0 1 2 3 0 1 2 3 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices OpenCL provides mechanisms to help with both multi-device techniques clEnqueueNDRangeKernel() optionally takes offsets that are used when computing the global ID of a thread Note that for this technique to work, any objects that are written to will have to be synchronized manually SubBufferswere introduced in OpenCL 1.1 to allow a buffer to be split into multiple objects This allows reading/writing to offsets within a buffer to avoid manually splitting and recombining data 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Single Context, Multiple Devices OpenCLevents are used to synchronize execution on different devices within a context Each clEnqueue* function generates an event that identifies the operation Each clEnqueue* function also takes an optional list of events that must complete before that operation should occur clEnqueueWaitForEvents() is the specific call to wait for a list of events to complete Events are also used for profiling and were covered in more detail in Lecture 11 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Multiple Contexts, Multiple Devices An alternative approach is to create a redundant OpenCL context (with associated objects) per device Perhaps is an easier way to split data (based on the algorithm) Would not have to worry about coding for a variable number of devices Could use CPU-based synchronization primitives (such as locks, barriers, etc.) Communicate using host-based libraries Context Context 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Follows SPMD model more closely CUDA/C’s runtime-API approach to multi-device code No code required to consider explicitly moving data between a variable number of devices Using functions such as scatter/gather, broadcast, etc. may be easier than creating subbuffers, etc. for a variable number of devices Supports distributed programming If a distributed framework such as MPI is used for communication, programs can be ran on multi-device machines or in distributed environments Multiple Contexts, Multiple Devices 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
In addition to PCI-Express transfers required to move data between host and device, extra memory and network communication may be required Host libraries (e.g., pthreads, MPI) must be used for synchronization and communication Multiple Contexts, Multiple Devices 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Heterogeneous Computing Targeting heterogeneous devices (e.g., CPUs and GPUs at the same time) requires awareness of their different performance characteristics for an application To generalize: Context *otherwise application wouldn’t use OpenCL 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Heterogeneous Computing Factors to consider Scheduling overhead  What is the startup time of each device? Location of data  Which device is the data currently resident on? Data must be transferred across the PCI-Express bus Granularity of workloads How should the problem be divided? What is the ratio of startup time to actual work Execution performance relative to other devices How should the work be distributed? 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Heterogeneous Computing Granularity of scheduling units must be weighed Workload sizes that are too large may execute slowly on a device, stalling overall completion Workload sizes that are too small may be dominated by startup overhead Approach to load-balancing #1: Begin scheduling small workload sizes Profile execution times on each device  Extrapolate execution profiles for larger workload sizes Schedule with larger workload sizes to avoid unnecessary overhead	 Approach to load-balancing #2: If one device is much faster than anything else in the system, just run on that device 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Summary There are different approaches to multi-device programming Single context, multiple devices Can only communicate with devices recognized by one vendor Code must be written for a general number of devices Multiple contexts, multiple devices More like distributed programming Code can be written for a single device (or multiple devices), with explicit movement of data between contexts 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

More Related Content

What's hot

Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowEmanuel Di Nardo
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaHiroki Nakahara
 
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsFCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsKusano Hitoshi
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureMani Goswami
 
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...Hiroki Nakahara
 
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...Big Data Spain
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute modelsPedram Mazloom
 
Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13specialk29
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC clusterExperiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC clusterAntti Vanne
 
Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1paul0001
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Yulia Tsisyk
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challengekenluck2001
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewPoo Kuan Hoong
 

What's hot (20)

Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsFCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
 
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
TENSORFLOW: ARCHITECTURE AND USE CASE - NASA SPACE APPS CHALLENGE by Gema Par...
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute models
 
Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13Seattle Scalability Meetup 6-26-13
Seattle Scalability Meetup 6-26-13
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC clusterExperiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC cluster
 
Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1Comp7404 ai group_project_15apr2018_v2.1
Comp7404 ai group_project_15apr2018_v2.1
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
 
Green scheduling
Green schedulingGreen scheduling
Green scheduling
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challenge
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
 

Similar to Lec13 multidevice

Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration SummitSeastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration SummitDon Marti
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8AbdullahMunir32
 
Survey of open source cloud architectures
Survey of open source cloud architecturesSurvey of open source cloud architectures
Survey of open source cloud architecturesabhinav vedanbhatla
 
At the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with OpenstackAt the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with OpenstackRyan Aydelott
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...Dr. Thippeswamy S.
 
Systems Support for Many Task Computing
Systems Support for Many Task ComputingSystems Support for Many Task Computing
Systems Support for Many Task ComputingEric Van Hensbergen
 
Engineer Engineering Software
Engineer Engineering SoftwareEngineer Engineering Software
Engineer Engineering SoftwareYung-Yu Chen
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfsamaghorab
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsGokhan Boranalp
 
Computer_Clustering_Technologies
Computer_Clustering_TechnologiesComputer_Clustering_Technologies
Computer_Clustering_TechnologiesManish Chopra
 
Openstack_administration
Openstack_administrationOpenstack_administration
Openstack_administrationAshish Sharma
 
parallel programming models
 parallel programming models parallel programming models
parallel programming modelsSwetha S
 
Clustering by AKASHMSHAH
Clustering by AKASHMSHAHClustering by AKASHMSHAH
Clustering by AKASHMSHAHAkash M Shah
 

Similar to Lec13 multidevice (20)

Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration SummitSeastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
Survey of open source cloud architectures
Survey of open source cloud architecturesSurvey of open source cloud architectures
Survey of open source cloud architectures
 
At the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with OpenstackAt the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with Openstack
 
Clone cloud
Clone cloudClone cloud
Clone cloud
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
djypllh5r1gjbaekxgwv-signature-cc6692615bbc55079760b9b0c6636bc58ec509cd0446cb...
 
Systems Support for Many Task Computing
Systems Support for Many Task ComputingSystems Support for Many Task Computing
Systems Support for Many Task Computing
 
Engineer Engineering Software
Engineer Engineering SoftwareEngineer Engineering Software
Engineer Engineering Software
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdf
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
 
Computer_Clustering_Technologies
Computer_Clustering_TechnologiesComputer_Clustering_Technologies
Computer_Clustering_Technologies
 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentation
 
Openstack_administration
Openstack_administrationOpenstack_administration
Openstack_administration
 
Cluster computer
Cluster  computerCluster  computer
Cluster computer
 
Underlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computingUnderlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computing
 
parallel programming models
 parallel programming models parallel programming models
parallel programming models
 
Clustering by AKASHMSHAH
Clustering by AKASHMSHAHClustering by AKASHMSHAH
Clustering by AKASHMSHAH
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 

Recently uploaded

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Lec13 multidevice

  • 1. Programming Multiple Devices Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
  • 2. Instructor Notes This lecture describes the different ways to work with multiple devices in OpenCL (i.e., within a single context and using multiple contexts), and the tradeoffs associated with each approach The lecture concludes with a quick discussion of heterogeneous load-balancing issues when working with multiple devices 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 3. Approaches to Multiple Devices Single context, multiple devices Standard way to work with multiple devices in OpenCL Multiple contexts, multiple devices Computing on a cluster, multiple systems, etc. Considerations for CPU-GPU heterogeneous computing 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 4. Single Context, Multiple Devices Nomenclature: “clEnqueue*” is used to describe any of the clEnqueue commands (i.e., those that interact with a device) E.g. clEnqueueNDRangeKernel(), clEnqueueReadImage() “clEnqueueRead*” and “clEnqueueWrite*” are used to describe reading/writing to either buffers or images E.g. clEnqueueReadBuffer(), clEnqueueWriteImage() 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 5. Single Context, Multiple Devices Associating specific devices with a context is done by passing a list of the desired devices to clCreateContext() The call clCreateContextFromType() takes a device type (or combination of types) as a parameter and creates a context with all devices of that type: 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 6. Single Context, Multiple Devices When multiple devices are part of the same context, most OpenCL objects are shared Memory objects, programs, kernels, etc. One command queue must exist per device and is supplied in OpenCL when the target GPU needs to be specified Any clEnqueue* function takes a command queue as an argument Context 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 7. Single Context, Multiple Devices While memory objects are common to a context, they must be explicitly written to a device before being used Whether or not the same object can be valid on multiple devices is vendor specific OpenCL does not assume that data can be transferred directly between devices, so commands only exists to move from a host to device, or device to host Copying from one device to another requires an intermediate transfer to the host Context 2) Object now valid on host 1) clEnqueueRead*(cq0, ...) copies object to host 3) clEnqueueWrite*(cq1, ...) copies object to device 1 0) Object starts on device 0 4) Object ends up on device 1 TWO PCIe DATA TRANSFERS ARE REQUIRED 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 8. Single Context, Multiple Devices The behavior of a memory object written to multiple devices is vendor-specific OpenCL does not define if a copy of the object is made or whether the object remains valid once written to a device We can imagine that a CPU would operate on a memory object in-place, while a GPU would make a copy (so the original would still be valid until it is explicitly written over) Fusion GPUs from AMD could potentially operate on data in-place as well Currently AMD/NVIDIA implementations allow an object to be copied to multiple devices (even if the object will be written to) When data is read back, separate host pointers must be supplied or one set of results will be clobbered Context When writing data to a GPU, a copy is made, so multiple writes are valid clEnqueueWrite*(cq0, ...) clEnqueueWrite*(cq1, ...) 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 9. Single Context, Multiple Devices Just like writing a multi-threaded CPU program, we have two choices for designing multi-GPU programs Redundantly copy all data and index using global offsets Split the data into subsets and index into the subset GPU 1 GPU 0 A A Threads 0 1 2 3 4 5 6 7 GPU 1 GPU 0 A0 A1 Threads 0 1 2 3 0 1 2 3 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 10. Single Context, Multiple Devices OpenCL provides mechanisms to help with both multi-device techniques clEnqueueNDRangeKernel() optionally takes offsets that are used when computing the global ID of a thread Note that for this technique to work, any objects that are written to will have to be synchronized manually SubBufferswere introduced in OpenCL 1.1 to allow a buffer to be split into multiple objects This allows reading/writing to offsets within a buffer to avoid manually splitting and recombining data 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 11. Single Context, Multiple Devices OpenCLevents are used to synchronize execution on different devices within a context Each clEnqueue* function generates an event that identifies the operation Each clEnqueue* function also takes an optional list of events that must complete before that operation should occur clEnqueueWaitForEvents() is the specific call to wait for a list of events to complete Events are also used for profiling and were covered in more detail in Lecture 11 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 12. Multiple Contexts, Multiple Devices An alternative approach is to create a redundant OpenCL context (with associated objects) per device Perhaps is an easier way to split data (based on the algorithm) Would not have to worry about coding for a variable number of devices Could use CPU-based synchronization primitives (such as locks, barriers, etc.) Communicate using host-based libraries Context Context 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 13. Follows SPMD model more closely CUDA/C’s runtime-API approach to multi-device code No code required to consider explicitly moving data between a variable number of devices Using functions such as scatter/gather, broadcast, etc. may be easier than creating subbuffers, etc. for a variable number of devices Supports distributed programming If a distributed framework such as MPI is used for communication, programs can be ran on multi-device machines or in distributed environments Multiple Contexts, Multiple Devices 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 14. In addition to PCI-Express transfers required to move data between host and device, extra memory and network communication may be required Host libraries (e.g., pthreads, MPI) must be used for synchronization and communication Multiple Contexts, Multiple Devices 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 15. Heterogeneous Computing Targeting heterogeneous devices (e.g., CPUs and GPUs at the same time) requires awareness of their different performance characteristics for an application To generalize: Context *otherwise application wouldn’t use OpenCL 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 16. Heterogeneous Computing Factors to consider Scheduling overhead What is the startup time of each device? Location of data Which device is the data currently resident on? Data must be transferred across the PCI-Express bus Granularity of workloads How should the problem be divided? What is the ratio of startup time to actual work Execution performance relative to other devices How should the work be distributed? 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 17. Heterogeneous Computing Granularity of scheduling units must be weighed Workload sizes that are too large may execute slowly on a device, stalling overall completion Workload sizes that are too small may be dominated by startup overhead Approach to load-balancing #1: Begin scheduling small workload sizes Profile execution times on each device Extrapolate execution profiles for larger workload sizes Schedule with larger workload sizes to avoid unnecessary overhead Approach to load-balancing #2: If one device is much faster than anything else in the system, just run on that device 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 18. Summary There are different approaches to multi-device programming Single context, multiple devices Can only communicate with devices recognized by one vendor Code must be written for a general number of devices Multiple contexts, multiple devices More like distributed programming Code can be written for a single device (or multiple devices), with explicit movement of data between contexts 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Editor's Notes

  1. If CPUs were being used and operating on data in-place, the commands to transfer data would likely not actually do any copying.
  2. If the contexts were on the same physical machine, pthreads could be used to communicate and synchronize. In a distributed setting, MPI could be used.