GPUs vs CPUs For
Parallel Processing
Mohammed Billoo
MAB Labs, LLC
Outline
● Overview
● Processor Trends
● CPU Design
● GPU Design
● Comparison
● Summary/Follow-Up
Overview
● GPUs blow CPUs out of the water in raw processing horsepower, but only
for a specific set of problems
○ “Specific”: Where computation can be whittled down to a single algorithm that
can be applied independently across a large dataset
● Why?
○ The original purpose of CPUs (and their resulting design) has led to this limitation
○ The original purpose of GPUs (and their resulting design) has made them ideal
for use in this particular application
Processor Trends
● For about 20 years (until roughly 2003), the driving factor in processor
design was performance
○ Processor design had been targeted to provide more features and functionality to
users
○ Processor design had been driven by increased clock rate
■ “CPU arms race”
○ Fundamental CPU architecture was developed to maximize responsiveness
(i.e., minimize latency) of a single application run by a single user
● Since 2003, reduction in size of computational devices has shifted
focus from raw processing to energy consumption and heat dissipation
○ Battery life!
○ Resulted in vendors shifting focus from pure clock rate to the number of “cores” in
a processor
■ Core = processing element
CPU Design
● Traditionally, most CPU software was developed to behave in a
sequential manner
○ Before the advent of multiple cores that can operate in true parallel fashion,
either:
■ SW played tricks (e.g., time-slicing) to make it seem that multiple applications
were being executed in parallel (relying on increasing CPU clock rates)
■ Or HW was enhanced to overlap the steps of sequential instructions so that
processing “looked” parallel (e.g., pipelining)
● With increasing number of cores that can truly run in parallel on a
single CPU silicon die, SW developers have had to rethink app
development
○ Emphasis has been placed on parallel programs
○ But parallel development is not new!
○ Programs that truly run in parallel have been developed for decades
■ High performance computing applications
■ Run on expensive, dedicated HW
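
The software "trick" mentioned above can be sketched as round-robin time-slicing on a single core. This is a minimal illustration, not how any particular operating system does it; the `task` and `round_robin` names and the step counts are hypothetical.

```python
from collections import deque

def task(name, steps):
    """A hypothetical application that yields after each unit of work."""
    for i in range(steps):
        yield f"{name}:{i}"

def round_robin(tasks):
    """Run one step of each task in turn on a single 'core'."""
    queue = deque(tasks)
    trace = []
    while queue:
        current = queue.popleft()
        try:
            trace.append(next(current))  # run one slice of this task
            queue.append(current)        # then send it to the back of the line
        except StopIteration:
            pass                         # task finished; drop it
    return trace

trace = round_robin([task("A", 3), task("B", 2)])
print(trace)  # ['A:0', 'B:0', 'A:1', 'B:1', 'A:2']
```

Only one step runs at any moment, but the interleaved trace makes A and B appear to progress together. True multi-core parallelism removes the interleaving rather than hiding it.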
CPU Design
● The fundamental architecture of CPUs has limited the number of cores
that can exist on a single silicon die
○ Premise of CPU architecture was to (originally) optimize responsiveness of a
single application executed by a single user
○ HW design to support true parallel behavior has had to be “shoehorned”, limiting
the number of cores that is attainable
■ Practical number of CPU cores ⇒ ~10 to 20 on mainstream parts
● Nature of original CPU architectures has required additions to support
efficient floating point operations
○ Again, because there was no original need to perform floating-point operations
efficiently
○ Required additions to the Instruction Set Architecture (ISA), and in turn
modifications to the underlying HW
○ An alternative was to add a dedicated unit in the processor for floating-point
operations (i.e., a floating-point unit, or FPU)
CPU Design
● Why can’t more cores be easily included in CPU designs?
○ Over the past ~17 years (since 2003), the number of HW cores in CPU designs
has increased from 1 to roughly 10 or 20
○ Limited by the original CPU architecture, since more silicon “real estate” was
devoted to:
■ The control logic to transfer instructions and data to the core
■ The processor cache, to avoid repeatedly fetching frequently used instructions
and data from main memory
■ Goal was/has been to keep instruction and data access latencies to a
minimum
○ Unfortunately, there is less real estate available for the actual processing cores
● Transfer of data has been another issue
○ Again, because the original problem that CPUs were meant to solve didn’t involve
a significant amount of data
○ Data transfer speed therefore becomes a bottleneck for faster parallel processing
(Simplified) CPU Design
[Figure: block diagram with a Control block, four Cores, and a Cache]
* Fewer resources devoted to “actual” processing (i.e. core)
GPU Design
● GPUs were originally (and still are) designed for graphics intensive
applications
● Graphics applications are inherently parallel in nature
○ Each pixel is (usually) independent of another pixel
○ The same operations are (usually) performed on each pixel
○ Each frame usually consists of 100k to 1M (or more) pixels
● Because of the nature of the problem that GPUs were originally meant
to solve, they have become ideal candidates for highly parallel,
non-graphics applications
○ Machine Learning
○ Artificial Intelligence
○ Data Science
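
The pixel-independence property above can be sketched in a few lines. This is an illustration only: the `brighten` operation and the pixel values are made up, and the loop runs sequentially here, whereas a GPU would apply the same operation to many pixels at once.

```python
def brighten(pixel, amount=40):
    """Hypothetical per-pixel operation: add `amount`, clamp to 0..255."""
    return min(255, max(0, pixel + amount))

# A tiny 2x3 grayscale "frame" (made-up values).
frame = [
    [0, 100, 200],
    [250, 30, 255],
]

# Each output pixel depends only on the matching input pixel, so the
# pixels could be processed in any order, or all at the same time.
result = [[brighten(p) for p in row] for row in frame]
print(result)  # [[40, 140, 240], [255, 70, 255]]

# Processing the pixels in reverse order gives the same answer.
flat = [p for row in frame for p in row]
reverse_order = [brighten(p) for p in reversed(flat)][::-1]
print(reverse_order == [p for row in result for p in row])  # True
```

Because no pixel reads another pixel's result, there is nothing to synchronize, which is what makes this kind of workload a natural fit for many simple cores.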
GPU Design
● The fundamental problem that GPUs were meant to solve has allowed
for many more cores to be easily added over the years
○ Don’t really care about responsiveness of a single application but rather the
overall execution throughput
■ Gamer doesn’t care about how long it takes for a particular pixel to be
rendered, but rather an entire frame
■ A video editor doesn’t care about how long it takes for a particular pixel (or
even frame) to be processed, but rather an entire video
○ “Manycore” computing device vs CPU-based “multi-core” computing device
■ Manycore: thousands to tens of thousands of cores
■ Multi-core: Single, double-digit cores
● Nature of graphics applications resulted in native support for fast
floating point operations in GPUs
○ Ray-tracing, 2D, 3D graphics inherently must be done using floating point
numbers
○ HW was designed to support optimal floating point operations
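
The latency-versus-throughput trade-off above can be made concrete with a back-of-the-envelope model. The worker counts and per-item times below are hypothetical numbers chosen only for illustration, and the model is idealized (no scheduling or data-transfer cost).

```python
import math

def total_time(items, workers, time_per_item):
    """Time to finish `items` when `workers` run in parallel, each taking
    `time_per_item` per item (idealized: no scheduling or transfer cost)."""
    return math.ceil(items / workers) * time_per_item

# Hypothetical figures: a few fast cores vs many slower cores.
FEW_FAST = (8, 1)       # (workers, time per item), multi-core style
MANY_SLOW = (4096, 10)  # (workers, time per item), manycore style

for items in (1, 1_000_000):
    print(items, total_time(items, *FEW_FAST), total_time(items, *MANY_SLOW))
# 1 1 10
# 1000000 125000 2450
```

For a single item the few fast cores win (1 vs 10 time units), but for a million independent items the many slow cores finish about 50 times sooner under these assumed numbers. That is the sense in which a gamer cares about the frame, not the pixel.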
GPU Design
● Due to the original problem that GPUs were meant to solve, adding more
cores is much easier than on a CPU
○ The number of cores in a GPU has increased by orders of magnitude over the
years (vs. roughly 10x to 20x for CPUs)
○ GPU architecture allows for fast execution of instructions on a large dataset in
parallel
○ More silicon “real estate” devoted to the processing cores themselves vs control
logic to transfer instructions and data
● GPU Architecture was developed to allow for transfer of large datasets
○ Graphics processing involves transferring a ton of data at once (e.g. individual
frame of pixels)
○ Memory was optimized to NOT be a bottleneck
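
Why transferring data "at once" matters can be sketched with a simple cost model: every transfer pays a fixed overhead plus time proportional to its size. The overhead and bandwidth figures below are assumptions picked for round arithmetic, not measurements of any real bus or GPU.

```python
def transfer_time_us(num_transfers, total_bytes,
                     overhead_us=10, bytes_per_us=10_000):
    """Assumed model: fixed overhead per transfer plus size / bandwidth.
    10_000 bytes per microsecond corresponds to 10 GB/s."""
    return num_transfers * overhead_us + total_bytes / bytes_per_us

n_values = 1_000_000
total_bytes = n_values * 4  # one million 4-byte floats

bulk = transfer_time_us(1, total_bytes)             # one large copy
piecemeal = transfer_time_us(n_values, total_bytes)  # one copy per value

print(bulk)       # 410.0
print(piecemeal)  # 10000400.0
```

Under these assumptions the same 4 MB takes about 0.4 ms as one copy and about 10 s as a million small copies, because the fixed per-transfer overhead dominates.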
(Simplified) GPU Design
[Figure: block diagram with several rows, each with its own Control block and Cache alongside a long row of Cores]
* More processor resources devoted to cores
Comparison
| Category | CPU | GPU |
|---|---|---|
| Number of cores | Few (10s, possibly 100s) | Many (1,000s to 10,000s) |
| Capability of each core | Can perform more complex operations | Can perform simpler operations |
| Floating Point Support | Added later (either via modifications to the ISA or with a dedicated FPU) | Native support in computation core |
| Memory Transfer | Slower and much more frequent (can use cache to alleviate this) | Faster and much less frequent (usually transfers a large dataset between system memory and GPU memory “at once”) |
| SW Development Effort | Simpler | More complex (requires the dataset to be structured a certain way and the SW to be written a particular way to leverage the HW) |
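
The slides do not say which data structure is meant by "structured a certain way". One common example is regrouping an array of records into one contiguous array per field (structure-of-arrays), so the same operation can sweep across a whole field. The sketch below assumes that reading; the particle fields and the `to_soa` helper are hypothetical.

```python
# Array-of-structures: one record per particle (made-up data).
aos = [
    {"x": 1.0, "y": 2.0, "mass": 0.5},
    {"x": 3.0, "y": 4.0, "mass": 1.5},
    {"x": 5.0, "y": 6.0, "mass": 2.5},
]

def to_soa(records):
    """Regroup records so that each field becomes one contiguous list."""
    return {key: [r[key] for r in records] for key in records[0]}

soa = to_soa(aos)
print(soa["x"])  # [1.0, 3.0, 5.0]

# The same operation applied across one field of the whole dataset.
scaled_x = [2.0 * x for x in soa["x"]]
print(scaled_x)  # [2.0, 6.0, 10.0]
```

The regrouping is extra work the developer takes on up front, which is part of why the table rates GPU software development as more complex.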
Summary/Follow-Up
● CPUs are the optimal choice for one set of problems and GPUs are
the optimal choice for another set of problems
● Can’t use a single processor type
● Need to use both in a complete system
○ Even a GPU-based system needs file transfer, network operations, etc., which
are ideally suited for a CPU
● Follow-Up
○ How to implement a simple algorithm on an Nvidia GPU using CUDA C
■ Discuss the challenges that are usually associated with such a task
● Data structure
● Core interactions
● Data transfer from system memory to GPU memory
■ CUDA C ⇒ Extension of the C language to support optimal operations on
an Nvidia GPU
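
As a preview of the follow-up topic, the sketch below emulates in plain Python how a simple algorithm (vector addition) is usually split into one small function per element. It is not CUDA C and runs sequentially on the CPU. The `launch` and `vector_add_kernel` names are hypothetical, and the index formula mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by one emulated 'thread': handles a single element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(out):                         # guard: last block may overshoot
        out[i] = a[i] + b[i]

def launch(kernel, n, block_dim, *args):
    """Emulate a 1D launch sequentially; a GPU would run these in parallel."""
    num_blocks = (n + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(a)
launch(vector_add_kernel, len(a), 2, a, b, out)
print(out)  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

With 5 elements and 2 threads per block, 3 blocks (6 threads) are launched, so the bounds check keeps the sixth thread from writing past the end. In a real CUDA C program the input arrays would also have to be copied from system memory to GPU memory before the launch and the result copied back afterwards.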
