This document compares GPUs and CPUs for parallel processing. It explains that GPUs were designed for graphics applications which are inherently parallel, allowing them to have many thousands of cores optimized for floating point operations and fast data transfer. CPUs were designed for sequential applications and have limited cores with slower floating point performance and data transfer. While CPUs can perform more complex operations, GPUs are better suited for problems that can be broken down into the same algorithm across large datasets due to their highly parallel architecture. Both CPU and GPU are needed in complete systems, with each optimized for different types of tasks.
3. Overview
● GPUs blow CPUs out of the water when it comes to raw processing
horsepower for a specific set of problems
○ “Specific”: Where computation can be whittled down to a singular algorithm that
can be applied across a wide dataset
● Why?
○ The original purpose of CPUs (and their resulting design) has led to this limitation
○ The original purpose of GPUs (and their resulting design) has made them ideal
for use in this particular application
4. Processor Trends
● For roughly 20 years, the driving factor in processor design was
performance
○ Processor design was targeted at providing more features and functionality to
users
○ Processor design was driven by increasing clock rates
■ “CPU arms race”
○ Fundamental CPU architecture was developed to maximize the responsiveness
of a single application run by a single user
● Since 2003, reduction in size of computational devices has shifted
focus from raw processing to energy consumption and heat dissipation
○ Battery life!
○ Resulted in vendors shifting focus from pure clock rate to the number of “cores” in
a processor
■ Core = processing element
5. CPU Design
● Traditionally, most CPU software was developed to behave in a
sequential manner
○ Before the advent of multiple cores that can operate in true parallel fashion,
either:
■ SW had to play tricks to make it seem that multiple applications were being
executed in parallel (relying on increasing CPU clock rates)
■ HW enhancements to make sequential processing “look” parallel (i.e.
pipelining)
● With increasing number of cores that can truly run in parallel on a
single CPU silicon die, SW developers have had to rethink app
development
○ Emphasis has been placed on parallel programs
○ But parallel development is not new!
○ Programs that truly run in parallel have been developed for decades
■ High performance computing applications
■ Run on expensive, dedicated HW
6. CPU Design
● The fundamental architecture of CPUs has limited the number of cores
that can exist on a single silicon die
○ Premise of CPU architecture was to (originally) optimize responsiveness of a
single application executed by a single user
○ HW design to support true parallel behavior has had to be “shoehorned”, limiting
the number of cores that is attainable
■ Maximum number of CPU cores ⇒ ~10ish
● Nature of original CPU architectures has required additions to support
efficient floating point operations
○ Again, because there was originally no need to perform floating point
operations efficiently
○ Required additions to the Instruction Set Architecture (ISA), and in turn
modifications to the underlying HW
○ Another alternative was to add a dedicated controller in the processor for floating
point operations (i.e. FPU)
7. CPU Design
● Why can’t more cores be easily included in CPU designs?
○ Over the past ~17 years (since 2003), the number of HW cores in CPU designs
has grown from 1 to roughly 10–20
○ Limited by the original CPU architecture, since more silicon “real estate” was
devoted to:
■ The control logic to transfer instructions and data to the core
■ The processor cache, to avoid repeatedly fetching frequently used
instructions and data from memory
■ Goal was/has been to keep instruction and data access latencies to a
minimum
○ Unfortunately, there is less real estate available for the actual processing cores
● Transfer of data has been another issue
○ Again, because the original problem that CPUs were meant to solve didn’t involve
a significant amount of data
○ Data transfer speed is another bottleneck for faster parallel processing
8. (Simplified) CPU Design
* Fewer resources devoted to “actual” processing (i.e. cores)
[Diagram: simplified CPU die — large Control and Cache blocks dominate the silicon, alongside four Core blocks]
9. GPU Design
● GPUs were originally (and still are) designed for graphics intensive
applications
● Graphics applications are inherently parallel in nature
○ Each pixel is (usually) independent of another pixel
○ The same operations are (usually) performed on each pixel
○ Each frame usually consists of 100k to 1M+ pixels
● Because of the nature of the problem that GPUs were originally meant
to solve, they have become ideal candidates for highly parallel,
non-graphics applications
○ Machine Learning
○ Artificial Intelligence
○ Data Science
10. GPU Design
● The fundamental problem that GPUs were meant to solve has allowed
for many more cores to be easily added over the years
○ Don’t really care about responsiveness of a single application but rather the
overall execution throughput
■ Gamer doesn’t care about how long it takes for a particular pixel to be
rendered, but rather an entire frame
■ A video editor doesn’t care about how long it takes for a particular pixel (or
even frame) to be processed, but rather an entire video
○ “Manycore” computing device vs CPU-based “multi-core” computing device
■ Manycore: 10k, 100k, 1M cores
■ Multi-core: Single, double-digit cores
● Nature of graphics applications resulted in native support for fast
floating point operations in GPUs
○ Ray-tracing, 2D, 3D graphics inherently must be done using floating point
numbers
○ HW was designed to support optimal floating point operations
11. GPU Design
● Due to original problem that GPUs were meant to solve, adding more
cores is much easier than on a CPU
○ The number of cores in a GPU has grown by orders of magnitude over
successive generations
○ GPU architecture allows for fast execution of instructions on a large dataset in
parallel
○ More silicon “real estate” devoted to the processing cores themselves vs control
logic to transfer instructions and data
● GPU Architecture was developed to allow for transfer of large datasets
○ Graphics processing involves transferring a ton of data at once (e.g. individual
frame of pixels)
○ Memory was optimized to NOT be a bottleneck
13. Comparison
| Category | CPU | GPU |
| --- | --- | --- |
| Number of cores | Few (10s, maybe 100s) | Many (10k, 100k, 1M) |
| Capability of each core | Can perform more complex operations | Can perform simpler operations |
| Floating point support | Added later (either via modifications to the ISA or with a dedicated FPU) | Native support in each computation core |
| Memory transfer | Slower and much more frequent (can use cache to alleviate this) | Faster and much less frequent (usually a large dataset is transferred between system memory and GPU memory “at once”) |
| SW development effort | Simpler | Complex (the dataset must be structured a certain way, and SW must be written a particular way to leverage the HW) |
14. Summary/Follow-Up
● CPUs are the optimal choice for one set of problems and GPUs are
the optimal choice for another set of problems
● Can’t use a single processor type
● Need to use both in a complete system
○ Even in a GPU-based system, file transfers, network operations, etc. are
needed, and those tasks are ideally suited to a CPU
● Follow-Up
○ How to implement a simple algorithm on an Nvidia GPU using CUDA C
■ Discuss the challenges that are usually associated with such a task
● Data structure
● Core interactions
● Data transfer from system memory to GPU memory
■ CUDA C ⇒ An extension of the C language for writing programs that run
efficiently on an Nvidia GPU