1
PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING
MODERN COMPUTER ARCHITECTURE CONCEPTS
Created by Marina (geek) Kolpakova for UNN / 2015-2016
2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architecture concepts
3
OUTLINE
Three aspects of the computer architecture
Latency vs Throughput architectures
Architecture families
CISC
RISC
VLIW
Vector
Why did it end up being load/store?
Latest trends
Summary
4 . 1
1ST ASPECT OF COMPUTER ARCHITECTURE
Instruction Set Architecture or ISA (interface)
is a contract between HW and SW,
which specifies rights, possibilities & limitations.
Class of ISA (load-store, register-memory)
Memory addressing modes & rules (base-immediate,
alignment requirements)
Types & sizes of operands (size of byte, short)
Operations (general arithmetic, control, logical)
Control flow instructions (branches, jumps, calls, returns)
Encoding of an ISA (fixed or variable length)
All the conceptual aspects of the architecture
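The operand sizes and alignment rules listed above are part of the contract the compiler targets. A minimal C sketch, assuming a C11 compiler; the printed values are target-dependent (e.g. ILP32 vs LP64), not universal:

#include <stdio.h>
#include <stdalign.h>

/* Operand sizes and alignment are fixed by the ISA/ABI the compiler
   targets; different targets print different values. */
int main(void) {
    printf("short: size %zu, align %zu\n", sizeof(short), alignof(short));
    printf("int  : size %zu, align %zu\n", sizeof(int),   alignof(int));
    printf("long : size %zu, align %zu\n", sizeof(long),  alignof(long));
    printf("void*: size %zu, align %zu\n", sizeof(void*), alignof(void*));
    return 0;
}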
4 . 2
2ND ASPECT OF COMPUTER ARCHITECTURE
Microarchitecture (organization) is a concrete
implementation of the ISA, the high-level aspects of a
processor design (memory system, memory interconnect,
design of the processor internals).
Pipeline width
Instruction latencies
Issue width and scheduling
Speculation capabilities
All the concrete aspects of the architecture
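Instruction latencies and issue width are properties of the microarchitecture, not of the ISA, but they show up in code. A hedged C sketch (time the two functions with your favourite harness; the actual numbers depend entirely on the concrete core):

/* A dependent chain is bound by instruction latency: each iteration
   must wait for the previous result. */
long dep_chain(long n) {
    long a = 1;
    for (long i = 0; i < n; ++i)
        a = a * 3 + i;                 /* depends on the previous value of a */
    return a;
}

/* Independent chains can be issued in parallel, up to the issue width
   of the core, so throughput rather than latency dominates. */
long indep_chains(long n) {
    long a = 1, b = 2, c = 3, d = 4;
    for (long i = 0; i < n; ++i) {
        a = a * 3 + i;
        b = b * 3 + i;
        c = c * 3 + i;
        d = d * 3 + i;
    }
    return a + b + c + d;
}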
4 . 3
3RD ASPECT OF COMPUTER ARCHITECTURE
Hardware or chip (design) is the specifics of a computer,
including the logic design and packaging. This is a concrete
implementation of the microarchitecture.
Process technology (node)
Clock rates
On-die placement
All the properties of the chip
5 . 1
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Cortex-A53 | ARM | Octa-core Exynos 7 (7580), 1.6 GHz, 28nm HKMG
Cortex-A57 | ARM | Octa-core Exynos 7 (7420), big.LITTLE 2.1/1.5 GHz, 14nm FF (LPE) (Samsung)
Cortex-A72 | ARM | Deca-core MediaTek Helio X20, big.LITTLE 2.5/2.0/1.4 GHz, 20nm HKMG (TSMC)
Cortex-A35 | ARM | -
5 . 2
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Denver | NVIDIA | Dual-core Tegra K1, 2.3 GHz, 28nm HPM
Kryo | Qualcomm | Quad-core S820, big.LITTLE 2.2/1.6 GHz, 14nm FF (LPP) (Samsung)
Exynos M1 | Samsung | Quad-core Exynos 8890, big.LITTLE 2.6/2.29 GHz, 14nm FF (LPP) (Samsung)
5 . 3
ARCHITECTURE: ARMV8-A
μarch | IP | Hardware
Cyclone | Apple | Dual-core A7 (APL0698), 1.4 GHz, 28nm HKMG (Samsung)
Typhoon | Apple | Dual-core A8 (APL1012), 1.3 GHz, 20nm HKMG (TSMC)
Twister | Apple | Dual-core A9 (APL0898), 1.85 GHz, 16nm FF+ (TSMC)
Twister | Apple | Dual-core A9 (APL1022), 1.85 GHz, 14nm FF (LPP) (Samsung)
6 . 1
LATENCY VS THROUGHPUT ARCHITECTURES
Latency oriented architecture
addresses latency hiding issues;
features sophisticated pipelining and out-of-order execution;
employs advanced cache hierarchies;
widely uses speculation.
Compute cores occupy only a small part of a die.
6 . 2
LATENCY VS THROUGHPUT ARCHITECTURES
Throughput oriented architecture
keeps a bunch of operations in flight;
features many simple compute units/cores;
employs simple pipelines and a large register file to provide low-cost thread scheduling;
uses wide buses, tiling, programmable local memory.
Compute cores occupy most of the die.
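A hedged C illustration of the two workload shapes these designs target: a pointer chase is dominated by memory latency (each load depends on the previous one), while a streaming sum exposes independent work that a throughput-oriented design can overlap.

#include <stddef.h>

struct node { struct node *next; long value; };

/* Latency-bound: the address of each load is only known after the previous
   load completes, so latency hiding (caches, speculation) is what helps. */
long chase(const struct node *p) {
    long sum = 0;
    while (p) { sum += p->value; p = p->next; }
    return sum;
}

/* Throughput-bound: all loads are independent, so wide buses and many
   simple lanes/threads can keep a lot of them in flight. */
long stream(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i) sum += a[i];
    return sum;
}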
7
KEY ARCHITECTURE FAMILIES
RISC
Reduced Instruction Set Computer
CISC
Complex Instruction Set Computer
VLIW
Very Long Instruction Word
Vector architecture
8 . 1
CISC
Complex Instruction Set Computer
Designed in the 1970s, a time when transistors were expensive and compilers were naive. Additionally, instruction packing was the main concern due to the shortage of memory. Memory latency was only a bit higher than register latency.
The goal was to define an instruction set that allows high-level language constructs to be translated into as few assembly instructions as possible, improving performance as well as code density.
Examples are VAX, x86, AMD64.
Latency-oriented architecture.
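A hedged sketch of the code-density idea: on a register-memory CISC ISA a compiler can fold the load, add and store below into a single instruction, while a load-store ISA needs three. The instruction sequences in the comments are typical, not the output of any particular compiler.

/* C source */
void bump(int *a, long i) {
    a[i] += 1;
}

/* Typical x86-64 (CISC, register-memory): one instruction does it all:
       add dword ptr [rdi + rsi*4], 1
   Typical AArch64 (RISC, load-store): three instructions:
       ldr w2, [x0, x1, lsl #2]
       add w2, w2, #1
       str w2, [x0, x1, lsl #2]
*/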
8 . 2
CISC
Heritage
Instructions access memory; plenty of addressing modes, many instruction families and a very rich variable-length ISA (alignment counts!); consequently, complicated instruction decoding logic. Moreover, few registers are available to programmers.
Nowadays
1. Instructions are broken down into μops, which are much easier to pipeline and process power-efficiently.
2. Transistors are spent on cache hierarchies, out-of-order execution, large reorder buffers and speculation to eliminate stalls.
3. Symmetric multi-processing.
9 . 1
RISC
Reduced Instruction Set Computer
Designed in the 1980s, a time when ILP was the great concern. The memory-processor gap had already begun to show.
The goal was to decrease the number of clocks per instruction (CPI) while pipelining instructions as much as possible, employing hardware to help with it. A uniform ISA, pipelining and a large register file are must-haves.
Examples are MIPS, ARM, PowerPC.
Latency-oriented architecture.
9 . 2
RISC
Heritage
Relatively few instructions, all of the same length.
Only load and store instructions access memory.
A larger register file than typical CISC processors have.
No μcode.
Nowadays
Most architectures that descend from RISC are called load-store architectures, and they may employ μops. They combine the concepts of classic RISC with modern hardware enhancements:
1. deep pipelines, multi-cycle instructions,
2. out-of-order execution,
3. speculation.
10
THE HARDWARE/SOFTWARE GAP
Compiler
analyzes control flow, analyzes dependencies
schedules instructions
maps variables to a limited register set
Hardware
analyzes control flow, analyzes dependencies
schedules instructions
remaps ISA registers to a large internal register set
11
A WORD TOWARDS REGISTERS
Indeed, registers are temporary storage locations inside the processor that hold data and addresses.
Local variables are not the same as ISA registers, since the compiler uses an IR internally and does register allocation close to the end of the optimization process.
Registers provided by the ISA are not the same as the actual registers on the processor. Internal reorder buffers, which hold decoded instruction parameters and intermediate results, are closer to the classic definition of a register file.
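A small, hedged way to see this in practice: compile the function below with `gcc -O2 -S` (or the equivalent for your toolchain). The locals never get a fixed home; the compiler's register allocator assigns ISA registers late, and the hardware may rename those ISA registers again onto its larger physical register file.

/* Locals are just values to the register allocator; whether they live in
   ISA registers or spill to the stack is decided near the end of compilation. */
long dot3(const long *a, const long *b) {
    long s0 = a[0] * b[0];
    long s1 = a[1] * b[1];
    long s2 = a[2] * b[2];
    return s0 + s1 + s2;
}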
12 . 1
VLIW
Very Long Instruction Word
Designed in the 1980s, a time when ILP was the great concern.
The goal was to pipeline instructions as much as possible, employing software to help with it, reducing the complexity of the hardware and mitigating the hardware/software gap. Boost the processor clock by simplifying the work done per cycle.
Example is Intel/HP Itanium.
Throughput-oriented architecture.
12 . 2
VLIW
Heritage
The compiler determines which instructions can be performed in parallel, bundles this information with the instructions, and passes the bundle (word) to the hardware.
No data dependencies between instructions in a word.
Each operation in a word is assigned to a specific issue slot (dedicated FU).
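A hedged sketch of what the compiler looks for when forming a word: the first three statements below are mutually independent and could occupy separate issue slots of one word; the final expression depends on all of them and must go into a later word. The slot assignment in the comments is purely illustrative.

long combine(long a, long b, long c, long d) {
    long x = a + b;        /* candidate for slot 0 (ALU) */
    long y = c * d;        /* candidate for slot 1 (MUL) */
    long z = a - d;        /* candidate for slot 2 (ALU) */
    return x + y + z;      /* depends on x, y, z -> later word */
}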
12 . 3
VLIW
Nowadays
Hardly any general-purpose processor implements VLIW due to:
the branchy nature of production code (in contrast to HPC or scientific code),
the need to maintain binary compatibility across μarchitecture families.
However, the architecture is widely adopted for programmable co-processors where cutting power consumption without loss of performance is crucial (DSPs, GPUs).
13 . 1
VECTOR PROCESSORS
First introduced in 1976 and dominant in HPC in the 1980s because of high instruction throughput.
The goal was to perform operations on vectors of data, exposing data-level parallelism (DLP) to increase instruction throughput. Vector pipelining is also called chaining.
Example is Cray.
Throughput-oriented architectures.
13 . 2
VECTOR PROCESSORS
Heritage
Processes data in vectors; each element of a vector (lane) is independent of any other.
Deep pipelines and wide execution units, not necessarily of the same width (batch length) as the size of the vector in elements (vector length).
Most efficient for simple memory access patterns, but gather/scatter is usually possible too.
Wide memory interfaces to saturate the execution units.
Large vector register file; a cache is not a strict requirement and is absent in classical vector processors.
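A hedged C sketch of data-level parallelism: in the first loop every lane is independent, so it maps directly onto vector operations; the second loop needs gather support because the loaded addresses are data-dependent.

/* Independent lanes: c[i] never depends on c[j] for i != j. */
void vec_add(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Indexed access: vectorizable only if the machine supports gather. */
void vec_gather_add(float *c, const float *a, const int *idx, int n) {
    for (int i = 0; i < n; ++i)
        c[i] += a[idx[i]];
}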
13 . 3
VECTOR PROCESSORS
Nowadays
They aren't used in general-purpose processor designs, but serve as co-processors for specific workloads: HPC, multimedia.
Precursors of most modern GPU designs.
Vector pipes with a short vector length (8-16 bytes), called SIMD units, are widely integrated into modern general-purpose processors to accelerate the most demanding loops.
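A hedged example of programming such a short-vector SIMD unit directly, here with the SSE intrinsics available on x86 (other ISAs expose analogous intrinsics, e.g. NEON on ARM); compilers can often produce the same code from the plain loop via auto-vectorization.

#include <stddef.h>
#include <immintrin.h>   /* SSE intrinsics, x86 only */

/* Adds four floats per instruction using 128-bit SSE registers,
   with a scalar tail for the leftover elements. */
void add_f32(float *c, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)        /* scalar tail */
        c[i] = a[i] + b[i];
}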
14 . 1
WHY DID IT END UP BEING RISC LOAD/STORE?
1. Simple fixed-width instructions & few addressing modes
Cache-efficient instruction fetch; branches are aligned.
Simple hardware logic → power-efficient chips.
Drive a higher clock rate.
2. Concise ISA with orthogonal functionality
Complex instructions are ignored by compilers due to the semantic gap → simple instructions simplify scheduling.
Complex addressing leads either to variable-length instructions or to a big instruction size → inefficient decoding and scheduling as well as alignment issues.
3. Large register set
Exposes possible instruction parallelism to the compiler.
15
LATEST TRENDS
The architecture is seen as load-store, RISC-inherited.
Internally, instructions are broken down into single-pipe μops.
μops are reordered and optionally organized into words.
μops or words are scheduled for execution; caching at the highest level is usually performed on this preprocessed form.
The latest generations of Intel processors, the NVIDIA Denver architecture and 64-bit ARM Cortex-A processors already employ this approach.
16 . 1
SUMMARY
There are three key aspects of computer architecture:
Instruction Set Architecture, μarchitecture and design.
Some architectures aim to hide latency while others aim to
maximize instruction throughput.
CISC was created for compact code size and dense instruction encoding, and nowadays survives only at the ISA level.
RISC leads to less complicated decoding and pipeline stages, allowing the clock to be boosted within an affordable power budget.
VLIW targets power-efficient high-performance devices for specific tasks, or is used internally at the μarchitecture level.
Vector processors have transformed into SIMD extensions and SIMT-like GPU designs.
16 . 2
SUMMARY
Load-store architectures, with their simple fixed-width instructions, few addressing modes, concise ISA and optimally sized register file, are the winning solution.
An architecture can expose different properties at its different levels (ISA, μarchitecture).
17
THE END
MARINA KOLPAKOVA / 2015-2016
