1. Kalray SA. Confidential - All Rights Reserved. 1
www.kalrayinc.com
IF-CONVERSION FOR A PARTIALLY
PREDICATED VLIW ARCHITECTURE
Benoît Dupont de Dinechin, CTO
GNU Cauldron 2022
2. Kalray SA. Confidential - All Rights Reserved. 3
AGENDA
1. Kalray MPPA Processor and KVX Core
2. KVX Code Generation Features
3. GCC IF-Conversion Framework
4. Extending GCC IF-Conversion
5. First Results and Outlook
3. Kalray SA. Confidential - All Rights Reserved. 4
KALRAY DPU-BASED ACCELERATION CARD
4. Kalray SA. Confidential - All Rights Reserved. 5
MPPA® COOLIDGE V1
Block Diagram & Feature List
80 VLIW Application Cores
• 64-bit/32-bit 6-issue VLIW core
• From 600 MHz to 1.2 GHz
• 16KB I/D cache with MMU
• IEEE 754 FP16, FP32, FP64 FPU
• Up to 256 bits per cycle Load/Store
80 Tensor Co-processors (per core)
• INT8.32, INT16.64, FP16.32
• Up to 128 MAC equivalent per cycle
Compute Clusters (5)
• +1 Management/Security Core
• 4 MB of Memory / L2 Cache
• 600GB/s bandwidth
2x100GbE Ethernet Interface & Mger
• 8x1/8x10/8x25/2x40/4x50/2x100 GbE
• Jumbo Frame Support (9.6KB)
• Support for PTP/IEEE 1588v2
• Priority Flow Control (PFC), IEEE 802.1Qbb
• Checksum offload Header & Payload
• Hash & Round-robin dispatcher
Security
• Secure boot with authentication & encryption
• TRNG, RSA, Diffie-Hellman, DSA, ECC, EC-DSA and EC-DH acceleration
PCIe Gen4 Interface
• 16-lane PCIe Gen4 Endpoint (EP) or Root Complex (RC)
• Bifurcation up to 8 downstream ports in RC mode
• SR-IOV up to 8 PF / 248 VF
• Address translation and protection
• Up to 2048 MSI-X & 64 MSI
• Support for Hot Plug
• Up to 512 DMAs for multi queues / kernel bypass drivers
• Direct PCIe-to-clusters and PCIe-to-DDR transfers
LPDDR4/DDR4 Interface
• 64-bit DDR4/LPDDR4-3200 channels with sideband/inline ECC
• Up to two ranks per DDR4 Channel
• 2 DDR channels (up to 32GB) with channel interleaving
Cryptography Accelerators (optional)
• AES-128/192/256 (ECB/CBC/ICM/CTR/GCM/GMAC/CCM), AES-XTS, MD5/SHA-1, SHA-2, SHA-3, KASUMI/SNOW 3G, ZUC
5. Kalray SA. Confidential - All Rights Reserved. 6
MPPA® PROCESSORS
Product Family
ANDEY (2012, prototyping)
• Process: 28 nm
• Performance: 1 TOPS
• Use cases / market: Prototyping
• Consumption: 25 W
BOSTAN (2015, production)
• Process: 28 nm
• Performance: 1.3 TOPS
• Use cases / market: 40G Data Center, Auto Prototypes
• Consumption: 25 W
COOLIDGE v1 (2020, available)
• Process: 16 nm
• Performance: 20 TOPS (1), 4 TFLOPS (2), 190 KDMIPS
• Use cases / market: Data Center / Edge, Automotive (proto)
• Consumption: 20 W (4)
COOLIDGE v2 (1H 2023, under development)
• Process: 16 nm
• Performance: 50 TOPS (1), 25 TFLOPS (2), 190 KDMIPS
• Use cases / market: Data Center / Edge, Automotive (proto)
• Consumption: 20 W (4)
DOLOMITES (3) (2025, under specification)
• Process: 6/5 nm
• Performance: 200 TOPS (1), 100 TFLOPS (2), 380 KDMIPS
• Use cases / market: Data Center, Edge Computing, 5G
• Consumption: 20 W (4)
(1) INT8.32
(2) FP16.32
(3) Initial target – may change
(4) 50 W maximum compute workload
6. Kalray SA. Confidential - All Rights Reserved. 7
VERY LONG INSTRUCTION WORD (VLIW) ARCHITECTURES
Compiler-driven instruction-level parallel execution
Simple, energy-efficient, time-predictable implementations
CLASSIC VLIW ARCHITECTURE (J. A. FISHER)
Key architecture features
• SELECT operation on Boolean value
• Conditional load/store/FPU operations
• Dismissible loads (non-trapping)
• [Multi-way conditional branches]
Key compiler techniques
• Trace scheduling (global instruction scheduling)
• Partial predication (S. Freudenberger if-conversion)
Main examples
• Multiflow TRACE processors
• HP Labs Lx « Embedded Computing: a VLIW Approach »
• STMicroelectronics ST200 (media processor based on Lx)
EPIC VLIW ARCHITECTURE (B. R. RAU)
Key architecture features
• Fully predicated ISA
• Speculative loads (control speculation)
• Advanced loads (data speculation)
• Rotating registers
Key compiler techniques
• Modulo scheduling (software pipelining)
• Full predication (R-K algorithm, J. Fang algorithm)
Main examples
• Cydrome Cydra-5
• HP-Intel IA64
• TI C6x DSPs
7. Kalray SA. Confidential - All Rights Reserved. 8
MPPA COOLIDGE 64-BIT VLIW CORE
VLIW CORE PIPELINE
Kalray VLIW (KVX) architecture is co-designed to appear as an in-order superscalar to compilers
• Every scheduler parallel instruction group is a valid bundle
• No need for vertical or horizontal no-op padding
Vector-scalar ISA
• 64x 64-bit general-purpose registers
• Operands can be single registers, register pairs (128-bit) or register quadruples (256-bit)
• 128-bit/256-bit SIMD instructions by dual-issuing/quad-issuing 64-bit instructions on the ALUs or by using the FPU data-path
DSP capabilities
• Counted or while hardware loops with early exits
• Non-temporal loads (L1 cache bypass / preload)
• Non-trapping memory loads (faulting bytes return 0)
CPU capabilities
• 4 privilege levels (rings), MMU (runs Linux kernel)
• Recursive ISA virtualization (Popek & Goldberg)
8. Kalray SA. Confidential - All Rights Reserved. 9
AGENDA
1. Kalray MPPA Processor and KVX Core
2. KVX Code Generation Features
3. GCC IF-Conversion Framework
4. Extending GCC IF-Conversion
5. First Results and Outlook
9. Kalray SA. Confidential - All Rights Reserved. 10
ACCESSCORE® SOFTWARE DEVELOPMENT KIT
A Complete Toolchain & Standard Libraries
[Diagram: AccessCore® components for a seamless integration]
• Standard programming environment (C/C++/OpenCL)
• Operating systems & libraries (Linux / ClusterOS)
• Deep learning, mathematics, computer vision
• KAF™: software framework for offloading numerical, signal and image processing
• KaNN™: CNN inference code generator compatible with standard CNN frameworks (Caffe, …)
• 3rd-party OS: RTOS
• 3rd-party tools: model-based development
• AccessCore® SDK
• AccessCore® Runtime
• Optimized libraries
• Compiler, simulator, debugger & system trace
• Technologies listed in the diagram: POSIX threads, OpenMP; OpenCL, Eclipse; GCC, GDB, LLVM, QEMU; SLEEF, SIMDe; OpenCV, BLAS, LAPACK; CNN inference code generation; exokernel, open source; POSIX RTOS, Linux, communication libraries
10. Kalray SA. Confidential - All Rights Reserved. 11
C/C++ COMPILER SUPPORT OF KVX CORE
GCC 10 for lightweight POSIX OS and for Linux (all TLS models)
• Mapping of hardware loops using GCC doloop patterns
• Sched2 does instruction scheduling and instruction bundling
• Most high-gain optimizations apply (such as auto-vectorization)
• On-going developments in software pipelining (derived from C6x)
11. Kalray SA. Confidential - All Rights Reserved. 12
CACHE BYPASS LOADS AND NON-TRAPPING LOADS
Reuse of the GCC named address spaces (not available in C++)
12. Kalray SA. Confidential - All Rights Reserved. 13
EXPLOITATION OF THE VECTOR-SCALAR ARCHITECTURE
128-bit and 256-bit vectors are operated and passed as 64-bit register pairs and quadruples
• Full support of the GCC vector syntax extensions
• Align vectors on register pair/quad on ABI boundaries
• SIMD lane splatting and shuffling rely on the BMM8
(8x8 Bit-Matrix Multiply) operations, exposed as
SBMM8 instructions (swapped operands)
• To improve register allocation, vector instructions are
kept as machine instruction pairs or quadruples until
after register allocation
• Use of partial instruction bundles in output templates,
with suitable scheduling type
13. Kalray SA. Confidential - All Rights Reserved. 14
SIMDE EMULATION OF X86 BUILTINS
SIMDe translates the x86 builtin functions into native calls on x86 (SIMDE_X86_SSSE3_NATIVE) and plain C code on other architectures (SIMDE_VECTORIZE)
The Kalray port of SIMDe provides an optimized translation on KVX using the GCC/LLVM KVX builtin functions (SIMDE_KVX_NATIVE)
14. Kalray SA. Confidential - All Rights Reserved. 15
KVX CONDITIONAL
BRANCH TEMPLATES
KVX condition codes live in the
general-purpose registers
The "cstore<m>" standard pattern
produces 0 or 1
(STORE_FLAG_VALUE)
The "*cbsi" pattern matches a
conditional branch depending on the
comparison (EQ, NE, LE, LT, GE,
GT) of a source register to zero
The KVX can compare two integer or
floating-point values and instruction
variants negate the result (0 or -1)
15. Kalray SA. Confidential - All Rights Reserved. 16
KVX CONDITIONAL MOVE
TEMPLATES
Conditional moves can be produced
in two ways
The "*cmovsi.df" SET source is an
IF_THEN_ELSE that relies on a
zero_comparison_operator
Genconfig outputs in insn-config.h
#define HAVE_conditional_move
The "*cond_exec_movedf" pattern is
a COND_EXEC wrapping of a simple
SET expression
Genconfig outputs in insn-config.h
#define HAVE_conditional_execution
16. Kalray SA. Confidential - All Rights Reserved. 17
KVX CONDITIONAL LOAD
AND STORE TEMPLATES
Load instructions format sub-64-bit
values with zero or sign extension
Plain load and store addressing
modes include [reg], offset[reg],
reg[reg], reg*size[reg]:
Conditional load and store
addressing modes are restricted to
[reg], offset[reg]:
17. Kalray SA. Confidential - All Rights Reserved. 18
AGENDA
1. Kalray MPPA Processor and KVX Core
2. KVX Code Generation Features
3. GCC IF-Conversion Framework
4. Extending GCC IF-Conversion
5. First Results and Outlook
18. Kalray SA. Confidential - All Rights Reserved. 19
GCC AUTOMATED
COND_EXEC TEMPLATES
GCC can automate the writing of
COND_EXEC instruction templates
with the (define_cond_exec) template
and the "predicable" attribute
The (define_insn) templates with
(eq_attr "predicable" "yes") have their
RTL template wrapped into a
COND_EXEC with the condition
supplied by the (define_cond_exec)
The output template of the resulting
instructions is prefixed by the output
template of the (define_cond_exec)
Custom output may use the
current_insn_predicate RTX
19. Kalray SA. Confidential - All Rights Reserved. 20
GCC IF-CONVERSION
OVERVIEW (1)
Enabled with -fif-conversion and
-fif-conversion2
Three passes:
• CE1 before combine
• CE2 after combine
• CE3 after reload
Information about the if-conversion
region is passed with a ce_if_block
structure
Top level (if_convert) iterates over
if-conversion region header blocks
by calling (find_if_header)
20. Kalray SA. Confidential - All Rights Reserved. 21
GCC IF-CONVERSION
OVERVIEW (2)
(find_if_header)
• Fill the ce_if_block structure
• Call IFCVT_MACHDEP_INIT
• Before reload (CE1 and CE2),
call (noce_find_if_block)
• After reload (CE3) and if target
has conditional execution, call
(cond_exec_find_if_block)
Default target hook for TARGET_HAVE_CONDITIONAL_EXECUTION
returns HAVE_conditional_execution
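The dispatch described on this slide can be pictured with a small toy model. This is only a sketch of the decision as stated above, not GCC source code: the enum and function names are hypothetical, and the real (find_if_header) tries further transformations that are left out here.

```c
#include <stdbool.h>

/* Hypothetical names: a toy model of the slide, not GCC source code. */
enum toy_pass { TOY_CE1, TOY_CE2, TOY_CE3 };
enum toy_strategy { TOY_NONE, TOY_NOCE, TOY_COND_EXEC };

/* NOCE path before reload (CE1, CE2); COND_EXEC path after reload (CE3)
   when the target reports that it has conditional execution. */
enum toy_strategy toy_choose_strategy(enum toy_pass pass, bool have_cond_exec)
{
    bool reload_completed = (pass == TOY_CE3);

    if (!reload_completed)
        return TOY_NOCE;        /* stands for (noce_find_if_block) */
    if (have_cond_exec)
        return TOY_COND_EXEC;   /* stands for (cond_exec_find_if_block) */
    return TOY_NONE;            /* neither path from the slide applies */
}
```

In this model, a target that answers "no" to the conditional-execution query in CE3 simply gets no COND_EXEC processing, which is why the KVX port controls that answer through its own target hook.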
21. Kalray SA. Confidential - All Rights Reserved. 22
GCC IF-CONVERSION
OVERVIEW (3)
(noce_find_if_block)
• Determine the if-conversion
region: IF-THEN-ELSE-JOIN or
IF-THEN-JOIN or IF-ELSE-JOIN
• First try without, then with, using
conditional moves
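As a rough source-level illustration of what the NOCE path aims for, the sketch below shows an IF-THEN-ELSE-JOIN region and an equivalent branch-free form built around a select. The function names are hypothetical, and whether a given compiler actually emits a conditional move for either form depends on the target and options.

```c
#include <stdint.h>

/* Branchy form: an IF-THEN-ELSE-JOIN region that assigns one variable. */
int64_t toy_clamp_branchy(int64_t x, int64_t limit)
{
    int64_t r;
    if (x > limit)      /* TEST block */
        r = limit;      /* THEN block */
    else
        r = x;          /* ELSE block */
    return r;           /* JOIN block */
}

/* If-converted form: the comparison result is an ordinary value and a
   select picks the result. On a target with conditional moves this can
   map to a compare followed by one conditional move, with no branch. */
int64_t toy_clamp_selected(int64_t x, int64_t limit)
{
    int64_t cond = (x > limit);   /* 0 or 1, like a cstore result */
    int64_t r = x;                /* unconditional move */
    r = cond ? limit : r;         /* conditional move */
    return r;
}
```

Both functions return the same value for every input; only the control flow differs.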
22. Kalray SA. Confidential - All Rights Reserved. 23
GCC IF-CONVERSION
OVERVIEW (4)
(cond_exec_find_if_block)
• Identify cases of && tests (jump
to ELSE block) or || tests (jump to
THEN block)
• In case of && or || tests, try to combine them into the conditional expression
• If there is no multiple-test region, or if combining failed, process IF-THEN-ELSE-JOIN etc.
(cond_exec_process_if_block)
• Find common head or tail
sequences in IF-THEN-ELSE-
JOIN
• Dispatch to
(cond_exec_process_insns)
23. Kalray SA. Confidential - All Rights Reserved. 24
GCC IF-CONVERSION
OVERVIEW (5)
(cond_exec_process_insns)
• Process instructions from START
to END, as there can be matching
head and tail sequences in the
THEN and ELSE blocks
• If instruction pattern code is
already COND_EXEC, build a
new condition by ANDing with the
block condition
• Generate COND_EXEC pattern
• Call IFCVT_MODIFY_INSN
which can modify the pattern or
abort if-conversion
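The handling of already-predicated instructions can be sketched with a toy model, assuming a guard is represented as a set of condition bits that must all be true. The struct, field and function names are hypothetical, and the "predicable" flag merely stands in for a target hook that rejects an instruction; this is not how GCC represents RTL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction: 'guard' is a set of condition bits that must all be
   true for the instruction to execute; 0 means unconditional. */
struct toy_insn {
    unsigned guard;
    bool predicable;   /* false models a hook that aborts if-conversion */
};

/* Predicate every instruction in insns[0..n-1] on block_cond.
   Returns false and leaves the array untouched if any insn is rejected. */
bool toy_process_insns(struct toy_insn *insns, size_t n, unsigned block_cond)
{
    for (size_t i = 0; i < n; i++)
        if (!insns[i].predicable)
            return false;

    /* ANDing two guards is the union of their required condition bits. */
    for (size_t i = 0; i < n; i++)
        insns[i].guard |= block_cond;
    return true;
}
```

An instruction that already carried guard bit 2 ends up guarded by both bit 2 and the block condition, which is the "AND with the block condition" step on the slide.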
24. Kalray SA. Confidential - All Rights Reserved. 25
AGENDA
1. Kalray MPPA Processor and KVX Core
2. KVX Code Generation Features
3. GCC IF-Conversion Framework
4. Extending GCC IF-Conversion
5. First Results and Outlook
25. Kalray SA. Confidential - All Rights Reserved. 26
KVX IF-CONVERSION OBJECTIVES AND CONSTRAINTS
PREDICATION OF LOADS AND STORES
Extend GCC conditional execution to the KVX predicated load and store instructions that have addressing mode restrictions
PSEUDO-PREDICATION OF INSTRUCTIONS
Unconditionally compute the original result into a scratch register, then conditionally move the result to the original destination register
SPECULATIVE EXECUTION OF INSTRUCTIONS
Eliminate the need for computing into a scratch register and for the conditional move if the destination is only locally used in the THEN or ELSE block
Focus on scalar instructions, as the GIMPLE auto-vectorization takes care of generating masked vector operations
Complement the if-conversion provided by the standard patterns for conditional operations: mov<m>cc, add<m>cc, neg<m>cc, not<m>cc
No changes to the target-independent GCC code
Can only expose the predicated instructions after register allocation
Unconditional assignments to scratch registers must not clobber registers in use
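The three objectives can be illustrated at the C source level. This is a minimal sketch with hypothetical function names; the real transformation operates on RTL after register allocation, and unsigned arithmetic is used here so that executing the additions unconditionally is well defined in C.

```c
#include <stdint.h>

/* Original IF-THEN-JOIN region. */
uint64_t toy_region_branchy(uint64_t c, uint64_t x, uint64_t a, uint64_t b,
                            uint64_t *p)
{
    if (c != 0) {
        uint64_t t = a + b;   /* t is only used locally in the THEN block */
        x = x + t;            /* x is live after the JOIN */
        *p = t;               /* store to memory */
    }
    return x;
}

/* The same region after the three transformations. */
uint64_t toy_region_converted(uint64_t c, uint64_t x, uint64_t a, uint64_t b,
                              uint64_t *p)
{
    uint64_t t = a + b;           /* speculation: executed unchanged */
    uint64_t scratch = x + t;     /* pseudo-predication: compute into scratch */
    x = (c != 0) ? scratch : x;   /* ... then conditionally move to x */
    if (c != 0)                   /* stands for one predicated store */
        *p = t;
    return x;
}
```

The remaining `if` in the converted version models a single predicated store instruction rather than a branch. The scratch variable is the reason for the last constraint above: it is written on both paths, so it must not share a register with anything still in use.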
26. Kalray SA. Confidential - All Rights Reserved. 27
KVX IF-CONVERSION
OVERVIEW (1)
Implemented with four target hooks
• MAX_CONDITIONAL_EXECUTE
• IFCVT_MACHDEP_INIT (kvx.h)
called in CE1, CE2, CE3 from
(find_if_header)
• IFCVT_MODIFY_INSN (kvx.h)
called in CE3 from
(cond_exec_process_insns)
• TARGET_HAVE_CONDITIONAL_EXECUTION (kvx.c) called in
CE1, CE2, CE3
In combination with COND_EXEC
patterns and helper patterns in the
kvx .md files
27. Kalray SA. Confidential - All Rights Reserved. 28
KVX IF-CONVERSION
OVERVIEW (2)
(kvx_ifcvt_machdep_init)
• Let CE1 and CE2 do if-conversion
without conditional execution
• In CE2, prepare for CE3, focusing
on IF-THEN-ELSE-JOIN, IF-
THEN-JOIN, IF-ELSE-JOIN
regions identified with same logic
as in (noce_find_if_block)
The idea is to insert USEs and
pseudo-DEFs in CE2 so that the
CE3 if-conversion will have the spare
hard registers it needs for pseudo-
predication and speculation
28. Kalray SA. Confidential - All Rights Reserved. 29
KVX IF-CONVERSION
FIND CANDIDATES (1)
(kvx_ifcvt_ce2_candidate_ce3)
• Scan the non-jump instructions
• Bail out on a complex instruction or an instruction with side effects
• Try conditional moves (with COND_EXEC); if this fails, the instruction gets a second chance as arithmetic
• Try conditional memory accesses
(irrespective of addressing mode)
• Try to speculate the non-trapping
arithmetic instructions
• Try to pseudo-predicate the non-
trapping arithmetic instructions
29. Kalray SA. Confidential - All Rights Reserved. 30
KVX IF-CONVERSION
FIND CANDIDATES (2)
(kvx_ifcvt_ce2_cond_mem_ce3)
• If a scratch register is needed to compute the address, reserve it by wrapping the original pattern and a USE inside a PARALLEL
• Succeed if the original pattern or the wrapped pattern is recognized inside a COND_EXEC
(kvx_ifcvt_ce2_cond_arith_ce3)
• Similar, except that it always wraps with a USE of a scratch register that has the mode of the destination
(kvx_ifcvt_ce2_spec_arith_ce3)
• If the destination register is only locally used (not live-out), the instruction may be speculatively executed unchanged
30. Kalray SA. Confidential - All Rights Reserved. 31
KVX PREPARE FOR CE3
IF-CONVERSION (1)
<prepare for CE3 if-conversion> in
(kvx_ifcvt_machdep_init)
• Extend the live-range of tested
register by inserting its USE at
end of THEN and ELSE blocks
• Update pattern of the pseudo-
predicated memory and arithmetic
insns to the one recognized in
(kvx_ifcvt_ce2_cond_mem_ce3)
or (kvx_ifcvt_ce2_cond_arith_ce3)
31. Kalray SA. Confidential - All Rights Reserved. 32
KVX PREPARE FOR CE3
IF-CONVERSION (2)
<prepare for CE3 if-conversion> in
(kvx_ifcvt_machdep_init)
• Flag the speculated instructions
with REG_NONNEG note (hack,
unused otherwise in this port)
• Insert USE of speculated
destination register in JOIN block
• Insert DEFs of scratch registers in
TEST block and USEs of scratch
registers in JOIN block
These DEFs and USEs prevent the
allocation of the same hard registers
to the scratch registers in one path
and live variables on the other path
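Why the scratch register needs its own hard register can be seen with a toy register file. The sketch below is an assumption-laden model (four registers, hypothetical function name, made-up values): it runs a pseudo-predicated instruction with the scratch placed either in a free register or in a register that holds a value still live when the condition is false.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy register file: r0 = a, r1 = y (live after the JOIN), r2 = x.
   The region is "if (cond) x = a + 5;" followed by a JOIN using x and y.
   scratch_reg selects which register the pseudo-predicated insn writes. */
uint64_t toy_run_region(bool cond, int scratch_reg)
{
    uint64_t regs[4] = { 10, 20, 1, 0 };

    regs[scratch_reg] = regs[0] + 5;   /* unconditional compute into scratch */
    if (cond)
        regs[2] = regs[scratch_reg];   /* conditional move to x */

    return regs[2] + regs[1];          /* JOIN block uses x and y */
}
```

With the scratch in the free register r3 the region behaves like the original code. With the scratch in r1, the unconditional write destroys y even when the condition is false, so the result changes; the inserted DEFs and USEs are there to rule out that allocation.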
32. Kalray SA. Confidential - All Rights Reserved. 33
KVX FINALIZE IF-
CONVERSION
(kvx_ifcvt_modify_insn)
• Implements target hook
IFCVT_MODIFY_INSN (CE3)
• Undo the COND_EXEC of pattern
by (cond_exec_process_insns) in
case of the inserted pseudo-DEFs
• Undo the COND_EXEC of pattern
by (cond_exec_process_insns) in
case of speculated instructions
33. Kalray SA. Confidential - All Rights Reserved. 34
COND_EXEC OF MEMORY
LOADS
(cond_exec_process_insns) tries to
COND_EXEC instruction patterns
• COND_EXEC of loads with the
"memsimple" operand predicate
must appear first (shown earlier)
• COND_EXEC of loads with a
"memory" operand predicate not
"memsimple" is not valid, so use
a (define_insn_and_split) to
simplify the addressing mode.
• In case CE3 fails, provide another
(define_insn_and_split) to undo
the PARALLEL wrapping done by
(kvx_ifcvt_ce2_cond_mem_ce3)
Similar patterns for loading with
zero/sign extension
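The effect of simplifying the addressing mode can be sketched in C, assuming a load whose address is an indexed, scaled access. The function names are hypothetical; the address is computed with integer arithmetic to mirror a plain machine addition into a scratch register.

```c
#include <stdint.h>

/* Original region: conditional load with a scaled-index address,
   which corresponds to the reg*size[reg] addressing mode. */
uint32_t toy_load_branchy(uint64_t c, const uint32_t *base, uint64_t index,
                          uint32_t dst)
{
    if (c != 0)
        dst = base[index];
    return dst;
}

/* After the split: the address is computed unconditionally into a
   scratch, then the conditional load only needs the [reg] mode. */
uint32_t toy_load_split(uint64_t c, const uint32_t *base, uint64_t index,
                        uint32_t dst)
{
    uintptr_t scratch = (uintptr_t)base + (uintptr_t)(index * sizeof *base);

    if (c != 0)                             /* stands for one predicated load */
        dst = *(const uint32_t *)scratch;
    return dst;
}
```

Only the address arithmetic is executed unconditionally; the memory access itself stays guarded, so a false condition never touches memory.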
34. Kalray SA. Confidential - All Rights Reserved. 35
COND_EXEC OF MEMORY
STORES
Similar to the COND_EXEC of
memory loads
• COND_EXEC of stores with the
"memsimple" operand predicate
must appear first (shown earlier)
• Use a (define_insn_and_split) to
simplify the addressing mode in
case of a "memory" operand
predicate which is not
"memsimple"
• In case CE3 fails, provide another
(define_insn_and_split) to undo
the wrapping with PARALLEL
done in CE2 by
(kvx_ifcvt_ce2_cond_mem_ce3)
35. Kalray SA. Confidential - All Rights Reserved. 36
COND_EXEC OF NON-
PREDICABLE ARITHMETIC
(cond_exec_process_insns) tries to
COND_EXEC the instruction patterns
• As there are no predicated
arithmetic instructions in the KVX
ISA, pseudo-predicate them
• Use a (define_insn_and_split) to
compute into scratch register,
then conditionally move it to the
original destination
• In case CE3 fails, provide
another (define_insn_and_split)
to undo the wrapping with
PARALLEL done in CE2 by
(kvx_ifcvt_ce2_cond_arith_ce3)
Similar patterns are needed for
most of the scalar ISA subset
36. Kalray SA. Confidential - All Rights Reserved. 37
CLEANUPS OF NON IF-
CONVERTED REGIONS
Undo the PARALLEL wrapping done
by (kvx_ifcvt_ce2_cond_arith_ce3)
with the (define_insn_and_split)
patterns previously shown
Also deactivate the previously
inserted UNSPEC_DEFs by splitting
them into USE
As CE3 is not always run, set the
kvx_ifcvt_ce_level to enable splitting
of the UNSPEC_DEFs and
unwrapping of the pseudo-predicated
instructions
This is done in the machine reorg
pass, which requires that all splitting
be done before doloop finalization
and sched2
37. Kalray SA. Confidential - All Rights Reserved. 38
AGENDA
1. Kalray MPPA Processor and KVX Core
2. KVX Code Generation Features
3. GCC IF-Conversion Framework
4. Extending GCC IF-Conversion
5. First Results and Outlook
38. Kalray SA. Confidential - All Rights Reserved. 39
EXAMPLE OF KVX IF-CONVERSION (BEFORE)
COND_MEM (# 14)
COND_MOVE (# 19)
COND_MEM (# 20)
SPEC_ARITH (# 21)
COND_MEM (# 22)
39. Kalray SA. Confidential - All Rights Reserved. 40
EXAMPLE OF KVX IF-CONVERSION (AFTER)
COND_MEM (# 14)
COND_MOVE (# 19)
COND_MEM (# 20)
SPEC_ARITH (# 21)
COND_MEM (# 22)
40. Kalray SA. Confidential - All Rights Reserved. 41
MORE EXAMPLES OF KVX IF-CONVERSION (1)
COND_ARITH (# 26)
41. Kalray SA. Confidential - All Rights Reserved. 42
MORE EXAMPLES OF KVX IF-CONVERSION (2)
COND_MEM (# 26)
42. Kalray SA. Confidential - All Rights Reserved. 43
MORE EXAMPLES OF KVX IF-CONVERSION (3)
COND_MEM (# 26)
COND_ARITH (# 27)
43. Kalray SA. Confidential - All Rights Reserved. 44
SUMMARY AND OUTLOOK
Implemented scalar if-conversion in GCC for the partially predicated KVX architecture
Relies on the IFCVT framework, but activates it both before (CE1, CE2) and after (CE3) reload
EXISTING SCALAR IF-CONVERSION APPROACHES
• On the SSA form: requires extensions such as Psi-SSA
• Before register allocation
o CMOVE only: GEM compiler
o Fully predicated: R-K and J. Fang algorithms (IA64)
o Partially predicated: S. Freudenberger (TRACE)
• After register allocation: GCC ports IA64, C6x, FRV
KEY FEATURES OF THE KVX SCALAR IF-CONVERSION IN GCC
• Apply pseudo-predication and local speculation after register allocation
• The scratch registers that will be unconditionally defined are reserved before register allocation with UNSPEC_DEFs and USEs
• The GCC FRV port looks for unused hard registers after reload instead
NEXT STEPS
• Cannot reuse the existing GCC (define_cond_exec) machinery, as it may only generate (define_insn) patterns
• Automate generation of the (define_insn_and_split) patterns that enable pseudo-predication
• Performance tuning for scalar (while) loop pipelining
44. Kalray SA. Confidential - All Rights Reserved. 45
www.kalrayinc.com
THANK YOU