2. Agenda
My background building Real Computers
Challenges with today’s Superscalar Computers
Lessons and Proposals for Future Computers
– Constrained Designs: i.e., backwards compatible, with pragmatic compromises
– Lessons from the last several years at Intel
– Unconstrained Designs: Unlocking more performance potential
Conclusions
3. My experience building real computers
Carry Save Arithmetic
– In 1954 I developed “Carry Save Arithmetic” (for multiplication, division, and square root) as my
student project, and presented it at a Russian conference in 1955 (a sketch of the idea follows below)
– This precedes the first Western publication of CSA, by M. Nadler in the Acta Technica journal (1956)
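To make the idea concrete, here is a minimal C++ sketch of carry-save addition (the names are illustrative, not Elbrus notation): three addends are reduced to a sum word and a carry word with no carry propagation, so a multiplier can chain such steps and pay for a carry-propagating add only once at the end.

```cpp
#include <cstdint>
#include <cstdio>

// Carry-save addition: reduce three addends to two (sum, carry)
// without propagating carries across the word. Only the final
// conversion to ordinary binary needs a carry-propagating add.
struct CarrySave {
    uint64_t sum;    // bitwise sum without carries (a ^ b ^ c)
    uint64_t carry;  // generated carries, shifted into position
};

CarrySave csa(uint64_t a, uint64_t b, uint64_t c) {
    return { a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1 };
}

int main() {
    // Sum four partial products the way a multiplier array would:
    // each CSA step costs one gate delay, independent of word width.
    uint64_t p[4] = {13, 21, 34, 55};
    CarrySave r = csa(p[0], p[1], p[2]);
    r = csa(r.sum, r.carry, p[3]);
    uint64_t total = r.sum + r.carry;  // single final carry-propagate add
    std::printf("%llu\n", (unsigned long long)total);  // prints 123
    return 0;
}
```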
Chief architect of Elbrus-1, Elbrus-2, and Elbrus-3 line of supercomputers
– My team built the Elbrus-line computers (1978-90), widely used in Russia, e.g., for the space program
– High-level programming language support put in hardware (not just support of existing HLLs
corrupted by outdated architectures) – still not implemented in any other computer so far
– High-Level Language EL–76 for the Elbrus-line computers
– The Elbrus OS kernel had support for real high-level programming
One of the first complete security solutions
– The Elbrus architecture, whose main goal is real support of the HLL EL–76, together with the Elbrus
OS kernel, fully solved the security problem as a byproduct, including the possibility to prove the
correctness of user-level programs.
4. My experience building real computers (continued)
• First industrial implementation of an Out-of-Order superscalar computer
– Elbrus-1 (implemented in 1978) was the first commercial implementation of OoO superscalar
in the world (a two-wide-issue computer)
– After the second generation of Elbrus computers in 1985, our team realized many weaknesses
of the superscalar approach and started looking for a more robust solution to the parallel
execution problem, leading us to VLIW.
• Elbrus-3: A Very Long Instruction Word (VLIW) computer
– Successful implementation of a cluster-based VLIW architecture with fine-grained parallel
execution (Elbrus-3, end of the 90s), probably the first in industry
• Hardware assisted Binary Translation
– Proposal and first implementation of Binary Translation (BT) technology for designing a
new architecture, built on radically new principles but binary compatible with the old ones
(Elbrus-3, end of the 90s).
• Fine-grained parallel architecture
– Design and simulation of radically new principles of fine-grained parallel architecture, with
extensions of the HLL (EL-76) and OS (the Elbrus OS kernel) to support them.
6. Drawbacks of Superscalar Paradigm - 1
Drawbacks of Superscalar architecture
– Program conversion is rather complicated (parallel->sequential->parallel)
– Superscalar architecture has a performance limit (regardless of available HW)
– Inability to use properly all available HW
– Even SMT mode cannot significantly improve efficiency (but decreases cache
utilization efficiency instead)
– Rather complicated VECTOR HW and MULTI-THREAD programming have to be
used to compensate somehow for this performance limit
– Today’s High-level languages (HLL) mirror the old and present-day architectures
(linear data space, no explicit parallelism). As a result, current architecture has
corrupted all today’s HLLs
– The current organization of computations does not allow for good optimizations
(good optimization requires full information about both the algorithm to be executed
and the hardware that will execute it)
– Non-universal architecture
7. Drawbacks of Superscalar Paradigm - 2
Memory and cache organization
– The current architecture does not support object-oriented data memory.
– This excludes the possibility of supporting truly secure computing and debugging facilities
– The cache organization of today’s architecture hides its internal structure, preventing the
compiler from doing good optimizations. This was done for compatibility with the
simple linear memory organization of older computers
Superscalar architecture today is very close to an un-improvable
state, with all the above-mentioned drawbacks
All the above-mentioned drawbacks have a single source:
the current architecture inherits, as its basic principles, those of ancient, early-days
computing with its severe HW size constraints
8. Beginning of Computer Era (early 50s – mid 90s) - 1
Single execution unit era
– Amount of available HW was the main constraint
– Single IP, single execution unit, linear memory of small size
– Performance was simply the number of executed operations (memory was fast relative to operation execution time)
– Binary programming was the most efficient method
– The programmer was responsible for all optimizations, as he knew both the algorithm and the
available HW resources. HW was very simple at that time, so the programmer was able to fulfil
this job very well
– The only reasonable HW improvement was improving this single execution unit
9. Beginning of Computer Era (early 50s – mid 90s) - 2
• General results for that period architecture:
This architecture was un-improvable under its constraints, because the
main resource (the single execution unit) was itself un-improvable (carry-save and high-radix
arithmetic) and every architecture had to include it
This architecture was absolutely universal among programmable architectures,
because any other architecture would have to include this single execution unit. No other
architecture could work faster or use less HW. Using more HW (more
execution units, for example) was impossible because of the main constraint:
the amount of available HW
• Basic Architecture Decisions:
Single Instruction Pointer ISA
Simple linear memory organization
No data types support in HW
Input binaries contain instructions on how to use resources,
rather than a description of the algorithm
10. Superscalar Era (mid 90s – now) - 1
Constraints of Superscalar era
– Significant Progress in Si technology, more HW available (HW constraint was
removed), faster execution, but slow memory
– Superscalar is still unable to use all available HW efficiently for a single job
– Implicit parallelization, but it requires converting a linear, single-IP execution flow into
parallel form in HW
– The original ordering has to be preserved: parallel execution is funneled back into
consecutive retirement (compatibility with the preceding decisions)
– Simple linear memory organization, no support for data types
11. Superscalar Era (mid 90s – now) - 2
Outcome of this period:
Sub-optimal functionality (semantics of data and operations)
– Without dynamic data type support in HW it is impossible to implement real high-
level programming and truly secure computing
Sub-optimal performance
– The programmer doesn’t know the details of the rather complicated HW and, as a result, is
unable to fully control the optimizations made by the HW
– The compiler does not have all the information about the algorithm being compiled
(due to corrupted high-level languages); on the other side, the compiler is
too far from the HW and is unable to fully utilize the HW and its internal
structures (e.g. caches), which are hidden from the compiler
– Superscalar hardware is exposed via the ISA only (which inherits all the obsolete
solutions); there is no way to provide the algorithm to such HW, and all the HW
machinery (BPU, renaming, cache organization, etc.) is designed to support
compatibility, with limited performance improvement
13. Algorithmically Oriented Post-Superscalar Era
Changing the angle of view:
– The algorithm of the program itself and its data dependencies are the real constraints on performance
and power
– Move HW complexity into SW; free HW from code analysis and parallel conversion (closer to
the algorithm’s representation)
– Move the design in the strongly opposite direction – from caring about resources to caring about algorithms
14. Constraints in Architecture are the Real Limiter
• Two designs will be considered:
CONSTRAINED system
– New Architecture (NArch) constrained by compatibility with legacy binaries (x86,
ARM, Power, etc.)
UNCONSTRAINED system
– Advanced New Architecture (NArch+) without compatibility constraints
(unconstrained), or more precisely – constrained only by the algorithm to be executed,
or by HW resources of the processor
• All past designs have reached their constraints:
– Arithmetic, the early-days single-execution-unit architecture, superscalar, the functionality of high-
level programming
• Therefore, to make the next step we should find some way to relax the constraints (for
the first case of future architecture) or to remove them (for the second case)
15. Basic Approach for New Architecture Design
• Let’s first design the best possible unconstrained architecture
• The constrained architecture is then just the unconstrained architecture
limited by several mechanisms required for compatibility support
• So we will get the best possible unconstrained and constrained
architectures!
• Three components must be fully investigated and designed to get the Best
Possible Architecture:
Language
Compiler
Hardware
16. New High-Level Programming Support
The Compiler should have full information about the algorithm being compiled
The new programming language should be able to expose the details of the algorithm to the
compiler and, eventually, to HW
The programmer should optimize only the algorithm, not its execution
The new language should have the following main features:
– Ability to express the parallel fine-grained structure of the algorithm in a perfectly clear manner,
convenient for the programmer
– The right functionality (semantics) of its elements, including dynamic data types and capability support *)
– Ability to present exhaustive information about the algorithm
*) This feature was completely implemented in the EL-76 language used in several generations of Elbrus
computers in Russia
17. Compiler
Role of compiler:
– Compiler is responsible for all optimizations (not HW)
– To do this it should be model-local, which allows it to have all information about the model configuration
– It gets all information about the algorithm from the program text after a simple transformation into an
intermediate distributive, which can be compiled for different computer models
– No information is lost during compilation (full algorithm representation)
– The compiler can use dynamic information from execution to tune optimizations
dynamically
The structure of the HW elements should be appropriate for good optimizations controlled by the
compiler
A model-local compiler removes compatibility requirements from HW: the HW can be changed
more freely when needed to satisfy requirements (e.g. performance, power, market
segments, etc.)
18. Process of Compilation
The first-level compiler generates a distributive w/o any optimizations (simple transformation from
source code to data flow graph without information losses)
The optimizing “real” compiler (distributive, or D-compiler) is model dependent and generates
optimized application code from the app distributive (using dynamic feedback for tuning)
[Diagram: Application source code → first-level compilation (transformation) → Distributive →
per-model Optimizing D-compiler → optimized Apps running on the System layer of HW model 1 / HW model 2]
19. Requirements for New Architecture Hardware
• Hardware should not do any optimizations (e.g. BPU, prefetching), as it doesn’t
have any information about the algorithm being executed
• Release hardware from the necessity to analyze binaries and extract parallelism
• Hardware should only allocate resources according to compiler instructions
• Hardware should avoid “artificial bindings” such as a Single Instruction Pointer, vectors,
cache lines, full virtual pages, etc.
• Hardware should give the compiler the possibility to change the HW configuration for
better optimizations (“Lego Set” HW)
• Hardware should use object oriented memory (like in Elbrus computers)
20. NArch Architecture (constrained compatible case)
• The semantics of legacy binaries cannot be changed due to compatibility requirements
• The only possible relaxation is to change how this semantics gets
presented to HW, in explicitly parallel form, for execution
• Release hardware from the necessity to analyze binaries and extract parallelism
• Let the software layer be responsible for finding available parallelism and optimizations (via
Binary Translation technology)
• Let HW be responsible for optimal scheduling only (remove unneeded complexity from
hardware and make it simpler) – like in the unconstrained case
• Actually, Binary Translation allows using all the mechanisms of the unconstrained architecture,
with the addition of:
o Memory ordering rules and retirement
o Checkpoint for target context reconstruction and events processing
o Memory renaming technique for memory conflicts resolution in binaries via bigger register file and
special guard HW structure
• Unfortunately, due to semantic compatibility reasons, the constrained architecture cannot
support security and aggressive procedure-level parallelization
22. Method of New Functionality Design
In the constrained architecture the functionality (semantics) of all its elements
(data and operations) is strictly determined by compatibility requirements
But first let’s consider the unconstrained computer system and its elements,
which were developed in accordance with the approach described above.
Note: All technologies and mechanisms are appropriate for both the constrained
and the unconstrained systems
23. Primitive Data Types & Operations
Primitive data types (HW keeps their types together with the value):
– Potential infinity (integer)
– Potential continuity (floating point)
– Predicates
– Enumerable types (e.g. character)
– Uninitialized data
– Data Descriptor and Functional Descriptor (“auxiliary” data types for technical
operations)
Primitive Data Types are Dynamic Data Types
– Value is kept together with tag
Type Safety Approach
– All primitive operations check types of their arguments
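As a rough illustration of the type-safety approach, the C++ sketch below models a tagged word and one checked primitive operation; the layout and names are assumptions for illustration, not the actual Elbrus encoding.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical layout: every machine word carries a tag next to its value,
// and every primitive operation checks the tags of its arguments.
enum class Tag : uint8_t {
    Integer,        // "potential infinity"
    Float,          // "potential continuity"
    Predicate,
    Enumerable,     // e.g. character
    Uninitialized,
    DataDescriptor,
    FuncDescriptor
};

struct Word {
    Tag tag = Tag::Uninitialized;
    uint64_t value = 0;
};

// Type-safe add: traps on anything but two integers, including any
// attempt to consume an uninitialized word.
Word add_int(const Word& a, const Word& b) {
    if (a.tag != Tag::Integer || b.tag != Tag::Integer)
        throw std::runtime_error("type check failed in ADD");
    return { Tag::Integer, a.value + b.value };
}
```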
24. User Defined Data Types (Objects)
The “natural” requirements for the new architecture to support language level
functionality, consistent with “abstract algorithm” ideas:
1. Every procedure can generate a new data object and receive a reference to this new
object
2. This procedure, using the received reference, can do everything possible with this new object
(read data from it, update its content, execute it as a program, and
delete it)
3. No other procedure can access this object right after it was generated, but this procedure
can give a reference to the object, with all or a subset of the rights listed above, to anybody it
knows (i.e., has a reference to)
4. Any procedure can generate a copy of a reference to any object it is aware of, with decreased
rights
5. After the object has been deleted, nobody can access it (all existing references become invalid)
Object-oriented data creation is an important step toward structuring data
according to the semantics of the source algorithm (a sketch of rules 1-5 follows)
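A minimal C++ sketch of rules 1-5, with illustrative names (Ref, ObjectSpace): references carry rights, a copy may only narrow them, and deletion invalidates all outstanding references.

```cpp
#include <cstdint>
#include <unordered_map>
#include <stdexcept>

// Hypothetical model of rules 1-5: references carry rights, copies may
// only narrow them, and deletion invalidates every outstanding reference.
enum Rights : uint8_t { READ = 1, WRITE = 2, EXECUTE = 4, DELETE = 8 };

struct Ref { uint64_t object_id; uint8_t rights; };

class ObjectSpace {
    std::unordered_map<uint64_t, bool> objects_;  // id -> live
    uint64_t next_id_ = 0;
public:
    // Rule 1: the creating procedure receives a reference with all rights.
    Ref create() {
        objects_[next_id_] = true;
        return { next_id_++, READ | WRITE | EXECUTE | DELETE };
    }
    // Rule 4: a copy of a reference may only decrease rights, never add.
    Ref derive(const Ref& r, uint8_t subset) {
        return { r.object_id, static_cast<uint8_t>(r.rights & subset) };
    }
    // Rule 5: after deletion every existing reference becomes invalid.
    void destroy(const Ref& r) {
        if (!(r.rights & DELETE)) throw std::runtime_error("no delete right");
        objects_.erase(r.object_id);
    }
    bool valid(const Ref& r) const { return objects_.count(r.object_id) != 0; }
};
```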
25. Dangling Pointers and Memory Compaction
To solve the dangling pointer problem (point 5) we must guarantee that after an object
has been deleted, no one can access the memory occupied by this object.
The de-allocation procedure frees the physical memory, but not the virtual memory. So
physical memory can be reused, but the virtual memory remains allocated
The well-known classical solution is a garbage collection algorithm, but it is inefficient as a
solution to the dangling pointer problem
When virtual memory gets close to its limit, the system starts compacting the virtual
memory
The compaction algorithm*):
– Each Data Descriptor is tagged, i.e. there is a special bit in registers and in memory which marks Data
Descriptors
– The system identifies which Data Descriptors are useless (point to objects de-allocated in physical memory)
and replaces them with Uninitialized data, or simply redirects them to a non-existent memory page, thus releasing
the virtual pages the descriptor had pointed to (according to the size of the object)
– The rest of the objects are moved to the vacant virtual memory, and their Data Descriptors’ base addresses are
replaced by the new virtual addresses
This compaction can be performed as a background process (a sketch follows below)
*) Note: this compaction algorithm was implemented in the Elbrus-1 and Elbrus-2 computers; it can be modified
to make it more efficient
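The C++ sketch below models the compaction scan (all names are illustrative): dead tagged descriptors become Uninitialized data, each live object is relocated once, and every descriptor of a relocated object is rewritten to the same new base.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal sketch of the compaction scan. "Descriptor" stands for a tagged
// Data Descriptor slot found in registers or memory; all descriptors of
// one object must receive the same new base address.
struct Descriptor {
    bool tagged;     // the tag bit marking this word as a Data Descriptor
    bool dead;       // object already de-allocated in physical memory
    uint64_t base;   // virtual base address (object number)
    uint64_t size;
};

void compact(std::vector<Descriptor>& words, uint64_t& next_free_base) {
    std::unordered_map<uint64_t, uint64_t> relocated;  // old base -> new base
    for (auto& w : words) {
        if (!w.tagged) continue;           // plain data: never a pointer
        if (w.dead) {                      // replace by Uninitialized data,
            w.tagged = false;              // releasing the virtual pages the
            continue;                      // descriptor had pointed to
        }
        auto [it, fresh] = relocated.try_emplace(w.base, next_free_base);
        if (fresh) next_free_base += w.size;  // move the object only once
        w.base = it->second;               // rewrite the descriptor's base
    }
}
```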
26. Procedures
A procedure is the fundamental notion of an HLL. Every procedure has a reference to its code and context.
The procedure context consists of the code, global data, parameters/return data, and its local data
A procedure can be called via a Functional Descriptor only (a tagged value)
[Diagram: a tagged Functional Descriptor holds the entry point address and a reference to the global
context; the procedure’s environment comprises the global data, procedure code, and local data]
1. A procedure can create a Functional Descriptor (FD) with a special instruction, providing an entry point
address and a Data Descriptor to some context as arguments; i.e., any procedure can define another
procedure
2. The procedure that has generated this FD can give it to anybody it has access to, and the
new owner can also call the new procedure via this FD
3. A procedure that generates an FD includes references to the code and global data in this FD
4. A procedure that has received the FD of the new procedure can call it and can pass it some
parameters (atomically).
5. The caller can receive some return data as a result of the procedure’s execution. Data return is logically an
atomic action
6. The called procedure can’t use anything beyond the context provided to it by the functional
descriptor and the parameters (a sketch follows below)
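As a rough C++ model of these rules (names illustrative; the atomic parameter passing is elided): an FD pairs an entry point with a Data Descriptor for the global context, and a call can only go through a tagged FD.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch of the procedure rules: an FD pairs an entry point
// with a Data Descriptor for the global context, and a call can only go
// through such a tagged FD.
struct DataDescriptor { uint64_t base; uint64_t size; };

struct FunctionalDescriptor {
    bool tag;                                 // marks the word as an FD
    void (*entry)(const DataDescriptor& ctx,  // procedure code
                  const DataDescriptor& params);
    DataDescriptor global_ctx;                // reference to global data
};

// Rule 1: any procedure can define another procedure by making an FD.
FunctionalDescriptor make_fd(void (*entry)(const DataDescriptor&,
                                           const DataDescriptor&),
                             DataDescriptor ctx) {
    return { true, entry, ctx };
}

// Rules 4-6: the call passes parameters, and the callee sees nothing
// beyond its FD context plus those parameters.
void call(const FunctionalDescriptor& fd, const DataDescriptor& params) {
    if (!fd.tag) throw std::runtime_error("call through a non-FD word");
    fd.entry(fd.global_ctx, params);
}
```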
27. Capability Mechanism
Only a system that provides type safety allows a correct implementation of the
procedure mechanism. A procedure can be called via a Functional Descriptor only
A procedure has access to its own context only. No other procedure can access this
procedure’s context unless it has been passed as a parameter to that other procedure
This approach introduces very strong inter-procedure protection
A Data Descriptor (DD) or Functional Descriptor (FD) is a capability to do something,
held by the procedure that has the DD or FD in its context:
– a DD is a capability to access some object
– an FD is a capability to do something – execute some procedure, which can modify global data
in the called procedure: data that is not directly accessible by the caller
Some operations must work with the bit-level representations
of special data types like DD and FD (the COMPACTION algorithm is a good example)
and sometimes need operation support in HW. All these are also primitive
operations; however, only a limited number of procedures should be able to use
them
28. Full Solution of the Security Problem
The described approach does not need a privileged mode for system programming
– E.g. in Elbrus, all programs, including OS, are written as “application” programs
The capability approach is more powerful and more general than the privileged-mode
approach (consistently implemented in Elbrus; with no C-list, which is a wrong approach)
However, even this architecture cannot protect against mistakes in user
programs. Probably, the only possible remedy in this case is the ability to prove the
correctness of user and kernel programs
– A formal proof of functional correctness was done for the seL4 microkernel in 2009 by the NICTA group
(National ICT Australia)
Even in this case, only the suggested architecture can help to considerably simplify
the proof of program correctness (for both the kernel and applications)
30. Object Oriented Memory (OOM) Structure
Object-oriented memory was initially introduced in the Burroughs B5500 computer architecture, but was not
implemented correctly
All the basic principles were first carefully designed in Elbrus-1 (1972-78)
Present-day memory and cache systems are corrupted by compatibility with the linear structure of old
computers. That means a future system should not use the traditional memory and cache organization,
which prevents the compiler from applying efficient optimizations
The OOM structure, even for the constrained architecture, can (according to preliminary estimations)
decrease cache sizes by 2-3x and nearly eliminate performance losses due to cache misses
Object oriented physical memory approach:
– The size of physical memory allocated for an object is equal to the object size
– Each allocated object is also loaded in the virtual space with pages of fixed size
– Each new object in virtual space is allocated starting from a new page contiguously (if the size of the object is
smaller than the page size, then the end of the virtual space of this page is empty)
[Diagram: in physical memory, objects occupy exactly their size; in virtual memory, each object
(Object N, Object M) starts on a new page, with the tail of its last page left EMPTY]
31. Object Oriented Memory: Objects Naming Rules
OOM uses virtual numbers of objects instead of virtual memory addresses
A virtual page number is allocated sequentially during each object generation
There is a system register that keeps the next free object number, to be
used for the next object generated
We will sometimes use the expression “virtual address” to mean “virtual
number”.
[Diagram: a virtual address is formed from the object’s virtual number N plus an index; the object
occupies Virtual Pages N(1), N(2), … with the tail left EMPTY; the Next Object Number system register holds N+1]
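A minimal C++ sketch of the naming rule just described, assuming a 4 KB page size (an assumption, not a spec): each generated object takes the next number from the system register and a fresh, contiguous run of pages.

```cpp
#include <cstdint>

// Sketch of the naming rule, with illustrative constants: every new object
// starts on a fresh virtual page, and a system register holds the next
// free object number.
constexpr uint64_t kPageSize = 4096;          // assumed page size

struct SysRegs { uint64_t next_object_number = 0; };

struct VirtualName { uint64_t object_number; uint64_t pages; };

VirtualName allocate_virtual(SysRegs& sr, uint64_t object_size) {
    // Pages are taken contiguously; the tail of the last page stays empty
    // if the object does not fill a whole number of pages.
    uint64_t pages = (object_size + kPageSize - 1) / kPageSize;
    VirtualName name{ sr.next_object_number, pages };
    sr.next_object_number += 1;               // numbers are never reused
    return name;
}
```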
32. Allocation of Objects and Sub-objects in Caches
Unlike the TLB used in contemporary computers, in this OOM architecture the TLB
translates a virtual address not into a physical memory address, but directly into the physical
location in the specific cache where this piece of data resides
In each specific cache, as well as in memory, the new architecture does not use cache
lines (as superscalar does)
Object’s parts allocated on cache levels are split into smaller parts, and all these parts
belong to the same virtual page
Each cache level could have its own small TLB
33. Generation of an Object
A special instruction in HW is used to generate an object (no SW library calls, as e.g.
malloc, no OS system calls)
The list of all occupied spaces is contained in the TLB, and the system maintains special lists for
all free spaces. Each free-list maintains the free areas of a certain size class (most
likely powers of 2)
For physical address allocation, HW takes a physical address from one of the free-lists
(the first free chunk of the corresponding list, whose head is also kept in a special HW register)
The result of the instruction’s execution is the corresponding Data Descriptor (a sketch follows below).
[Diagram: the GENOBJ instruction takes an Object Type and Object Size, allocates from the
power-of-2 free-lists (2, 4, 8, …, 2^N), and returns a Data Descriptor]
Note: the links are located inside the free memory chunks themselves
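Here is a small C++ sketch of GENOBJ’s physical allocation under the power-of-two free-list assumption; as on the slide, each free chunk holds the link to the next chunk of its size class inside itself.

```cpp
#include <cstdint>

// Sketch of GENOBJ's physical allocation, assuming power-of-two free-lists.
// Each free chunk stores the link to the next free chunk of the same size
// class inside itself.
struct FreeChunk { FreeChunk* next; };

struct FreeLists {
    static constexpr int kClasses = 32;
    FreeChunk* head[kClasses] = {};           // one list per size 2^k

    static int size_class(uint64_t size) {    // smallest k with 2^k >= size
        int k = 0;
        while ((uint64_t{1} << k) < size) ++k;
        return k;
    }
    // GENOBJ takes the first chunk from the matching list; the head of
    // each list would live in a special HW register.
    void* alloc(uint64_t size) {
        int k = size_class(size);
        if (!head[k]) return nullptr;         // would fall back to the system
        FreeChunk* c = head[k];
        head[k] = c->next;
        return c;
    }
    void release(void* p, uint64_t size) {    // de-allocation returns the chunk
        int k = size_class(size);
        auto* c = static_cast<FreeChunk*>(p);
        c->next = head[k];
        head[k] = c;
    }
};
```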
34. The Compiler Controls OOM Usage
This memory/cache system organization gives the compiler strong control over the
execution process
The compiler is aware of all the program’s semantic information and can perform more sophisticated
optimizations
The compiler can preload the needed data into a high cache level, at first without committing the more
precious register memory, and move the data from cache to registers only at the last
moment. But even preloading directly into registers can now sometimes be a good
alternative – we now have a big register file.
This cache organization allows an instruction to access the first-level cache directly
by physical address, without a virtual address and associative search.
To do this, the base register (BR) can support a special mode in which it holds a pointer
to the physical location in the first-level cache together with the virtual address.
35. Explicitly Parallel Instruction Execution in NArch+
In NArch+ architecture all mutually independent executable objects can be
executed in parallel to each other. This includes:
– Operations
– Chains of dependent operations inside scalar and/or iterations of loop code
– Procedures
– Jobs
NArch+ overcomes the difficulties and constraints of the Data Flow and Single-IP
approaches and excludes any “artificial binding” in HW (the program is a parallel
graph)
Two different approaches have been investigated in NArch+ for program data
graph execution: strands and streams (see next slides)
36. STRANDs Oriented Architecture
• Strands express parallelism via chains of (mainly) data-dependent operations (in a more natural
way than, e.g., VLIW) and provide a new opportunity for presenting parallelism to OoO HW
• Simple instruction scheduling for parallel execution
– Need to look only at the oldest instructions in each Strand (much smaller and simpler RS)
• Strands also provide:
– Bigger effective instruction window
– Reduced register usage (via intra-strand accumulators)
– Wider instruction issue width (via clustering with register-to-register communication)
• Adding the ability to express parallelism in the uISA gives additional advantages, e.g. superior control
over speculation and power, better HW utilization, many more opportunities for
optimizations, and for resolving the memory latency issue
[Diagram: the original data graph is cut into strands (IP1, IP2, IP3); each cluster (Cluster 1, Cluster 2)
has its own HW scheduler, register file, and execution units, joined by an interconnect]
37. Drawbacks of the STRANDs Architecture
Strands are extracted from the program data graph by the compiler
Each strand is executed by HW in-order, but strands execute out-of-order relative to each other
HW allocates a set of resources to each active strand (called a WAY)
The compiler creates a strand via a special FORK operation, which takes a free WAY for the
strand’s execution
BUT the compiler has to be aware of the number of WAYs available in HW and schedule
strands accordingly. Otherwise there can be a deadlock (e.g. no free WAY to
spawn a new strand, while other strands are waiting for some result from this new strand –
see the sketch below)
Having the strand (WAY) as a compiler-scheduled resource potentially limits parallelism
[Diagram: FORK A and FORK B spawn strands A: and B: across Way 0, Way 1, and Way 2]
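A toy C++ model of why WAYs are a compiler-visible resource: FORK must claim a free WAY, and if none is free while the running strands all wait on the strand that could not be spawned, execution deadlocks (structure and names are illustrative).

```cpp
#include <cstdint>
#include <optional>

// Sketch of why WAYs are a compiler-visible resource: FORK must find a
// free WAY, and if none is free while every running strand waits on the
// strand that could not be spawned, the machine deadlocks.
struct Ways {
    uint32_t busy = 0;                 // one bit per HW WAY
    int total;
    explicit Ways(int n) : total(n) {}

    std::optional<int> fork() {        // FORK: claim a free WAY or fail
        for (int w = 0; w < total; ++w)
            if (!(busy & (1u << w))) { busy |= 1u << w; return w; }
        return std::nullopt;           // no free WAY: if running strands
    }                                  // block on this one -> deadlock
    void retire(int w) { busy &= ~(1u << w); }
};
```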
38. DL/CL Mechanism for Register/Predicate Reuse
Definition-Line (DL):
– A Definition Line L is a group of DL-instructions in different streams, which
form an explicit DL-front dividing the streams into intervals
– A DL-front crosses all alive streams according to a possible timing analysis.
Fronts are successive – they do not cross each other
Check-Line (CL):
– A Check Line (CL) is a group of CL-instructions suspending the execution of
some streams until the specified DL-front has completely passed
– After that, the corresponding register/predicate resource can be safely reused
(a sketch follows below)
[Diagram: streams A through S flowing over time, crossed by successive DL-fronts (+DL);
a CL instruction (CL-2) suspends streams until the specified front has completely passed]
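As a rough analogy in C++ (not the HW implementation): a DL-front behaves like a counter decremented by each alive stream’s DL-instruction, and a CL-instruction is a wait for that counter to reach zero before the guarded registers/predicates are reused.

```cpp
#include <atomic>

// Illustrative sketch of the DL/CL idea: a DL-front is complete once every
// alive stream has executed its DL-instruction; a CL-instruction suspends
// its stream until that happens, after which the guarded registers or
// predicates may be safely reused.
struct DLFront {
    std::atomic<int> remaining;               // alive streams still to cross
    explicit DLFront(int alive_streams) : remaining(alive_streams) {}

    void dl_passed() {                        // a stream crosses the front
        remaining.fetch_sub(1, std::memory_order_release);
    }
    bool cl_may_proceed() const {             // polled by a suspended stream
        return remaining.load(std::memory_order_acquire) == 0;
    }
};
```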
39. Intelligent Branch Processing
– Conventional: Branch predict one
path, discard everything when wrong
– New Architecture: Speculate when
necessary, discard only misspeculated
work
– Increases performance
– Reduces wasted energy due to
misspeculation
– According to our statistics, 80% of
branches are not critical and can be
executed without speculation
40. STREAMs Oriented Architecture
Streams and How They Get Created
• First let’s describe the simplest case, when the algorithm to be executed is scalar by nature (an acyclic
data-dependency graph) without conditional branches
• Let the total number of operations equal the number of available registers (single assignment,
no register reuse)
• For this simple case:
– No decoding stage (each instruction is ready to be loaded into the corresponding execution unit; the compiler
prepares the code)
– For each instruction in the graph the compiler calculates a “Priority Value Number” (PVN). This number is the
number of clocks from this instruction to the end of the graph along the longest path. The compiler presents the
code as a number of sequences of dependent instructions – “streams”
– As the first instruction of a new stream, the compiler takes the instruction with the highest PVN not yet included
in any other stream. For each next instruction in the stream, the compiler again selects the instruction with
the highest PVN among those data-dependent on the previous instruction in the stream. And so on, until the
stream reaches either the end of the scalar code or runs into some other stream (a sketch of this pass follows below).
[Diagram: a data-dependency graph and its decomposition into Stream 1, Stream 2, and Stream 3]
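The compiler pass just described can be sketched in a few lines of C++ (structures illustrative): a reverse topological sweep computes each instruction’s PVN, and streams are then grown greedily by descending PVN.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the compiler pass described above: PVN = number of clocks from
// an instruction to the end of the graph along the longest path; streams
// are grown greedily by descending PVN.
struct Instr {
    std::vector<int> succ;   // data-dependent successors
    int latency = 1;
    int pvn = 0;
    int stream = -1;         // -1: not yet in any stream
};

void compute_pvn(std::vector<Instr>& g) {
    // Assumes instructions are stored in topological order, so a reverse
    // sweep sees all successors before each node.
    for (int i = (int)g.size() - 1; i >= 0; --i) {
        int best = 0;
        for (int s : g[i].succ) best = std::max(best, g[s].pvn);
        g[i].pvn = g[i].latency + best;
    }
}

void extract_streams(std::vector<Instr>& g) {
    // Consider instructions as stream heads in order of descending PVN.
    std::vector<int> order(g.size());
    for (size_t i = 0; i < g.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return g[a].pvn > g[b].pvn; });
    int streams = 0;
    for (int head : order) {
        if (g[head].stream != -1) continue;   // already inside some stream
        int cur = head, id = streams++;
        while (cur != -1) {
            g[cur].stream = id;
            int next = -1;                    // highest-PVN free successor
            for (int s : g[cur].succ)
                if (g[s].stream == -1 && (next == -1 || g[s].pvn > g[next].pvn))
                    next = s;
            cur = next;                       // stops at the end of the code
        }                                      // or on reaching another stream
    }
}
```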
41. Scalar Code Execution With STREAMs
Execution Engine (Workers)
Register File:
– Each register has an EMPTY/FULL bit (EMPTY prevents reading the register when the value is not ready yet;
FULL prevents writing to the register while not all dependent instructions have consumed the value)
– Each register has an additional bit showing whether the operation generating the value for this register has
already been sent to an execution unit (EU) or is in the Reservation Station (RS)
The main scheduling and execution mechanisms for Streams are “workers” (16 per cluster)
How the workers work (a sketch follows below):
– Workers issue ready instructions to the RS/execution units (the arguments are FULL, or the predecessors are in the RS/EU)
– Each register has a list of streams waiting for the result in this register
– If a waiting stream becomes ready for execution (the value is ready), it is moved to the “waiting for a free worker” queue
– A free worker takes an instruction from the “waiting for workers” queue or from the Instruction Buffer
– If an argument of the next instruction in the stream is not ready yet, the worker stops executing this stream and puts it
into the waiting queue of this argument (register)
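A simplified, single-threaded C++ model of the worker mechanism (all structures illustrative; a real machine would run workers concurrently and drain the ready queue): a worker walks down one stream issuing ready operations, and parks the stream on a register’s wait list the moment an argument is EMPTY.

```cpp
#include <deque>
#include <vector>

// Simplified sketch of the worker mechanism. Registers carry EMPTY/FULL
// bits; a worker walks down one stream until an argument is EMPTY, then
// parks the stream on that register's wait list and frees itself.
struct Register {
    bool full = false;                 // FULL: value produced and live
    std::deque<int> waiting_streams;   // streams blocked on this register
};

struct Op { std::vector<int> args; int dst; };

struct Stream { std::vector<Op> ops; size_t pc = 0; };

struct Machine {
    std::vector<Register> rf;
    std::vector<Stream> streams;
    std::deque<int> ready;             // "waiting for a free worker" queue

    // One worker step: issue ops whose arguments are FULL; when a result
    // is written, wake every stream waiting on that register.
    void run_worker(int sid) {
        Stream& st = streams[sid];
        while (st.pc < st.ops.size()) {
            Op& op = st.ops[st.pc];
            for (int a : op.args)
                if (!rf[a].full) {                 // argument not ready:
                    rf[a].waiting_streams.push_back(sid);
                    return;                        // park stream, free worker
                }
            rf[op.dst].full = true;                // "execute" and publish
            for (int w : rf[op.dst].waiting_streams) ready.push_back(w);
            rf[op.dst].waiting_streams.clear();
            ++st.pc;
        }
    }
};
```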
42. NArch+: Scalar Code Execution
More Complex Case (Bigger Code)
If the scalar code is big enough, the DL/CL technique is applied for register reuse, to guarantee
correct dynamic execution of streams and optimal utilization of the Instruction Buffer
When the code before CL(N) has been executed, it is necessary to preload the next part of the code,
between CL(N) and CL(N+1). Similarly, once DL(N) is crossed, the whole code area above it can be freed
The size of the code between CL(N) and CL(N+1) is no bigger than the size of the Register File
The execution time can be improved with the help of the Dynamic Feedback mechanism (both in
HW and SW)
If there are conditional branches in the code, the compiler uses speculative streams to handle
these cases efficiently (predicated streams, and a GATE instruction to check the predicate value and to
kill one of the streams in case of wrong speculation)
More details on speculation techniques (e.g. load/store speculation, efficient branch handling
without branch prediction) would require more low-level micro-architecture details. Alas!
This scalar technology is nearly the same both for constrained and unconstrained versions of the
architecture
This scalar code execution technique is a practical
implementation of Data Flow architecture
43. Summary: Strands vs. Streams
Strands
[Diagram: the original program graph is mapped onto ways in HW; a HW scheduler feeds the EXEC
units and RF of the parallel HW]
The mechanism of strand execution (one way per strand) is
visible to the compiler, so the compiler has to watch how many
strands will be executed by HW at each moment, and
that number is limited by the number of ways
Cons: can lead to deadlock; limits parallelism due to explicit
resource (ways) scheduling by the compiler
Streams
[Diagram: the HW scheduler executes the original program graph directly; workers feed the RS,
EXEC units, and RF of the parallel HW]
The compiler can create any number of streams; the
mechanism of stream execution is not visible to the compiler
Pros: no deadlock; HW executes the original graph; a natural data-
flow execution mechanism
44. NArch+: Code with Loops
Use loop iteration parallelism (both intra-iteration and inter-
iteration) as fully as possible
Loop iteration analysis performed by the compiler:
– Find instructions that are self-dependent across iterations
– Find groups of instructions that, besides being self-dependent, are also
mutually dependent across iterations (“rings” of data dependency)
– The rest of the instructions form sequences or graphs of dependent
instructions (a number of “rows”)
– The result of each row is either an output of the iteration (a STORE, for
example), or is used by other row(s) or ring(s).
Each “ring” and/or “row” loop produces data, which is
consumed by other small loops. Each producer can have a
number of consumers. However, producer and consumer should
be connected through a buffer, giving the producer the possibility to
run ahead if the consumer is not yet ready to use the data (see the sketch below)
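A minimal C++ sketch of the producer/consumer coupling between ring and row loops: a small bounded buffer lets the producer run ahead of a consumer that is not yet ready (depth and interface are assumptions, and this single-threaded model elides HW synchronization).

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Sketch of the producer/consumer coupling between "ring" and "row" loops:
// a small bounded buffer lets the producer run ahead when a consumer is
// not yet ready.
template <typename T>
class IterBuffer {
    std::vector<std::optional<T>> slots_;
    size_t head_ = 0, tail_ = 0, count_ = 0;
public:
    explicit IterBuffer(size_t depth) : slots_(depth) {}

    bool push(const T& v) {             // producer: fails only when the
        if (count_ == slots_.size()) return false;  // consumer is far behind
        slots_[tail_] = v;
        tail_ = (tail_ + 1) % slots_.size();
        ++count_;
        return true;
    }
    std::optional<T> pop() {            // consumer row/ring takes next value
        if (count_ == 0) return std::nullopt;
        auto v = *slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return v;
    }
};
```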
45. Loops Handling in NArch+
Differences between NArch and NArch+ in loops implementation:
– NArch+ does not need to support compatibility with the Single-IP approach; therefore, many different
loops can be executed together (even a “single” loop can be executed out-of-order)
– NArch+ has a simple memory system without speculative buffers; therefore, in some cases
(speculation only) it is necessary to use other mechanisms and new HW support
Types of loops, handled by NArch+:
– RECURRENT loop (including WHILE loop)
– DO ALL (trip count is known before the loop start)
– DO ALL (trip count becomes known during the loop execution only)
– Loop with low probable “maybe” dependence between iterations (through memory) (including
WHILE loop)
– Loop with “maybe” data dependence within iterations
46. Parallel Procedure Execution
• In the constrained architecture, a procedure can be executed on a varying number of
clusters, but no more than four.
• The compiler will try to inline as many called procedures as possible, to be able
to exploit the resulting procedure-level parallelism in full.
• As usual for the constrained case, the caller will wait for the end of the called procedure and
will work with the same resources.
• A call, as well as a return, is logically an atomic step; however, to increase performance
using the DL/CL technology, there are prolog and epilog areas where both caller
and callee work together without interfering with each other.
• In the unconstrained architecture, the new HLL allows parallel procedure execution, but
again each procedure will use no more than four clusters.
• If some procedure has a DO ALL loop inside, this loop can use all available HW
(many clusters, up to all the clusters on the chip – ~60 today).
48. NArch/IA Architecture (IA compatible case study)
NArch/IA is a new x86-compatible micro-architecture based on the strands approach
– NArch Strand – a sequence of (usually dependent, but can include control flow) operations with its own
IP; strands are executed out-of-order, in parallel
– BT parses IA binaries, extracts strands and provides them to HW for scheduling and execution
– Multiple strands allow overlapping of memory accesses (thus improving memory latency)
A fairly wide CPU due to scalable clustering
– One or two bi-clusters (up to 4 clusters and 24 instructions issue width – 16 strands per cluster)
– Clusters are tightly-coupled (register-to-register communication and synchronization)
Very large sparse instruction window
– Much larger than in conventional superscalar (~1K instructions)
– Branch resolution in large window (no HW branch predictor)
– Memory disambiguation in large window
– Smart retirement in large window (no retirement for registers)
Binary Translation for IA compatibility and enabling of the NArch uarch
– Dynamic and static BT for maximum ST/MT performance and
efficiency
Highly parameterized architecture (scalability)
– Variable number of clusters/strands per cluster
– Dynamically reconfigurable machine (ST/MT)
The result is higher performance and lower power at the same time
49. Advantages of the New Architecture
Compatible (constrained) case
• This approach can ensure full compatibility with some of existing binaries (ARM,
x86, POWER, RISC-V, etc.) or even with all of them on the same HW with the
help of Binary Translation
• Preliminary investigations allow us to make the following rather reliable
predictions:
– A compatible version (NArch) can reach the best possible, un-improvable performance
restricted by binary semantics constraints (not by binary’s sequential presentation) and amount
of resources available for specific model only
– ~3x-4x ST performance @ unconstrained power vs. OOO Core
– ~2x ST performance @ iso-power
– Less than ~50% of power @ iso-performance
– ~2x MT performance @ iso-power vs OOO Core
50. Advantages of the New Architecture
Incompatible (unconstrained) case
• If we release HW architecture from the requirement to maintain
compatibility with old style programming, then:
– We can significantly simplify the architecture (e.g. 70-75% of the constrained architecture carries the
burden of maintaining compatibility with superscalar)
– Introduce explicit parallelism in programming languages to expose the algorithm’s structure to HW
more easily
– Introduce security in HW (tagged architecture) and, eventually, get rid of viruses and make
programming safe and reliable
– Get rid of the obsolete cache memory hierarchy (object-oriented memory)
– Eventually, significantly increase performance (up to 5x-7x or even more)
– Improve scalability and universality (new distributive, HW-model-oriented compiler)
– Build an absolutely un-improvable computer architecture
• As a result of this architecture’s high universality, we can hope that
all special applications like machine learning, computer vision, and graphics
will be supported well, with high performance
53. Object Oriented Memory: TLB Structure
Each TLB entry, besides helping to translate a virtual address into a physical data location, can
also include some documentation of the referenced object: its size, its user data type (Object Type
Name – OTN), and perhaps some other information
It also includes references to more detailed tables of the physical locations of all elements of this
object in the cache(s)
Not every object must be present in memory. Some objects can be generated,
for example, in the DCU (Level 1 cache) only (a sketch of such an entry follows below)
[Diagram: a Data Descriptor (access rights, object number, index) selects a TLB entry; the entry holds
the object size, object type, physical location, and sub-object information, pointing to the object or
sub-object in the DCU, MLC, LLC, or physical memory]
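The entry described above might be sketched in C++ as follows; every field name here is an assumption based on the slide, not a specification.

```cpp
#include <cstdint>

// Sketch of the TLB entry described above: translation goes from object
// number straight to a physical location in a specific cache level, plus
// documentation of the object itself.
enum class CacheLevel : uint8_t { DCU, MLC, LLC, Memory };

struct SubObjectRef {        // reference to a more detailed table of the
    CacheLevel level;        // physical locations of the object's parts
    uint32_t table_index;
};

struct TlbEntry {
    uint64_t object_number;  // the key: virtual object number, not an address
    uint64_t object_size;    // documentation of the referenced object
    uint32_t object_type;    // user data type (Object Type Name, OTN)
    CacheLevel where;        // which cache (or memory) holds the data now
    uint64_t phys_location;  // physical location inside that cache level
    SubObjectRef sub_info;   // per-element locations across cache levels
};
```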
54. Advantages of Object Oriented Memory System
Unlike superscalar, the OOM memory/cache system is visible to the compiler – no uncontrollable
physical pages, lines, or cache structure hidden from the compiler. This helps significantly
improve efficiency
The explicit object-oriented structure helps to increase the efficiency of memory usage. All free
memory is explicitly visible to the compiler and HW
The ability to access the first-level cache using physical addresses directly from instructions
promises a huge increase in efficiency
Inexpensive memory allocation (without OS and library calls) also helps to increase
efficiency and makes it simpler to design the Operating System
The eviction process is explicitly controlled by the compiler
The compiler has full knowledge of the cache structure, can make nearly all procedure-local
data resident in the first-level cache, and can make them accessible by physical
addresses; this will substantially decrease cache misses.
Cache sizes will also be reduced
The compiler can control object and sub-object allocation and preloading
55. STREAMs Oriented Architecture
Removing the Drawbacks of the STRANDs Approach
Get the maximum parallelism available in the Program Data Graph and execute the graph itself
Chains of data-dependent operations are still presented to HW, but they are just hints –
STREAMs, not a real resource
New mechanism of STREAM execution – WORKERs
No deadlocks anymore, as streams are not a static scheduling resource in the compiler
(any number of streams); HW “workers” dynamically choose operations from the ready
streams and dispatch them to the Reservation Station for execution
More details on next slides…
[Diagram: the Program Data Graph (operations 1-11) is decomposed into STREAMs; WORKERs
dynamically pick ready operations from the streams and dispatch them through the Reservation
Station to Execution]