1
Boris Babayan
Intel Fellow
October 2016
A Perspective on the
Future of Computer Architecture
2
Agenda
 My background building Real Computers
 Challenges with today’s Superscalar Computers
 Lessons and Proposals for Future Computers
– Constrained Designs: i.e. backwards compatible, with pragmatic compromises
– Lessons from the last several years at Intel
– Unconstrained Designs: Unlocking more performance potential
 Conclusions
3
My experience building real computers
 Carry Save Arithmetic
– In 1954 I developed “Carry Save Arithmetic” (for multiplication, division and square root) as my
student project, and presented it at a Russian conference in 1955
– This precedes the first Western publication of CSA, by M. Nadler in the Acta Technica journal (1956)
 Chief architect of Elbrus-1, Elbrus-2, and Elbrus-3 line of supercomputers
– My team built Elbrus-line computers (1978-90) widely used in Russia, e.g. for the space program
– High-level programming language support built into hardware (not just support for existing HLLs
corrupted by outdated architectures) – still not implemented in other computers
– High Level Language EL–76 for Elbrus-line computers
– Elbrus OS kernel had support for real High Level programming
 One of first complete security solutions
– The Elbrus architecture, whose main goal is true support of the HLL EL–76, together with the Elbrus
OS kernel, fully solved the security problem as a byproduct, including the possibility of proving the
correctness of user-level programs
4
My experience building real computers (continued)
• First industrial implementation of an Out-of-Order superscalar computer
– Elbrus 1 (implemented in 1978) was the first commercial implementation of OoO superscalar
in the world (two-wide issue computer)
– After the second generation of Elbrus computers in 1985, our team realized many weaknesses
of the superscalar approach and started looking for a more robust solution to the parallel-
execution problem, which led us to VLIW
• Elbrus-3: A Very Long Instruction Word (VLIW) computer
– Successful implementation of a cluster-based VLIW architecture with fine-grained parallel
execution (Elbrus 3, end of the 90s), probably for the first time in industry
• Hardware assisted Binary Translation
– Proposal and first implementation of Binary Translation (BT) technology for designing a
new architecture built on radically new principles yet binary compatible with the old ones
(Elbrus 3, end of the 90s)
• Fine-grained parallel architecture
– Design and simulation of radically new principles of fine-grained parallel architecture, and
extension of an HLL (like EL-76) and an OS (like the Elbrus OS kernel) to support them
5
Challenges with today’s
Superscalar Processors
6
Drawbacks of Superscalar Paradigm - 1
 Drawbacks of Superscalar architecture
– Program conversion is rather complicated (parallel->sequential->parallel)
– Superscalar architecture has a performance limit (regardless of available HW)
– Inability to use all available HW properly
– Even SMT mode cannot significantly improve efficiency (but decreases cache
utilization efficiency instead)
– Rather complicated VECTOR HW and MULTI-THREAD programming have to be
used to compensate somehow for this performance limit
– Today’s high-level languages (HLLs) mirror the old and present-day architectures
(linear data space, no explicit parallelism). As a result, current architectures have
corrupted all of today’s HLLs
– The current organization of computation does not allow good optimization (good
optimization requires full information both about the algorithm to be executed and
about the hardware that will execute it)
– Non-universal architecture
7
Drawbacks of Superscalar Paradigm -2
 Memory and cache organization
– The current architecture does not support object-oriented data memory
– This excludes the possibility of supporting truly secure computing and debugging facilities
– The cache organization of today’s architecture hides its internal structure, preventing the
compiler from doing good optimizations. This was done for compatibility with the
simple linear memory organization of older computers
Superscalar architecture today is very close to an un-improvable
state, with all the above-mentioned drawbacks included
All the above-mentioned drawbacks have a single source – today’s architecture has
inherited, as its basic principles, the principles of early-days computing with its severe
HW size constraints
8
Beginning of Computer Era (early 50s – mid 90s) - 1
 Single execution unit era
– Amount of available HW was the main constraint
– Single IP, single execution unit, linear memory of small size
– Performance was simply the number of executed operations (memory was fast relative to operation execution time)
– Binary programming was the most efficient method
– The programmer was responsible for all optimizations, as he knew both the algorithm and the
available HW resources. HW was very simple at that time, so the programmer was able to do
this job very well
– The only reasonable HW improvement was improving this single execution unit
9
Beginning of Computer Era (early 50s – mid 90s) - 2
• General results for the architecture of that period:
 This architecture was un-improvable under the corresponding constraints, because its
main resource (the single execution unit) was itself un-improvable (carry-save and high-radix
arithmetic) and every architecture had to include it
 This architecture was absolutely universal among programmable architectures,
because any other architecture would have to include this single execution unit. No other
architecture could work faster or use less HW. Using more HW (more
execution units, for example) was not possible because of the main constraint of
available HW
• Basic Architecture Decisions:
 Single Instruction Pointer ISA
 Simple linear memory organization
 No data types support in HW
The input binary contains instructions on how to use resources,
rather than a description of the algorithm
10
Superscalar Era (mid 90s – now) - 1
 Constraints of Superscalar era
– Significant Progress in Si technology, more HW available (HW constraint was
removed), faster execution, but slow memory
– Superscalar is still unable to use all HW efficiently for a single job
– Implicit parallelization, which requires converting a linear single-IP execution flow into
parallel form in HW
– The original ordering has to be preserved, from parallel execution to
consecutive retirement (for compatibility with the preceding decisions)
– Simple linear memory organization, no support for data types
11
Superscalar Era (mid 90s – now) - 2
 Outcome of this period:
 Sub-optimal functionality (semantics of data and operations)
– Without dynamic data types support in HW it is impossible to implement real high
level programming and true security computing
 Sub-optimal performance
– The programmer doesn’t know the details of the rather complicated HW and, as a result, is
unable to fully control the optimizations made by HW
– The compiler does not have full information about the algorithm being compiled
(due to the corrupted high-level languages); on the other hand, the compiler is
too far from the HW and is unable to fully utilize the HW and its internal
structures (e.g. caches), which are hidden from the compiler
– Superscalar hardware is exposed via the ISA only (which inherits all obsolete
solutions); there is no way to provide the algorithm to such HW, and all the HW
machinery (BPU, renaming, cache organization, etc.) is designed to support
compatibility, with only limited performance improvement
12
New Post-SuperScalar Architecture
(what we call “Best Possible” Computer System)
13
Algorithmically Oriented Post-Superscalar Era
 Changing the angle of view:
– The algorithm of the program itself and its data dependences are the real constraints on performance
and power
– Move HW complexity into SW; free HW from code analysis and parallel conversion (closer to
the algorithm representation)
– Move the design in the opposite direction – from caring about resources to caring about algorithms
14
Constraints in Architecture are the Real Limiter
• Two designs will be considered:
 CONSTRAINED system
– New Architecture (NArch) constrained by compatibility with legacy binaries (x86,
ARM, Power, etc.)
 UNCONSTRAINED system
– Advanced New Architecture (NArch+) without compatibility constraints
(unconstrained), or more precisely – constrained only by the algorithm to be executed,
or by HW resources of the processor
• All past designs have reached their constraints:
– Arithmetic, the early-days single-execution-unit architecture, superscalar, the functionality of high-
level programming
• Therefore, to make the next step we should find a way to relax the constraints (for
the first case of the future architecture) or to remove them (for the second case)
15
Basic Approach for New Architecture Design
• Let’s first design the best possible unconstrained architecture
• The constrained architecture is going to be just the unconstrained architecture
limited by several mechanisms required for compatibility support
• So we will get the Best Possible unconstrained and constrained
architectures then!
• Three components must be fully investigated and designed to get the Best
Possible Architecture:
 Language
 Compiler
 Hardware
16
New High-Level Programming Support
 The Compiler should have full information about the algorithm being compiled
 The new programming language should be able to expose the details of the algorithm to the
compiler and, eventually, to HW
 The programmer should optimize only the algorithm, not its execution
 New Language should have the following main features:
– Ability to express the parallel fine-grained structure of the algorithm in perfectly clear and convenient
(for programmer) manner
– Right functionality (semantics) of its elements, including dynamic data types and capability support *)
– Ability to present exhaustive information about the algorithm
*) This feature was completely implemented in the EL-76 language used in several generations of Elbrus
computers in Russia
17
Compiler
 Role of compiler:
– The compiler is responsible for all optimizations (not HW)
– To do this it should be model-local, which allows it to have full information about the model configuration
– It gets all information about the algorithm from the program text after a simple transformation into an
intermediate distributive, which can be compiled for different computer models
– No information is lost during compilation (full algorithm representation)
– The compiler can use dynamic information from execution to tune optimizations
dynamically
 The structure of the HW elements should be appropriate for good optimizations controlled by the
compiler
 A model-local compiler removes compatibility requirements from HW: HW can be changed
more freely if needed to satisfy particular requirements (e.g. performance, power, market
segments)
18
Process of Compilation
 The first-level compiler generates a distributive w/o any optimizations (simple transformation from
source code to data flow graph without information losses)
 The optimizing “real” compiler (distributive, or D-compiler) is model dependent and generates
optimized application code from the app distributive (using dynamic feedback for tuning)
[Diagram: Application source code → first-level compilation (transformation) → Distributive; the
model-dependent optimizing D-compiler then produces apps and a system layer for HW model 1,
HW model 2, etc.]
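A minimal Python sketch of the two-stage flow on this slide, under simplifying assumptions: the "distributive" is modeled as a plain data-flow graph, dependencies are ignored during scheduling, and names such as Distributive, first_level_compile and d_compile are illustrative, not taken from the slides.

    # Minimal sketch of the two-stage compilation flow described above.
    from dataclasses import dataclass, field

    @dataclass
    class Distributive:
        """Model-independent result of first-level compilation:
        the full data-flow graph, with no optimizations applied."""
        ops: dict = field(default_factory=dict)      # op name -> (opcode, [input op names])

    def first_level_compile(source_ops):
        """Simple transformation from 'source code' to a data-flow graph,
        with no information loss and no optimization."""
        return Distributive(ops=dict(source_ops))

    def d_compile(distributive, model):
        """Model-dependent optimizing compiler: schedules the graph onto
        the concrete HW model it was given (here just a unit count)."""
        schedule, cycle = [], 0
        ready = list(distributive.ops)               # toy scheduler: ignores dependencies
        while ready:
            issued, ready = ready[:model["units"]], ready[model["units"]:]
            schedule.append((cycle, issued))
            cycle += 1
        return schedule

    app = first_level_compile([("a", ("load", [])), ("b", ("load", [])),
                               ("c", ("add", ["a", "b"]))])
    print(d_compile(app, {"units": 2}))              # HW model 1: two units
    print(d_compile(app, {"units": 1}))              # HW model 2: one unit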
19
Requirements for New Architecture Hardware
• Hardware should not do any optimizations (e.g. BPU, prefetching), as it doesn’t
have any information about the algorithm being executed
• Release hardware from the necessity to analyze binaries and extract parallelism
• Hardware should only allocate resources according to compiler instructions
• Hardware should avoid “artificial binding” such as a single instruction pointer, vectors,
cache lines, full virtual pages, etc.
• Hardware should give the compiler a possibility to change HW configuration for
better optimizations (“Lego Set” HW)
• Hardware should use object oriented memory (like in Elbrus computers)
20
NArch Architecture (constrained compatible case)
• The semantics of legacy binaries cannot be changed due to compatibility requirements
• The only possible relaxation is to change the way this semantics is presented to HW, in
explicitly parallel form, for execution
• Release hardware from the necessity to analyze binaries and extract parallelism
• Let the software layer be responsible for finding available parallelism and optimizations (via
Binary Translation technology)
• Let HW be responsible for optimal scheduling only (remove unneeded complexity from
hardware and make it simpler) – like in the unconstrained case
• Binary Translation actually allows using all mechanisms of the unconstrained architecture,
with the addition of:
o Memory ordering rules and retirement
o Checkpoints for target context reconstruction and event processing
o A memory renaming technique to resolve memory conflicts in binaries, via a bigger register file and a
special guard HW structure
• Unfortunately, for semantics-compatibility reasons the constrained architecture cannot
support security or aggressive procedure-level parallelization
21
Functionality (Semantics)
of Basic Elements
22
Method of New Functionality Design
 In the constrained architecture the functionality (semantics) of all its elements
(data and operations) is strictly determined by compatibility requirements
 But first let’s consider the unconstrained computer system and its elements,
which were developed in accordance with the approach described above
 Note: all technologies and mechanisms are appropriate for both the constrained
and the unconstrained systems
23
Primitive Data Types & Operations
 Primitive data types (HW keeps their types together with the value):
– Potential infinity (integer)
– Potential continuity (floating point)
– Predicates
– Enumerable types (e.g. character)
– Uninitialized data
– Data Descriptor and Functional Descriptor (“auxiliary” data types for technical
operations)
 Primitive Data Types are Dynamic Data Types
– Value is kept together with tag
 Type Safety Approach
– All primitive operations check types of their arguments
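A minimal Python sketch of the dynamic (tagged) primitive data types described above, assuming a purely software model: every value carries its tag, and a primitive operation checks the tags of its arguments before executing. The Tagged class and the add function are illustrative names.

    from dataclasses import dataclass

    PRIMITIVE_TAGS = {"int", "float", "predicate", "enum", "uninitialized",
                      "data_descriptor", "functional_descriptor"}

    @dataclass(frozen=True)
    class Tagged:
        tag: str        # the tag travels with the value, as in HW
        value: object

    def add(a: Tagged, b: Tagged) -> Tagged:
        # Type-safety approach: every primitive op checks its argument tags.
        if a.tag not in PRIMITIVE_TAGS or b.tag not in PRIMITIVE_TAGS:
            raise TypeError("unknown tag")
        if a.tag != b.tag or a.tag not in ("int", "float"):
            raise TypeError(f"add: incompatible argument tags {a.tag}, {b.tag}")
        return Tagged(a.tag, a.value + b.value)

    print(add(Tagged("int", 2), Tagged("int", 3)))        # Tagged(tag='int', value=5)
    try:
        add(Tagged("int", 2), Tagged("uninitialized", None))
    except TypeError as e:
        print("trapped:", e)                              # operation is rejected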
24
User Defined Data Types (Objects)
 The “natural” requirements for the new architecture to support language level
functionality, consistent with “abstract algorithm” ideas:
1. Every procedure can generate a new data object and receive a reference to this new
object
2. This procedure, using the received reference, can do everything possible with this new object
(read data from the object, update its content, execute the object as a program, and
delete the object)
3. Right after the object is generated, no other procedure can access it; but this procedure
can give a reference to the object, with all or a subset of the rights listed above, to anybody it
knows (i.e. has a reference to)
4. Any procedure can generate a copy of a reference to any object it is aware of, with decreased
rights
5. After the object has been deleted, nobody can access it (all existing references become invalid)
 Object-oriented data creation is an important step toward structuring data
according to the semantics of the source algorithm
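A minimal Python sketch of requirements 1-5 above, assuming a software model of references: a reference carries a set of rights, a copy may only shrink those rights, and deleting the object invalidates every outstanding reference. The ObjectStore and Reference names are illustrative.

    class ObjectStore:
        def __init__(self):
            self._objects, self._next = {}, 0

        def new_object(self, data):
            # (1) the creating procedure gets a reference with all rights
            oid, self._next = self._next, self._next + 1
            self._objects[oid] = data
            return Reference(self, oid, {"read", "write", "execute", "delete"})

    class Reference:
        def __init__(self, store, oid, rights):
            self._store, self._oid, self.rights = store, oid, set(rights)

        def restrict(self, rights):
            # (3)(4) a copy may carry only a subset of the original rights
            return Reference(self._store, self._oid, self.rights & set(rights))

        def read(self):
            # (5) after deletion the object id is gone and access fails
            if "read" not in self.rights or self._oid not in self._store._objects:
                raise PermissionError("invalid reference or missing right")
            return self._store._objects[self._oid]

        def delete(self):
            if "delete" not in self.rights:
                raise PermissionError("no delete right")
            self._store._objects.pop(self._oid, None)

    store = ObjectStore()
    ref = store.new_object({"x": 1})
    weak = ref.restrict({"read"})      # hand out a read-only capability
    print(weak.read())                 # {'x': 1}
    ref.delete()
    # weak.read() would now raise: all existing references are invalid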
25
Dangling Pointers and Memory Compaction
 To solve the dangling pointer problem (point 5) we must guarantee that after an object
has been deleted, no one can access the memory occupied by this object.
 The de-allocation procedure frees the physical memory, but not the virtual memory. So
physical memory can be reused, while the virtual memory remains allocated
 The well-known classical solution is a garbage collection algorithm, but it is an inefficient
solution to the dangling pointer problem
 When virtual memory gets close to its limit, the system starts compacting the virtual
memory
 The compaction algorithm*):
– Each Data Descriptor is tagged, i.e. there is a special bit in registers and in memory which marks Data
Descriptors
– The system identifies which Data Descriptors are useless (point to objects de-allocated in physical memory)
and replaces them with Uninitialized data, or simply redirects them to a non-existent memory page, thus releasing
the virtual pages the descriptor had pointed to (according to the size of the object)
– The rest of the objects are moved to the vacant virtual memory, and their Data Descriptors’ base addresses are
replaced by the new virtual addresses
 This compaction can be performed as a background process (see the sketch below)
*) Note: this compaction algorithm was implemented in the Elbrus-1 and Elbrus-2 computers; it can be modified
to make it more efficient
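A minimal Python sketch of the compaction step described above, assuming descriptors are identifiable by their tag: useless descriptors are replaced by Uninitialized data, and live objects are moved down in virtual memory with their descriptor bases updated. The data layout here is purely illustrative.

    def compact(descriptors, live_objects):
        """descriptors: list of dicts {'base': int, 'size': int} (tagged DDs)
        live_objects:  dict base -> size of objects still allocated"""
        new_base, relocation = 0, {}
        for base in sorted(live_objects):                 # move live objects down
            relocation[base] = new_base
            new_base += live_objects[base]
        compacted = []
        for dd in descriptors:
            if dd["base"] in relocation:                  # still-live object
                compacted.append({"base": relocation[dd["base"]], "size": dd["size"]})
            else:                                         # dangling descriptor
                compacted.append({"tag": "uninitialized"})
        return compacted, new_base                        # next free virtual address

    dds = [{"base": 100, "size": 16}, {"base": 300, "size": 8}]
    print(compact(dds, live_objects={300: 8}))
    # ([{'tag': 'uninitialized'}, {'base': 0, 'size': 8}], 8)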
26
Procedures
 Procedure is the fundamental notion of HLL. Every procedure has a reference to its code and context.
 The procedure context consists of the code, global data, parameters/return data, and its local data
 A procedure can be called via Functional Descriptor only (tagged value)
[Diagram: a Functional Descriptor is a tagged value holding the entry-point address and a reference to the
global context; the procedure context comprises global data, the procedure code, and local data]
1. A procedure can create a Functional Descriptor (FD) with a special instruction, providing an entry-point
address and a Data Descriptor to some context as arguments, i.e. any procedure can define another
procedure
2. A procedure that has generated this FD can give the new FD to anybody it has access to, and this
new owner can also call the new procedure via the FD
3. A procedure that generates an FD includes references to the code and global data in this FD
4. A procedure that got the FD of the new procedure can call this procedure and pass it some
parameters (atomically)
5. The caller can receive some return data as a result of the procedure execution. Data return is logically an
atomic action
6. The called procedure cannot use anything beyond the context provided to it by the functional
descriptor and the parameters
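A minimal Python sketch of points 1-6 above, assuming a software model of the Functional Descriptor as a tagged pair (entry point, context reference); the called procedure sees only that context plus the parameters. The names FD, make_fd and call are illustrative.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FD:                      # Functional Descriptor (a tagged value in real HW)
        entry: callable            # entry-point "address"
        context: dict              # reference to the global context

    def make_fd(entry, context):
        # (1) any procedure can define another procedure by creating an FD
        return FD(entry, context)

    def call(fd: FD, params):
        # (4)(5) parameters are passed and results returned atomically;
        # (6) the callee gets nothing beyond its context and the parameters
        return fd.entry(fd.context, params)

    def counter_proc(ctx, params):
        ctx["count"] += params["step"]      # modifies callee-private global data
        return ctx["count"]

    fd = make_fd(counter_proc, {"count": 0})
    print(call(fd, {"step": 5}))   # 5
    print(call(fd, {"step": 2}))   # 7 - the caller never touches ctx directly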
27
Capability Mechanism
 Only the system that provides type safety allows the correct implementation of the
procedure mechanism. A procedure can be called via Functional Descriptor only
 A procedure has access to its own context only. No other procedure can access this
procedure’s context unless it has been passed to that other procedure as a parameter
 This approach introduces very strong inter-procedure protection
 A Data Descriptor (DD) and a Functional Descriptor (FD) are capabilities to do something
for the procedure that has the DD or FD in its context:
– A DD is a capability to access some object
– An FD is a capability to do something – to execute some procedure, which can modify global data
of the called procedure, data that is not directly accessible to the caller
 Some operations that must work with bit-level representations
of special data types like DD and FD (the COMPACTION algorithm is a good example)
need operation support in HW. These are also primitive
operations; however, only a limited number of procedures should be able to use
them
28
Full Solution of Security Problem
 The described approach does not need a privileged mode for system programming
– E.g. in Elbrus, all programs, including OS, are written as “application” programs
 The capability approach is more powerful and more general than the privileged-mode
approach (consistently implemented in Elbrus; no C-lists, which are a wrong approach)
 However, even this architecture cannot protect against mistakes in user
programs. Probably, the only possible remedy in this case is the ability to prove the
correctness of user and kernel programs
– A formal proof of functional correctness was done for the seL4 microkernel in 2009 by the NICTA group
(National Information and Communications Technology, Australia)
 Even then, the suggested architecture can help to considerably simplify
the proof of program correctness (for both the kernel and applications)
29
Implementation of the
Described Functionality
30
Object Oriented Memory (OOM) Structure
 Object oriented memory was initially introduced in the Burroughs B5500 computer architecture, but was not
implemented correctly
 All the basic principles were first carefully designed in Elbrus 1 (1972-78)
 Present-day memory and cache systems are corrupted by compatibility with the linear memory structure of old
computers. This means a future system should not use the traditional memory and cache organization,
which prevents the compiler from applying efficient optimizations
 Even for the constrained architecture, the OOM structure (according to preliminary estimations) can decrease
cache sizes by 2-3x and nearly eliminate performance losses due to cache misses
 Object oriented physical memory approach:
– The size of physical memory allocated for an object is equal to the object size
– Each allocated object is also loaded in the virtual space with pages of fixed size
– Each new object in virtual space is allocated starting from a new page contiguously (if the size of the object is
smaller than the page size, then the end of the virtual space of this page is empty)
[Diagram: objects of arbitrary size are packed contiguously in physical memory; in virtual memory each
object starts on a new page, and the tail of its last page remains empty]
31
Object Oriented Memory: Objects Naming Rules
 OOM uses virtual numbers of objects instead of virtual memory addresses
 A virtual page number is allocated sequentially during each object generation
 A system register keeps the next free object number, to be used for the next object
being generated
 We will sometimes use the expression “virtual address” meaning “virtual number” (see the sketch below)
[Diagram: a virtual address consists of the object’s virtual number N plus an index within the object;
object N occupies virtual pages N(1), N(2), …, with the tail of the last page empty; the Next Object
Number system register holds N+1]
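A minimal Python sketch of these naming rules, assuming a software model: object numbers come from a "next object number" register, each object starts on a fresh virtual page, and a "virtual address" is simply (object number, index). The ObjectNaming class is an illustrative name; the 4 KB page size is an assumption.

    PAGE_SIZE = 4096

    class ObjectNaming:
        def __init__(self):
            self.next_object_number = 0          # the system register
            self.sizes = {}                      # object number -> object size

        def generate(self, size):
            n = self.next_object_number          # allocate the next free number
            self.next_object_number += 1
            self.sizes[n] = size
            pages = -(-size // PAGE_SIZE)        # object starts on a fresh page;
            return n, pages                      # tail of the last page stays empty

        def virtual_address(self, obj_number, index):
            assert 0 <= index < self.sizes[obj_number]
            return (obj_number, index)           # "virtual address" = number + index

    naming = ObjectNaming()
    n, pages = naming.generate(6000)             # this object occupies 2 virtual pages
    print(n, pages, naming.virtual_address(n, 4097))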
32
Allocation of Objects and Sub-objects in Caches
 Unlike the TLB used in contemporary computers, in this OOM architecture the TLB
translates a virtual address not into a physical memory address but directly into the physical
location, in some specific cache, where this piece of data resides
 In each specific cache, as well as in memory, the new architecture does not use cache
lines (as superscalar does)
 An object’s parts allocated at the cache levels are split into smaller pieces, and all these pieces
belong to the same virtual page
 Each cache level could have its own small TLB
33
Generation of an Object
 A special instruction in HW is used to generate an object (no SW library calls, as e.g.
malloc, no OS system calls)
 The list of all occupied spaces is contained in the TLB, and the system maintains special lists for
all free spaces. Each free-list holds free areas of a certain set of sizes (most
likely powers of 2)
 For physical address allocation, HW takes a physical address from one of the free-lists
(the first empty chunk from the corresponding list, also held in a special HW register)
 The result of the instruction execution is the corresponding Data Descriptor.
[Diagram: the GENOBJ instruction takes an object type and an object size, draws a chunk from the
free-lists (sizes 2, 4, 8, …, 2^N), and returns the corresponding Data Descriptor]
Note: links are located inside the free memory chunks
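A minimal Python sketch of GENOBJ-style allocation from power-of-two free-lists, assuming requests are rounded up to the next power of two and larger chunks are split when needed. The FreeLists class and gen_obj method are illustrative names; in the real design this is a single HW instruction, not library code.

    class FreeLists:
        def __init__(self, total_memory):
            self.lists = {total_memory: [0]}          # chunk size -> list of chunk addresses

        def gen_obj(self, obj_type, size):
            need = 1 << (size - 1).bit_length()       # round request up to 2^k
            k = need
            while k not in self.lists or not self.lists[k]:
                k <<= 1                               # take the first big-enough chunk
            addr = self.lists[k].pop(0)
            while k > need:                           # split the chunk, keep the halves
                k >>= 1
                self.lists.setdefault(k, []).append(addr + k)
            # the result of the "instruction" is a Data Descriptor for the new object
            return {"tag": "DD", "type": obj_type, "base": addr, "size": size}

    heap = FreeLists(total_memory=64)
    print(heap.gen_obj("record", 6))    # Data Descriptor with base 0, size 6
    print(heap.gen_obj("array", 16))    # next allocation reuses a split-off chunk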
34
The Compiler Controls OOM Usage
 This memory/cache system organization allows the compiler to have a strong control of
execution process
 Compiler is aware of all program semantics information and can perform more sophisticated
optimizations
 The compiler can preload the needed data into a high-level cache first, without assigning the more
precious register memory, and move the data from cache to registers only at the last
moment. Even preloading directly into the registers can now sometimes be a good
alternative, since we have a big register file
 This cache organization allows an instruction to access the first-level cache directly by
physical address, without using a virtual address and associative search
 To do this, the base register (BR) can support a special mode in which it holds pointers
to physical locations in the first-level cache together with the virtual address
35
Explicitly Parallel Instruction Execution in NArch+
 In NArch+ architecture all mutually independent executable objects can be
executed in parallel to each other. This includes:
– Operations
– Chains of dependent operations inside scalar and/or iterations of loop code
– Procedures
– Jobs
 NArch+ overcomes the difficulties and constraints of the Data Flow and single-IP
approaches and excludes any “artificial binding” in HW (the program is a parallel
graph)
 Two different approaches have been investigated in NArch+ for program data
graph execution: strands and streams (see next slides)
36
STRANDs Oriented Architecture
• Strands express parallelism via chains of (mainly) data-dependent operations (in a more natural
way than e.g. VLIW) and provide a new way of presenting parallelism to OoO HW
• Simple instruction scheduling for parallel execution
– Only the oldest instructions in each strand need to be examined (a much smaller and simpler RS)
• Strands also provide:
– A bigger effective instruction window
– Reduced register usage (via intra-strand accumulators)
– A wider instruction issue width (via clustering with register-to-register communication)
• Adding the ability to express parallelism in the uISA gives additional advantages, e.g. superior control
over speculation and power, better HW utilization, many more opportunities for
optimization, and help in resolving the memory latency issue
[Diagram: the original data graph is cut into strands (IP1, IP2, IP3); each cluster has its own HW
scheduler, register file, and execution units, and Cluster 1 and Cluster 2 communicate through an
interconnect]
37
Drawbacks of the STRANDs Architecture
 Strands are extracted from the program data graph by the compiler
 Each strand is executed by HW in-order, but out-of-order relative to other strands
 HW allocates a set of resources for each active strand (called a WAY)
 The compiler creates a strand via a special FORK operation, which takes a free WAY for the
strand’s execution
 BUT the compiler has to be aware of the number of WAYs available in HW and schedule
strands accordingly. Otherwise a deadlock is possible (e.g. there is no free WAY to
spawn a new strand, while other strands are waiting for some result from that new strand)
 Treating the strand (WAY) as a compiler-visible resource potentially limits parallelism
[Diagram: FORK A and FORK B spawn strands A and B onto Way 0, Way 1, Way 2]
38
DL/CL Mechanism for Register/Predicate Reuse
 Definition-Line (DL):
– A Definition Line (DL) is a group of DL-instructions in different streams, which
form an explicit DL-front dividing the streams into intervals
– The DL-front crosses all live streams according to possible timing analysis.
Fronts are successive – they do not cross each other
 Check-Line (CL):
– A Check Line (CL) is a group of CL-instructions suspending the execution of
some streams until the specified DL-front has been completely passed
– After that, the corresponding register/predicate resource can be safely reused
[Diagram: streams A through S unfolding over time, crossed by successive DL-fronts (+DL); a CL
instruction suspends streams until the specified DL-front has been completely passed]
39
Intelligent Branch Processing
– Conventional: Branch predict one
path, discard everything when wrong
– New Architecture: Speculate when
necessary, discard only misspeculated
work
– Increases performance
– Reduces wasted energy due to
misspeculation
– According to our statistics, 80% of
branches are not critical and can be
executed without speculation
40
STREAMs Oriented Architecture
Streams and How They Get Created
• First let’s describe the simplest case, when the algorithm to be executed is scalar by its nature (an acyclic
data-dependency graph) without conditional branches
• Let’s have the total number of operations equal to the number of available registers (single assignment,
no register reuse)
• For this simple case:
– No decoding stage (each instruction is ready to be loaded into the corresponding execution unit, the compiler
prepares the code)
– For each instruction in the graph the compiler calculates a “Priority Value Number” (PVN). This number is the
number of clocks from this instruction to the end of the graph along the longest path. The compiler presents the
code as a number of sequences of dependent instructions – “streams”
– As the first instruction of a new stream, the compiler takes the instruction with the highest PVN not yet included
in any other stream. For each next instruction in this stream, the compiler again selects the instruction with the
highest PVN among those data-dependent on the previous instruction in the stream, and so on, until the stream
reaches either the end of the scalar code or an instruction that already belongs to some other stream (see the sketch below)
[Diagram: a data-dependency graph and its decomposition into Stream 1, Stream 2, and Stream 3]
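A minimal Python sketch of the PVN computation and the greedy stream extraction described above, assuming an acyclic dependency graph and unit latency for every instruction. The function names compute_pvn and extract_streams are illustrative.

    def compute_pvn(succs):
        """succs: instruction -> list of data-dependent successor instructions."""
        pvn = {}
        def longest(i):
            if i not in pvn:
                pvn[i] = 1 + max((longest(s) for s in succs[i]), default=0)
            return pvn[i]
        for i in succs:
            longest(i)
        return pvn

    def extract_streams(succs):
        pvn = compute_pvn(succs)
        taken, streams = set(), []
        for start in sorted(succs, key=pvn.get, reverse=True):
            if start in taken:
                continue
            stream, cur = [], start
            while cur is not None and cur not in taken:
                stream.append(cur)
                taken.add(cur)
                nxt = [s for s in succs[cur] if s not in taken]
                cur = max(nxt, key=pvn.get) if nxt else None   # follow the highest PVN
            streams.append(stream)
        return streams

    graph = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
    print(extract_streams(graph))   # [['a', 'c', 'e'], ['b'], ['d']]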
41
Scalar Code Execution With STREAMs
Execution Engine (Workers)
 Register File:
– Each register has an EMPTY/FULL bit (EMPTY - to prevent from reading the register, when value is not ready yet, and
FULL – to prevent from writing to the register, when not all dependent instructions have consumed the value)
– Each register has an additional bit showing, if an operation generating the value for this register has been already sent to
an execution unit (EU) or is in the Reservation Station (RS)
 The main scheduling and execution mechanism for streams is the “worker” (16 workers per cluster; see the sketch below)
 How the workers work:
– Workers issue ready instructions to the RS/Execution units (the arguments are FULL, or predecessors are in the RS/EU)
– Each register has a list of streams, waiting for the result in this register
– If a waiting stream is ready for execution (the value is ready), it gets moved to the “waiting for a free worker” queue
– A free worker takes an instruction from the “waiting for workers queue” or from the Instruction Buffer
– If an argument of the next instruction in the stream is not ready yet, the worker stops executing this stream and puts it
into the waiting queue for this argument (register)
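A minimal Python sketch of the worker mechanism above, assuming a software model: each register has an EMPTY/FULL bit, each register keeps a list of streams parked on it, and a free worker walks a stream until an argument is not ready. The run_streams function is an illustrative name, and the returned issue order stands in for real RS/EU dispatch.

    from collections import deque

    def run_streams(streams, instrs):
        """streams: list of instruction-name lists; instrs: name -> (dest reg, src regs)."""
        full = set()                               # registers whose value is ready (FULL)
        waiting = {}                               # register -> streams parked on it
        ready = deque(range(len(streams)))         # streams waiting for a free worker
        pos, order = [0] * len(streams), []
        while ready:
            s = ready.popleft()                    # a free worker takes a stream
            while pos[s] < len(streams[s]):
                dest, srcs = instrs[streams[s][pos[s]]]
                missing = [r for r in srcs if r not in full]
                if missing:                        # argument not ready: park the stream
                    waiting.setdefault(missing[0], []).append(s)
                    break
                order.append(streams[s][pos[s]])   # issue the instruction, mark dest FULL
                full.add(dest)
                pos[s] += 1
                ready.extend(waiting.pop(dest, []))  # wake streams waiting on dest
        return order

    instrs = {"i1": ("r1", []), "i2": ("r2", []), "i3": ("r3", ["r1", "r2"])}
    print(run_streams([["i1", "i3"], ["i2"]], instrs))   # ['i1', 'i2', 'i3']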
42
NArch+: Scalar Code Execution
More Complex Case (Bigger Code)
 If the scalar code is big enough, the DL/CL technique is applied for register reuse, to guarantee
correct dynamic execution of streams and optimal utilization of the Instruction Buffer
 When the code before CL_N has been executed, it is necessary to preload the next part of the code,
between CL_N and CL_N+1. Similarly, when DL_N is crossed, the whole code area above it can be freed
 The size of the code between CL_N and CL_N+1 is no bigger than the size of the Register File
 Time of execution can be improved with the help of the Dynamic Feedback mechanism (both in
HW and SW)
 If there are conditional branches in the code, the compiler uses speculative streams to handle
these cases efficiently (predicated streams and GATE instruction to check predicate value and to
kill one of the streams in case of wrong speculation)
 More details on speculation techniques (e.g. load/store speculation, efficient branch handling
without branch prediction) would require more low-level micro-architecture details. Alas!
 This scalar technology is nearly the same both for constrained and unconstrained versions of the
architecture
This scalar code execution technique is a practical
implementation of Data Flow architecture
43
Summary: Strands vs. Streams
 Strands
– The mechanism of strand execution (one WAY per strand) is visible to the compiler, so the
compiler has to watch how many strands are going to be executed by HW at each moment, and
the number is limited by the number of WAYs in HW
– Cons: can lead to deadlock; limits parallelism due to explicit resource (WAY)
scheduling by the compiler
 Streams
– The compiler can create any number of streams; the mechanism of stream execution is not
visible to the compiler
– Pros: no deadlock; HW executes the original graph, a natural data-flow execution mechanism
[Diagram: in both cases the original program graph is executed by parallel HW (scheduler, RF, EXEC);
strands are bound to WAYs in HW, while streams are dispatched by workers through an RS]
44
NArch+: Code with Loops
 Use loop iteration parallelism (both intra-iteration and inter-
iteration) as fully as possible
 Loop iteration analysis performed by the compiler (see the sketch below):
– Find instructions that are self-dependent across iterations
– Find groups of instructions that, besides being self-dependent, are also
mutually dependent across iterations (“rings” of data dependency)
– The rest of the instructions form sequences or graphs of dependent
instructions (a number of “rows”)
– The result of each row is either an output of the iteration (a STORE, for
example) or is used by other row(s) or ring(s)
 Each “ring” and/or “row” loop produces data consumed by other
small loops. Each producer can have a number of consumers. However,
producer and consumer should be connected through a buffer, allowing the
producer to run ahead if the consumer is not yet ready to use the data
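A minimal Python sketch of the rings/rows analysis above, assuming the compiler already has the cross-iteration dependence graph of the loop body and that the networkx package is available; a "ring" is modeled as a strongly connected component (or a self-dependent instruction), and the "rows" are the acyclic remainder. The function name rings_and_rows is illustrative.

    import networkx as nx

    def rings_and_rows(edges):
        g = nx.DiGraph(edges)
        rings = [sorted(c) for c in nx.strongly_connected_components(g)
                 if len(c) > 1 or g.has_edge(*list(c) * 2)]   # multi-node SCC or self-loop
        in_ring = {i for ring in rings for i in ring}
        rows = [i for i in g.nodes if i not in in_ring]       # the acyclic remainder
        return rings, rows

    # i1 accumulates into itself across iterations (a ring); i2/i3 form a row
    # feeding a STORE, consuming the ring's output through a buffer.
    edges = [("i1", "i1"), ("i1", "i2"), ("i2", "i3"), ("i3", "store")]
    print(rings_and_rows(edges))   # ([['i1']], ['i2', 'i3', 'store'])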
45
Loops Handling in NArch+
 Differences between NArch and NArch+ in loops implementation:
– NArch+ does not need to support compatibility with Single IP approach; therefore, many different
loops can be executed together (even a “single” loop can also be executed out-of-order)
– NArch+ has a simple memory system without speculative buffers; therefore, in some cases
(speculation only) it is necessary to use other mechanisms and new HW support
 Types of loops, handled by NArch+:
– RECURRENT loop (including WHILE loop)
– DO ALL (trip count is known before the loop start)
– DO ALL (trip count becomes known during the loop execution only)
– Loop with a low-probability “maybe” dependence between iterations (through memory) (including
WHILE loops)
– Loop with “maybe” data dependence within iterations
46
Parallel Procedure Execution
• In the constrained architecture, a procedure can be executed on a varying number of
clusters, but no more than four.
• The compiler will try to inline as many called procedures as possible in order to exploit
the resulting procedure-level parallelism to the full.
• As usual in the constrained case, the caller waits for the end of the called procedure and
works with the same resources.
• A call, as well as a return, is logically an atomic step; however, to increase performance
using DL/CL technology there are prolog and epilog areas, where both caller
and callee work together without interfering with each other.
• In the unconstrained architecture, the new HLL allows parallel procedure execution, but
again each procedure uses no more than four clusters.
• If a procedure has a DO ALL loop inside, this loop can use all available HW
(up to all clusters on the chip – ~60 today).
47
All Basic Parts of Computer Technology and
Their Current Status
48
NArch/IA Architecture (IA compatible case study)
 NArch/IA is a new x86-compatible micro-architecture based on the strands approach
– NArch Strand – a sequence of (usually dependent, but can include control flow) operations with its own
IP; strands are executed out-of-order, in parallel
– BT parses IA binaries, extracts strands and provides them to HW for scheduling and execution
– Multiple strands allow overlapping of memory accesses (thus mitigating memory latency)
 A fairly wide CPU due to scalable clustering
– One or two bi-clusters (up to 4 clusters and a 24-instruction issue width – 16 strands per cluster)
– Clusters are tightly-coupled (register-to-register communication and synchronization)
 Very large sparse instruction window
– Much larger than in conventional superscalar (~1K instructions)
– Branch resolution in large window (no HW branch predictor)
– Memory disambiguation in large window
– Smart retirement in large window (no retirement for registers)
 Binary Translation for IA compatibility and enabling the NArch uarch
– Dynamic and static BT for maximum ST/MT performance and efficiency
 Highly parameterized architecture (scalability)
– Variable number of clusters / strands per cluster
– Dynamically reconfigurable machine (ST/MT)
The result is higher performance and lower power at the same time
49
Advantages of the New Architecture
Compatible (constrained) case
• This approach can ensure full compatibility with any of the existing binaries (ARM,
x86, POWER, RISC-V, etc.), or even with all of them on the same HW, with the
help of Binary Translation
• Preliminary investigations allow us to make the following rather reliable
predictions:
– A compatible version (NArch) can reach the best possible, un-improvable performance
restricted by binary semantics constraints (not by binary’s sequential presentation) and amount
of resources available for specific model only
– ~3x-4x ST performance @ unconstrained power vs. OOO Core
– ~2x ST performance @ iso-power
– Less than ~50% of power @ iso-performance
– ~2x MT performance @ iso-power vs OOO Core
50
Advantages of the New Architecture
Incompatible (unconstrained) case
• If we release HW architecture from the requirement to maintain
compatibility with old style programming, then:
– We can significantly simplify the architecture (e.g. 70-75% of the constrained architecture carries the
burden of maintaining compatibility with superscalar)
– Introduce explicit parallelism in programming languages to expose the algorithm structure to HW
more easily
– Introduce security in HW (tagged architecture) and, eventually, get rid of viruses and make
programming safe and reliable
– Get rid of obsolete cache memory hierarchy (object oriented memory)
– Eventually, increase significantly the performance (up to 5x-7x or even more)
– Improve scalability and universality (new distributive, HW model-oriented compiler)
– Build an absolutely un-improvable computer architecture
• As a result of the high universality of this architecture, we can hope that
all specialized applications like machine learning, computer vision, and graphics
will be supported well, with high performance
51
T H A N K Y O U !
Q & A
52
Intel Labs Joint Pathfinding
Backup Slides
53
Object Oriented Memory: TLB Structure
 Besides helping to translate a virtual address into the physical data location, each TLB entry can
include some documentation of the referenced object: its size, its user data type (Object Type
Name, OTN), and possibly some other information
 It also includes references to more detailed tables of the physical locations of all elements of this
object in the cache(s)
 Not every object needs to be present in memory. Some objects can be generated,
for example, in the DCU (level-1 cache) only
[Diagram: a Data Descriptor (access rights, object number, index) selects a TLB entry holding the object
size, object type, and sub-object information, plus the physical locations of the object or sub-object in the
DCU, MLC, LLC, or physical memory]
54
Advantages of Object Oriented Memory System
 Unlike superscalar, the OOM memory/cache system is visible to the compiler – no uncontrollable
physical pages, lines, or cache structures hidden from the compiler. This significantly
improves efficiency
 The explicit object-oriented structure helps to increase the efficiency of memory usage. All free
memory is explicitly visible to the compiler and HW
 The ability to access the first-level cache using physical addresses directly from instructions
promises a huge increase in efficiency
 Inexpensive memory allocation (without OS and library calls) also helps to increase
efficiency and makes it simpler to design the Operating System
 The eviction process is explicitly controlled by the compiler
 The compiler has full knowledge of the cache structure, can make nearly all procedure-local
data resident in the first-level cache, and can make it accessible by physical
addresses; this will substantially decrease cache misses
 Cache sizes will also be reduced
 The compiler can control object and sub-object allocation and preloading
55
STREAMs Oriented Architecture
Removing Drawbacks of STRANDs Approach
 Get the maximum parallelism available in the Program Data Graph and execute the graph itself
 Chains of data-dependent operations are still presented to HW, but they are just hints –
STREAMs – not a real resource
 A new mechanism for STREAM execution – WORKERs
 No deadlocks anymore, as streams are not a statically scheduled resource in the compiler
(any number of streams); HW “workers” dynamically choose operations from the ready
streams and dispatch them to the Reservation Station for execution
 More details on next slides…
[Diagram: a Program Data Graph decomposed into STREAMs; WORKERs pick ready operations from
the streams and dispatch them to the Reservation Station for execution]
  • 4. 4 My experience building real computers (continued) • First industrial implementation of an Out-of-Order superscalar computer – Elbrus 1 (implemented in 1978) was the first commercial implementation of OoO superscalar in the world (two-wide issue computer) – After the second generation of Elbrus computers in 1985, our team realized many weaknesses with superscalar approach and started looking for more robust solution of the parallel execution problem, leading us to VLIW. • Elbrus-3: A Very Long Instruction Word (VLIW) computer – Successful implementation of cluster-based VLIW architecture with fine grained parallel execution (Elbrus 3, end of 90s), probably for the first time in industry • Hardware assisted Binary Translation – Suggestion and the first implementation of Binary Translation (BT) technology for designing a new architecture, built on radically new principles, but binary compatible with the old ones (Elbrus 3, end of 90s). • Fine-grained parallel architecture – Design and simulation of radically new principles of fine-grained parallel architecture and extension of HLL (like EL – 76) and OS (like Elbrus OS kernels) for their support.
  • 6. 6 Drawbacks of Superscalar Paradigm - 1  Drawbacks of Superscalar architecture – Program conversion is rather complicated (parallel->sequential->parallel) – Superscalar architecture has a performance limit (regardless of available HW) – Inability to use properly all available HW – Even SMT mode cannot significantly improve efficiency (but decreases cache utilization efficiency instead) – Rather complicated VECTOR HW and MULTI-THREAD programming have to be used to compensate somehow for this performance limit – Today’s High-level languages (HLL) mirror the old and present-day architectures (linear data space, no explicit parallelism). As a result, current architecture has corrupted all today’s HLLs – Current organization of computations does not allow for good optimizations (necessary to have full information about the algorithm to be executed, and hardware, which will execute it) – Non-universal architecture
  • 7. Drawbacks of Superscalar Paradigm - 2
 Memory and cache organization
– Current architecture does not support object-oriented data memory
– This rules out true secure computing and proper debugging facilities
– The cache organization of today's architecture hides its internal structure, preventing the compiler from doing good optimizations; this was done for compatibility with the simple linear memory organization of older computers
 Superscalar architecture today is very close to an un-improvable state, with all of the above-mentioned drawbacks
 All of these drawbacks have a single source: current architecture inherited, as its basic principles, the principles of early-days computing with its strong HW size constraints
  • 8. 8 Beginning of Computer Era (early 50s – mid 90s) - 1  Single execution unit era – Amount of available HW was the main constraint – Single IP, single execution unit, linear memory of small size – Performance is just a number of executed operations (fast memory vs. ops execution time) – Binary programming was the most efficient method – The programmer was responsible for all optimizations as he knew both the algorithm and available HW resources. HW was very simple at that time, so the programmer was able to fulfil this job very well – The only reasonable HW improvement was the possibility to improve this single execution unit
  • 9. 9 Beginning of Computer Era (early 50s – mid 90s) - 2 • General results for that period architecture:  This architecture was un-improvable with corresponding constraints, because the main resource (single execution unit) was un-improvable (carry save and high radix arithmetic) and every architecture had to include it  This architecture was absolutely universal among programmable architectures, because any other architecture should include this single execution unit. No other architecture could work faster, or could have less HW. Usage of more HW ( more execution units, for example) was not possible because of the main constraint of available HW • Basic Architecture Decisions:  Single Instruction Pointer ISA  Simple linear memory organization  No data types support in HW Input binary includes instructions how to use resources, rather than the algorithm description
  • 10. 10 Superscalar Era (mid 90s – now) - 1  Constraints of Superscalar era – Significant Progress in Si technology, more HW available (HW constraint was removed), faster execution, but slow memory – Superscalar still is unable to use efficiently all HW for a single job – Implicit parallelization, but it requires to convert a linear single IP execution flow into the parallel form in HW – The original completion ordering has to be preserved, from parallel execution into the consecutive retirement (compatibility with the preceding decisions) – Simple linear memory organization, no support for data types
  • 11. 11 Superscalar Era (mid 90s – now) - 2  Outcome of this period:  Sub-optimal functionality (semantics of data and operations) – Without dynamic data types support in HW it is impossible to implement real high level programming and true security computing  Sub-optimal performance – Programmer doesn’t know the details of rather complicated HW and as a result is unable to fully control optimizations made by HW – The compiler does not have all information about the algorithm being compiled (due to corrupted High-Level languages), and on the other side, the compiler is too far from the HW and is unable to fully utilize the HW and the internal HW structures (e.g. caches), which are hidden from the compiler – Superscalar Hardware is expressed via ISA only (which inherits all obsolete solutions), no ability to provide the algorithm to such kind of HW, and all HW machinery (BPU, renaming, cache organization, etc.) is designed to support compatibility with limited performance improvement
  • 12. 12 New Post-SuperScalar Architecture (what we call “Best Possible” Computer System)
  • 13. 13 Algorithmically Oriented Post-Superscalar Era  Changing the angle of view: – Algorithm of the program itself and data dependency are the real constraints of the performance and power – Move HW complexity into SW, free HW from code analysis and parallel conversion (closer to algorithm representation) – Move the design into a strongly opposite direction – from resources to algorithms care
  • 14. 14 Constraints in Architecture are the Real Limiter • Two designs will be considered:  CONSTRAINED system – New Architecture (NArch) constrained by compatibility with legacy binaries (x86, ARM, Power, etc.)  UNCONSTRAINED system – Advanced New Architecture (NArch+) without compatibility constraints (unconstrained), or more precisely – constrained only by the algorithm to be executed, or by HW resources of the processor • All past designs have reached their constraints: – Arithmetic, Early day Single Execution Unit architecture, Superscalar, Functionality of High level programming • Therefore, to make the next step we should find some way of how to relax (for the first case of future architecture), or to remove (for the second case) the constraints
  • 15. 15 Basic Approach for New Architecture Design • Let’s first design the best possible unconstrained architecture • The constrained architecture is going to be just the unconstrained architecture limited by several mechanisms, required for compatibility support • So we will get the Best Possible unconstrained and constrained architectures then! • Three components must be fully investigated and designed to get the Best Possible Architecture:  Language  Compiler  Hardware
  • 16. New High-Level Programming Support
 The compiler should have full information about the algorithm being compiled
 The new programming language should be able to expose the details of the algorithm to the compiler and, eventually, to HW
 The programmer should optimize only the algorithm, not its execution
 The new language should have the following main features:
– Ability to express the parallel fine-grained structure of the algorithm in a perfectly clear and convenient (for the programmer) manner
– Right functionality (semantics) of its elements, including dynamic data types and capability support *)
– Ability to present exhaustive information about the algorithm
*) This feature was completely implemented in the EL-76 language used in several generations of Elbrus computers in Russia
  • 17. 17 Compiler  Role of compiler: – Compiler is responsible for all optimizations (not HW) – To do this it should be model local, which allows it to have all information about model configuration – It gets all information about the algorithm from the program text after a simple transformation into an intermediate distributive to be compiled to different computer models – No information losses during compilation (full algorithm representation) – Compiler can use some dynamic information from the execution for being able to tune optimizations dynamically  The structure of HW elements should be appropriate for good optimizations controlled by the compiler  Local to model compiler removes compatibility requirements from HW, as HW can be changed more freely, if it’s needed to satisfy some requirements (e.g. performance, power, market segments, etc.)
  • 18. Process of Compilation
 The first-level compiler generates a distributive without any optimizations (a simple transformation from source code to a data-flow graph, with no information losses)
 The optimizing "real" compiler (distributive compiler, or D-compiler) is model dependent and generates optimized application code from the app distributive (using dynamic feedback for tuning), as sketched below
[Figure: application source code goes through first-level compilation (transformation) into a distributive; per-model optimizing D-compilers then produce application code for HW model 1, HW model 2, etc., on top of each model's system layer]
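To make the two-stage flow concrete, here is a minimal, purely illustrative Python sketch (all class and function names are my own invention, not the NArch toolchain): the first stage turns a tiny program into a model-independent data-flow-graph "distributive", and a model-dependent "D-compiler" then schedules it for a hypothetical machine with a given number of execution units.

```python
# Illustrative sketch only: a toy "distributive" (data-flow graph) and a
# model-dependent second-stage compiler. Names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Node:                      # one operation in the data-flow graph
    op: str
    args: tuple = ()             # ids of producer nodes (leaf inputs omitted)

@dataclass
class Distributive:              # model-independent program representation
    nodes: dict = field(default_factory=dict)

def first_level_compile(program):
    """Stage 1: source -> distributive, no optimization, no information loss."""
    d = Distributive()
    for node_id, (op, args) in program.items():
        d.nodes[node_id] = Node(op, args)
    return d

def d_compile(dist, num_units):
    """Stage 2 (model dependent): greedy list scheduling onto `num_units` units."""
    done, schedule, remaining = set(), [], dict(dist.nodes)
    while remaining:
        ready = [n for n, node in remaining.items()
                 if all(a in done for a in node.args)]
        issued = ready[:num_units]           # issue at most num_units ops per cycle
        schedule.append(issued)
        for n in issued:
            done.add(n)
            del remaining[n]
    return schedule

# (a+b)*(c+d): the two adds are independent, the multiply depends on both
program = {0: ("add", ()), 1: ("add", ()), 2: ("mul", (0, 1))}
dist = first_level_compile(program)
print(d_compile(dist, num_units=2))   # model with 2 units: [[0, 1], [2]]
print(d_compile(dist, num_units=1))   # model with 1 unit:  [[0], [1], [2]]
```

The same distributive is compiled differently for each model, which is the point of keeping the first stage optimization-free.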
  • 19. 19 Requirements for New Architecture Hardware • Hardware should not do any optimizations (e.g. BPU, prefetching), as it doesn’t have any information about the algorithm being executed • Release hardware from the necessity to analyze binaries and extract parallelism • Hardware should only allocate resources according to compiler instructions • Hardware should avoid “artificial binding” as Single Instruction Pointer, vectors, cache lines, full virtual pages, etc. • Hardware should give the compiler a possibility to change HW configuration for better optimizations (“Lego Set” HW) • Hardware should use object oriented memory (like in Elbrus computers)
  • 20. 20 NArch Architecture (constrained compatible case) • The semantics of legacy binaries cannot be changed due to compatibility requirements • The only possible relaxation would be to change the way of how this semantics gets presented to HW in explicit parallel form for execution • Release hardware from the necessity to analyze binaries and extract parallelism • Let the software layer be responsible for finding available parallelism and optimizations (via Binary Translation technology) • Let HW be responsible for optimal scheduling only (remove unneeded complexity from hardware and make it simpler) – like in the unconstrained case • Actually Binary Translation allows using all mechanisms of the unconstrained architecture, with addition of: o Memory ordering rules and retirement o Checkpoint for target context reconstruction and events processing o Memory renaming technique for memory conflicts resolution in binaries via bigger register file and special guard HW structure • Unfortunately, due to semantics compatibility reasons the constrained architecture cannot support security and aggressive procedure level parallelization
  • 22. 22  In the constrained architecture functionality (semantics) of all its elements (data and operations) is strongly determined by compatibility requirements  But first let’s consider the unconstrained computer system and its elements, which were developed in accordance with the approach described above.  Note: All technologies and mechanisms are appropriate for both the constrained and the unconstrained systems Method of New Functionality Design
  • 23. 23 Primitive Data Types & Operations  Primitive data types (HW keeps their types together with the value): – Potential infinity (integer) – Potential continuity (floating point) – Predicates – Enumerable types (e.g. character) – Uninitialized data – Data Descriptor and Functional Descriptor (“auxiliary” data types for technical operations)  Primitive Data Types are Dynamic Data Types – Value is kept together with tag  Type Safety Approach – All primitive operations check types of their arguments
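As a rough software model of tagged, dynamically typed primitives (the tag encoding and the error behaviour here are assumptions of mine, not the Elbrus or NArch encodings): every value carries its tag, and every primitive operation checks the tags of its arguments before executing.

```python
# Illustrative model of tagged values and type-checked primitive operations.
# Tag names and checking rules are hypothetical, for exposition only.

from dataclasses import dataclass

INT, FLOAT, UNINIT, DATA_DESC = "int", "float", "uninitialized", "data_descriptor"

@dataclass(frozen=True)
class Tagged:
    tag: str
    value: object

def add(a: Tagged, b: Tagged) -> Tagged:
    """Primitive add: both arguments must be initialized numbers of the same tag."""
    if UNINIT in (a.tag, b.tag):
        raise TypeError("operand is uninitialized data")
    if a.tag != b.tag or a.tag not in (INT, FLOAT):
        raise TypeError(f"add is not defined for tags ({a.tag}, {b.tag})")
    return Tagged(a.tag, a.value + b.value)

x = Tagged(INT, 2)
y = Tagged(INT, 3)
print(add(x, y))                      # Tagged(tag='int', value=5)
try:
    add(x, Tagged(DATA_DESC, 0x1000)) # mixing an integer with a descriptor is rejected
except TypeError as e:
    print("rejected:", e)
```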
  • 24. 24 User Defined Data Types (Objects)  The “natural” requirements for the new architecture to support language level functionality, consistent with “abstract algorithm” ideas: 1. Every procedure can generate a new data object and receive a reference to this new object 2. This procedure, using received reference, can do everything possible with this new object (read data from this object and update the content, execute this object as a program, and delete the object) 3. No other procedure can access this object just after it was generated, but this procedure can give a reference to this object with all or limited rights listed above to anybody it knows (has a reference to it) 4. Any procedure can generate a copy of reference to any object it’s aware of with decreased rights 5. After the object has been deleted, nobody can access it (all existing references are invalid)  Data creation with orientation on objects is an important step for data structuring, according to semantics of the source algorithm
  • 25. 25 Dangling Pointers and Memory Compaction  To solve the dangling pointer problem (point 5) we must guarantee that after an object has been deleted, no one can access the memory occupied by this object.  The de-allocation procedure frees the physical memory, but not the virtual memory. So physical memory can be reused, but virtual memory still remains being allocated  The well-known classical solution is a garbage collection algorithm, but it’s inefficient for solution of the dangling pointer problem  When virtual memory gets close to its limit, the system starts compacting the virtual memory  The compaction algorithm*): – Each Data Descriptor is tagged, i.e. there is a special bit in registers and in memory which marks Data Descriptors – The system identifies what Data Descriptors are useless (point to objects de-allocated in physical memory) and replaces them by Uninitialized data, or just re-directs them to non-existent memory page, thus releasing the virtual pages which the descriptor had pointed to (according to the size of the object) – The rest of the objects are moved to the vacant virtual memory, and their Data Descriptor’s base address is replaced by the new virtual address  This compaction can be fulfilled as a background process *Note: this compaction algorithm has been implemented in Elbrus-1,2 computers, it can be modified to make it more efficient
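A minimal sketch of that compaction sweep, under simplifying assumptions (a flat Python list stands in for registers and memory, and only the tag distinguishes descriptors): dead descriptors are overwritten with uninitialized data, live objects are repacked under fresh virtual numbers, and every surviving descriptor is retargeted.

```python
# Illustrative compaction sweep over tagged Data Descriptors.
# Memory layout, tag representation and the renumbering policy are assumptions.

from dataclasses import dataclass

@dataclass
class DataDescriptor:
    object_number: int           # virtual object number the descriptor points to
    size: int

UNINITIALIZED = object()         # stand-in for the "uninitialized data" value

def compact(memory, live_objects):
    """memory: list of cells, some of which are DataDescriptors.
    live_objects: set of object numbers still allocated in physical memory."""
    # 1. Assign new, densely packed virtual numbers to the surviving objects.
    new_number = {old: new for new, old in enumerate(sorted(live_objects))}
    # 2. Sweep every cell: dead descriptors become uninitialized data,
    #    live descriptors are redirected to the object's new virtual number.
    for i, cell in enumerate(memory):
        if isinstance(cell, DataDescriptor):
            if cell.object_number in new_number:
                cell.object_number = new_number[cell.object_number]
            else:
                memory[i] = UNINITIALIZED   # dangling reference can never be used again
    return memory

mem = [42, DataDescriptor(7, 64), DataDescriptor(3, 16), "text"]
compact(mem, live_objects={7})
print(mem)   # descriptor to object 3 is now UNINITIALIZED; object 7 renumbered to 0
```

Because descriptors are tagged, the sweep can find every reference without knowing anything else about the program, which is why it can run as a background process.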
  • 26. Procedures
 The procedure is the fundamental notion of an HLL. Every procedure has a reference to its code and context.
 The procedure context consists of the code, global data, parameters/return data, and its local data
 A procedure can be called via a Functional Descriptor only (a tagged value)
[Figure: a tagged Functional Descriptor holds the entry point address and the global context; the procedure's context comprises global data, procedure code, and local data]
1. A procedure can create a Functional Descriptor (FD) with a special instruction, providing an entry point address and a Data Descriptor to some context as arguments, i.e. any procedure can define another procedure
2. The procedure which has generated this FD can give the new FD to anybody it has access to, and the new owner can also call the new procedure via the FD
3. A procedure that generates an FD includes references to the code and global data in this FD
4. A procedure which got the FD of the new procedure can call this procedure and pass it some parameters (atomically)
5. The caller can receive some return data as a result of procedure execution; data return is logically an atomic action
6. The called procedure cannot use anything beyond the context provided to it by the functional descriptor and the parameters
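A toy model of calling through a functional descriptor (the class names and the closure-based "entry point" are my own illustrative choices): the callee sees only the context bound into the FD plus the parameters it was passed, which is exactly the isolation property described above.

```python
# Illustrative Functional Descriptor: the callee can touch only its bound context
# and its parameters. Representation is hypothetical.

from dataclasses import dataclass

@dataclass
class FunctionalDescriptor:
    entry: callable          # stands in for the entry-point address
    context: dict            # stands in for the Data Descriptor of the global context

def call(fd: FunctionalDescriptor, *params):
    """Logically atomic call/return in the model: parameters in, return value out."""
    return fd.entry(fd.context, *params)

# A procedure defines another procedure by creating an FD over its own data.
def make_counter():
    ctx = {"count": 0}                       # context owned by the new procedure
    def entry(context, step):
        context["count"] += step             # only the bound context is reachable
        return context["count"]
    return FunctionalDescriptor(entry, ctx)

fd = make_counter()      # the creator may now hand fd to anybody it has access to
print(call(fd, 5))       # 5
print(call(fd, 2))       # 7 -- callers never see ctx directly, only via the FD
```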
  • 27. 27 Capability Mechanism  Only the system that provides type safety allows the correct implementation of the procedure mechanism. A procedure can be called via Functional Descriptor only  Procedure has access to its context only. No other procedure can access this procedure’s context, if it has not been passed as a parameter to that other procedure  This approach introduces a very strong inter-procedure protection  Data Descriptor (DD) and Functional Descriptor (FD) is a capability to do something for the procedure, which has DD or FD in its context: – DD is a capability to access some object – and FD is a capability to do something – execute some procedure, which can modify some global data in the called procedure, the data, which is not directly accessible by caller  Implementation of some operations, which should work with bit-level representations of special data types like DD, FD (COMPACTION algorithm is a good example) sometimes need operation support in HW. All these operations are also primitive operations; however, only a limited number of procedures should be able to use them
  • 28. 28 Full Solution of Security Problem  The described approach does not need a privileged mode for system programming – E.g. in Elbrus, all programs, including OS, are written as “application” programs  Capability approach is more powerful and more general than the privileged mode approach (consistently implemented in Elbrus; no C-list, which is wrong)  However, even this architecture cannot protect against mistakes in user programs. Probably, the only possible remedy in this case is possibility to prove correctness of user and kernel program – A formal proof of functional correctness was done for seL4 microkernel in 2009 by NICTA group (National Information and Communications Technology, Australia)  Even in this case, only the suggested architecture can be helpful to simplify considerably the proof of program correctness (for both kernel and applications)
  • 30. Object Oriented Memory (OOM) Structure
 Object-oriented memory was initially introduced in the Burroughs B5500 computer architecture, but was not implemented correctly
 All basic principles were carefully designed first in Elbrus 1 (1972-78)
 Present-day memory and cache systems are corrupted by compatibility with the linear structure of old computers. That means a future system should not use the traditional memory and cache organization, which excludes the compiler from applying efficient optimizations
 The OOM structure, even for the constrained architecture (according to preliminary estimations), can decrease cache sizes by up to 2-3 times and nearly eliminate performance losses due to cache misses
 Object-oriented physical memory approach:
– The size of physical memory allocated for an object is equal to the object size
– Each allocated object is also mapped into the virtual space using pages of fixed size
– Each new object in virtual space is allocated starting from a new page, contiguously (if the object is smaller than the page size, the rest of that page's virtual space stays empty)
[Figure: objects N and M occupy exactly their size in physical memory, while in virtual memory each starts on a page boundary, leaving the tail of its last page empty]
  • 31. Object Oriented Memory: Object Naming Rules
 OOM uses virtual numbers of objects instead of virtual memory addresses
 Virtual page numbers are allocated sequentially during each object generation
 A system register keeps the next free object number, to be used for the next object being generated
 We will sometimes use the expression "virtual address" meaning "virtual number"
[Figure: a virtual address is formed from the object's virtual number N plus an index into its virtual pages N(1), N(2), ...; the system register holds the next object number N+1]
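A simple allocator model along these lines (the page size, register names, and data structures are assumptions for illustration): each new object gets the next number from a "next object number" register and as many fixed-size virtual pages as its size requires, while physical memory holds only the exact object size.

```python
# Illustrative model of OOM object naming: sequential object numbers,
# fixed-size virtual pages, exact-size physical allocation. All parameters
# (PAGE_SIZE, field names) are hypothetical.

import math
from dataclasses import dataclass

PAGE_SIZE = 4096

@dataclass
class DataDescriptor:
    object_number: int
    size: int

class ObjectMemory:
    def __init__(self):
        self.next_object_number = 0      # models the "next object number" system register
        self.objects = {}                # object number -> (size, number of virtual pages)

    def generate(self, size: int) -> DataDescriptor:
        n = self.next_object_number
        self.next_object_number += 1                      # numbers are never reused
        pages = max(1, math.ceil(size / PAGE_SIZE))       # object starts on a fresh page
        self.objects[n] = (size, pages)
        return DataDescriptor(object_number=n, size=size)

    def translate(self, d: DataDescriptor, index: int):
        """'Virtual address' = object number + index within the object."""
        size, _ = self.objects[d.object_number]
        if not 0 <= index < size:
            raise IndexError("access outside the object bounds")
        return (d.object_number, index // PAGE_SIZE, index % PAGE_SIZE)

mem = ObjectMemory()
a = mem.generate(100)           # object 0: 1 virtual page, 100 bytes of physical memory
b = mem.generate(10_000)        # object 1: 3 virtual pages
print(mem.translate(b, 5000))   # (1, 1, 904)
```

Note how bounds are checked against the object's own size, not against a page boundary, which is what enables the debugging and security properties claimed earlier.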
  • 32. 32 Allocation of Objects and Sub-objects in Caches  Unlike today’s TLB, used in contemporary computers, in this OOM architecture TLB translates virtual address not into memory physical address, but directly into physical location in some specific cache, where this piece of data is located  In each specific cache, as well as in memory, the new architecture does not use cache lines (like superscalar does)  Object’s parts allocated on cache levels are split into smaller parts, and all these parts belong to the same virtual page  Each cache level could have its own small TLB
  • 33. Generation of an Object
 A special HW instruction is used to generate an object (no SW library calls such as malloc, no OS system calls)
 The list of all occupied spaces is contained in the TLB, and the system keeps special lists of all free spaces. Each free-list maintains free areas of a certain set of sizes (most likely powers of 2)
 For physical address allocation, HW takes a physical address from one of the free-lists (the first empty chunk from the corresponding list, reached through a special HW register)
 The result of the instruction execution is the corresponding Data Descriptor
[Figure: GENOBJ takes an object type and object size, draws a chunk from the power-of-two free-lists (2, 4, 8, ..., 2^N), and returns a Data Descriptor. Note: the list links are located inside the free memory chunks themselves]
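Below is a rough software analogue of such a GENOBJ path (the free-list granularity, the splitting policy, and the fallback when a list is empty are my assumptions): the request size is rounded up to a power of two, a chunk is popped from the matching free-list, and a data descriptor for the object is returned.

```python
# Illustrative power-of-two free-list allocator behind a GENOBJ-like operation.
# Splitting policy and descriptor contents are hypothetical.

from dataclasses import dataclass

@dataclass
class DataDescriptor:
    base: int
    size: int
    object_type: str

class FreeLists:
    def __init__(self, heap_base: int, heap_size: int):
        # one free-list per power-of-two size class; start with one big chunk
        self.lists = {heap_size: [heap_base]}

    def _size_class(self, size: int) -> int:
        c = 1
        while c < size:
            c <<= 1
        return c

    def genobj(self, object_type: str, size: int) -> DataDescriptor:
        c = self._size_class(size)
        # find the smallest non-empty class that can satisfy the request
        for cls in sorted(self.lists):
            if cls >= c and self.lists[cls]:
                base = self.lists[cls].pop(0)
                # return the remainder to the free-lists (buddy-style halving)
                while cls > c:
                    cls //= 2
                    self.lists.setdefault(cls, []).append(base + cls)
                return DataDescriptor(base, c, object_type)
        raise MemoryError("no free chunk large enough")

heap = FreeLists(heap_base=0, heap_size=1024)
d1 = heap.genobj("array", 100)    # rounded up to 128, carved out of the 1024 chunk
d2 = heap.genobj("record", 40)    # rounded up to 64
print(d1, d2, sep="\n")
```

Because the whole path is a single table lookup plus a list pop, it is plausible as one instruction in HW, which is the slide's point about avoiding malloc and OS calls.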
  • 34. 34 The Compiler Controls OOM Usage  This memory/cache system organization allows the compiler to have a strong control of execution process  Compiler is aware of all program semantics information and can perform more sophisticated optimizations  Compiler can preload the needed data to high-level cache, at first without assigning a more precious register memory, and can move these data from cache to registers only at the last moment. But now even preloading directly into the registers sometimes could be a good alternative – now we have a big register file.  This cache organization allows using access to the first level cache directly from an instruction by physical addresses without using virtual address and associative search.  To do this, the base register (BR) can support a special mode, in which it includes pointers to the physical location of the first level cache together with its virtual address.
  • 35. 35 Explicitly Parallel Instruction Execution in NArch+  In NArch+ architecture all mutually independent executable objects can be executed in parallel to each other. This includes: – Operations – Chains of dependent operations inside scalar and/or iterations of loop code – Procedures – Jobs  NArch+ overcomes difficulties and constraints of Data Flow and Single IP approaches, excludes any “artificial binding” in HW (program is a parallel graph)  Two different approaches have been investigated in NArch+ for program data graph execution: strands and streams (see next slides)
  • 36. STRANDs Oriented Architecture
• Strands express parallelism via chains of (mainly) data-dependent operations (in a more natural way than e.g. in VLIW) and provide a new opportunity for presenting parallelism to OoO HW
• Simple instruction scheduling for parallel execution
– Need to look only at the oldest instructions in each strand (much smaller and simpler RS)
• Strands also provide:
– Bigger effective instruction window
– Reduced register usage (via intra-strand accumulators)
– Wider instruction issue width (via clustering with register-to-register communication)
• Adding the ability to express parallelism in the uISA gives additional advantages, e.g. superior control over speculation and over power, better HW utilization, and many more opportunities for optimizations and for resolving the memory latency issue
[Figure: the original data graph is cut into strands IP1, IP2, IP3; each cluster (HW scheduler, register file, execution units) runs several strands, and Cluster 1 and Cluster 2 communicate over an interconnect]
  • 37. Drawbacks of the STRANDs Architecture
 Strands are extracted from the program data graph by the compiler
 Each strand is executed by HW in-order, but out-of-order relative to the other strands
 HW allocates a set of resources for each active strand (called a WAY)
 The compiler creates a strand via a special FORK operation, which takes a free WAY for the strand's execution
 BUT the compiler has to be aware of the number of WAYs available in HW and schedule strands accordingly. Otherwise there could be a deadlock situation (e.g. no free way to spawn new strands, while other strands are waiting for some result from the new strand)
 Having the strand (WAY) as a compiler-visible resource potentially limits parallelism
[Figure: Way 0 executes FORK A and FORK B, spawning strands A and B onto Way 1 and Way 2]
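The deadlock risk can be shown with a tiny scheduling model (entirely my own construction, with hypothetical strand and way structures): a parent strand holds its way while waiting for a child's result, so if no free way is left for the FORK, nothing in the machine can make progress.

```python
# Illustrative model of WAY exhaustion with strands. Structures are hypothetical.

def run(num_ways, strands):
    """strands: name -> list of steps; a step is ("fork", child) | ("wait", child) | "op".
    Returns "finished" or "deadlock"."""
    ways = {}                                     # way id -> strand name
    pc = {}                                       # program counter per started strand
    finished = set()

    def start(name):
        for w in range(num_ways):
            if w not in ways:
                ways[w], pc[name] = name, 0
                return True
        return False                              # no free WAY

    start("main")
    while len(finished) < len(strands):
        progressed = False
        for w, name in list(ways.items()):
            if pc[name] >= len(strands[name]):
                finished.add(name); del ways[w]   # strand retires, WAY is freed
                progressed = True
                continue
            step = strands[name][pc[name]]
            if step == "op":
                pc[name] += 1; progressed = True
            elif step[0] == "fork":
                if step[1] in pc or start(step[1]):   # fork succeeds only with a free WAY
                    pc[name] += 1; progressed = True
            elif step[0] == "wait" and step[1] in finished:
                pc[name] += 1; progressed = True
        if not progressed:
            return "deadlock"
    return "finished"

prog = {"main": [("fork", "A"), ("wait", "A")],
        "A":    [("fork", "B"), ("wait", "B"), "op"],
        "B":    ["op"]}
print(run(num_ways=3, strands=prog))   # finished
print(run(num_ways=2, strands=prog))   # deadlock: A holds a way while waiting for B
```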
  • 38. DL/CL Mechanism for Register/Predicate Reuse
 Definition-Line (DL):
– A Definition Line is a group of DL-instructions in different streams which form an explicit DL-front, dividing the streams into intervals
– A DL-front crosses all live streams according to the timing analysis; fronts are successive and do not cross each other
 Check-Line (CL):
– A Check Line (CL) is a group of CL-instructions suspending the execution of some streams until the specified DL-front has been completely passed
– After that the corresponding register/predicate resource can be safely reused
[Figure: operations in several streams laid out over time, crossed by successive +DL fronts, with a CL -2 instruction waiting for an earlier front to be passed]
  • 39. 39 Intelligent Branch Processing – Conventional: Branch predict one path, discard everything when wrong – New Architecture: Speculate when necessary, discard only misspeculated work – Increases performance – Reduces wasted energy due to misspeculation – According to our statistics, 80% of branches are not critical and can be executed without speculation
  • 40. STREAMs Oriented Architecture: Streams and How They Get Created
• First let's describe the simplest case, when the algorithm to be executed is scalar by nature (an acyclic data-dependency graph) without conditional branches
• Let the total number of operations equal the number of available registers (single assignment, no register reuse)
• For this simple case:
– No decoding stage (each instruction is ready to be loaded into the corresponding execution unit; the compiler prepares the code)
– For each instruction in the graph the compiler calculates a "Priority Value Number" (PVN): the number of clocks from this instruction to the end of the graph along the longest path. The compiler presents the code as a number of sequences of dependent instructions called "streams"
– As the first instruction of a new stream, the compiler takes the instruction with the highest PVN not yet included in any other stream. For each next instruction in this stream, the compiler again selects the instruction with the highest PVN that is data dependent on the previous instruction in the stream, and so on, until the stream reaches either the end of the scalar code or an instruction already belonging to some other stream (see the sketch below)
[Figure: a data-dependency graph and its decomposition into Stream 1, Stream 2, Stream 3]
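A small sketch of that decomposition, under the stated assumptions plus one of mine (unit latency per operation; the graph encoding and names are illustrative): PVN is computed as the longest path to a graph sink, and streams are peeled off greedily by PVN.

```python
# Illustrative PVN computation and greedy stream extraction from a
# data-dependency graph. Unit operation latency is assumed.

def compute_pvn(succs):
    """succs: node -> list of data-dependent successor nodes.
    PVN(n) = length in clocks of the longest path from n to the end of the graph."""
    pvn = {}
    def longest(n):
        if n not in pvn:
            pvn[n] = 1 + max((longest(s) for s in succs[n]), default=0)
        return pvn[n]
    for n in succs:
        longest(n)
    return pvn

def extract_streams(succs):
    pvn = compute_pvn(succs)
    assigned, streams = set(), []
    for start in sorted(succs, key=lambda n: -pvn[n]):   # highest-PVN unassigned node
        if start in assigned:
            continue
        stream, node = [], start
        while node is not None:
            stream.append(node)
            assigned.add(node)
            # follow the highest-PVN successor; stop at a node already in another stream
            nxt = [s for s in succs[node] if s not in assigned]
            node = max(nxt, key=lambda s: pvn[s]) if nxt else None
        streams.append(stream)
    return streams

# a -> c, b -> c, c -> e, d -> e   (e is the final operation)
graph = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
print(extract_streams(graph))   # [['a', 'c', 'e'], ['b'], ['d']]
```

The first stream follows the critical path; the remaining streams are short feeders that join it, which is the hint structure the workers consume.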
  • 41. 41 Scalar Code Execution With STREAMs Execution Engine (Workers)  Register File: – Each register has an EMPTY/FULL bit (EMPTY - to prevent from reading the register, when value is not ready yet, and FULL – to prevent from writing to the register, when not all dependent instructions have consumed the value) – Each register has an additional bit showing, if an operation generating the value for this register has been already sent to an execution unit (EU) or is in the Reservation Station (RS)  Main scheduling and execution mechanisms for Streams are “workers” (16 per cluster)  How the workers work: – Workers issue ready instructions to the RS/Execution units (the arguments are FULL, or predecessors are in the RS/EU) – Each register has a list of streams, waiting for the result in this register – If a waiting stream is ready for execution (the value is ready), it gets moved to the “waiting for a free worker” queue – A free worker takes an instruction from the “waiting for workers queue” or from the Instruction Buffer – If an argument of the next instruction in the stream is not ready yet, the worker stops executing this stream and puts it into the waiting queue for this argument (register)
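To make the worker discipline concrete, here is a much-simplified sequential simulation (single-cycle operations, no Reservation Station, hypothetical data structures): a worker walks a stream, issuing instructions whose arguments are FULL, and parks the stream on a register's waiting list when an argument is still EMPTY.

```python
# Much-simplified sequential model of "workers" executing streams.
# No RS, unit latency, one value per register; structures are hypothetical.

from collections import deque

def execute(streams, inputs):
    regs = dict(inputs)                  # register -> value; presence = FULL bit set
    pc = {s: 0 for s in streams}         # next instruction index per stream
    waiting = {}                         # register -> list of streams parked on it
    ready = deque(streams)               # "waiting for a free worker" queue

    while ready:
        s = ready.popleft()              # a free worker picks up a stream
        while pc[s] < len(streams[s]):
            dest, op, srcs = streams[s][pc[s]]
            missing = [r for r in srcs if r not in regs]    # EMPTY argument?
            if missing:
                waiting.setdefault(missing[0], []).append(s)
                break                    # worker abandons this stream for now
            regs[dest] = op(*[regs[r] for r in srcs])       # issue + execute
            pc[s] += 1
            for w in waiting.pop(dest, []):                 # result wakes parked streams
                ready.append(w)
    return regs

streams = {
    "S1": [("t1", int.__add__, ("a", "b")),     # t1 = a + b
           ("t3", int.__mul__, ("t1", "t2"))],  # t3 = t1 * t2 (needs S2's result)
    "S2": [("t2", int.__sub__, ("c", "d"))],    # t2 = c - d
}
print(execute(streams, {"a": 2, "b": 3, "c": 10, "d": 4}))
# {'a': 2, 'b': 3, 'c': 10, 'd': 4, 't1': 5, 't2': 6, 't3': 30}
```

The real machine runs many workers concurrently and uses FULL bits on physical registers; the sequential model only illustrates the park-and-wake protocol between streams.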
  • 42. 42 NArch+: Scalar Code Execution More Complex Case (Bigger Code)  If scalar code is big enough, the DL/CL technique is applied for registers reuse to guarantee correct dynamic execution of streams and optimal utilization of the Instruction Buffer  When the code before CLN has been executed, it is necessary to preload the next part of the code between CLN and CLN+1. Similarly, when DLN is crossed, all code area above can be freed  The size of code between CLN and CLN+1 is not bigger than the size of the Register File  Time of execution can be improved with the help of the Dynamic Feedback mechanism (both in HW and SW)  If there are conditional branches in the code, the compiler uses speculative streams to handle these cases efficiently (predicated streams and GATE instruction to check predicate value and to kill one of the streams in case of wrong speculation)  More details on speculation techniques (e.g. load/store speculation, efficient branch handling without branch prediction) would require more low-level micro-architecture details. Alas!  This scalar technology is nearly the same both for constrained and unconstrained versions of the architecture This scalar code execution technique is a practical implementation of Data Flow architecture
  • 43. Summary: Strands vs. Streams
 Strands
– The mechanism of strand execution (one way per strand) is visible to the compiler, so the compiler has to watch how many strands will be executed by HW at each moment, and their number is limited by the number of ways in HW
– Cons: can lead to deadlock, and limits parallelism due to explicit resource (ways) scheduling by the compiler
 Streams
– The compiler can create any number of streams; the mechanism of stream execution is not visible to the compiler
– Pros: no deadlock, HW executes the original graph, a natural data-flow execution mechanism
[Figure: for strands, the original program graph is mapped onto a fixed number of ways feeding the HW scheduler, register file, and execution units; for streams, the workers and reservation station consume the original program graph directly]
  • 44. 44 NArch+: Code with Loops  Use loop iteration parallelism (both iteration internal and inter- iteration) as fully as possible  Loop iterations analysis performed by the compiler: – Find instructions, which are self-dependent over iteration – Find the groups of instructions, which being self-dependent, are also mutually dependent over the iterations (“rings” of data dependency) – The rest of instructions create sequences or graph of dependent instructions (a number of “rows”) – The result of each row is either an output of the iteration (STORE, for example), or is used by another row(s) or ring(s).  Each “ring” and/or “row” loop is producing data, which are consumed by other small loops. Each producer can have a number of consumers. However, producer and consumer should be connected through a buffer, giving possibility for producer to go forward, if consumer is not ready yet to use these data
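The producer/consumer decoupling at the end of this slide can be sketched as follows (a deliberately software-level model with an arbitrary bounded-buffer size of my choosing; in HW this would be a small FIFO): a "ring" carrying a recurrence produces one value per iteration into a buffer, and an independent "row" consumes those values at its own pace.

```python
# Illustrative decoupling of a "ring" (loop-carried recurrence) from a "row"
# (independent per-iteration work) through a bounded buffer.

import threading, queue

N = 8
buf = queue.Queue(maxsize=2)     # producer may run ahead by at most 2 iterations
results = []

def ring():
    """Recurrence s[i] = s[i-1] + i: inherently serial, one value per iteration."""
    s = 0
    for i in range(N):
        s += i
        buf.put(s)               # blocks only if the consumer is 2 iterations behind

def row():
    """Independent per-iteration work consuming the ring's output."""
    for _ in range(N):
        results.append(buf.get() * 10)

p = threading.Thread(target=ring)
c = threading.Thread(target=row)
p.start(); c.start()
p.join(); c.join()
print(results)   # [0, 10, 30, 60, 100, 150, 210, 280]
```

The recurrence limits how fast the ring can run, but the row (and any further consumers) overlaps with it instead of waiting for the whole loop to finish.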
  • 45. 45 Loops Handling in NArch+  Differences between NArch and NArch+ in loops implementation: – NArch+ does not need to support compatibility with Single IP approach; therefore, many different loops can be executed together (even a “single” loop can also be executed out-of-order) – NArch+ has a simple memory system without speculative buffers; therefore, in some cases (speculations only) it is necessary to use some other mechanisms and a new HW support  Types of loops, handled by NArch+: – RECURRENT loop (including WHILE loop) – DO ALL (trip count is known before the loop start) – DO ALL (trip count becomes known during the loop execution only) – Loop with low probable “maybe” dependence between iterations (through memory) (including WHILE loop) – Loop with “maybe” data dependence within iterations
  • 46. Parallel Procedure Execution
• In the constrained architecture a procedure can be executed on a varying number of clusters, but no more than four
• The compiler will try to inline as many called procedures as possible, in order to exploit the resulting procedure-level parallelism to the full
• As usual for the constrained case, the caller waits for the end of the called procedure and works with the same resources
• A call, as well as a return, is a logically atomic step; however, to increase performance, DL/CL technology provides prolog and epilog areas where caller and callee work together without interfering with each other
• In the unconstrained architecture the new HLL allows parallel procedure execution, but again each procedure will use no more than four clusters
• If a procedure has a DO ALL loop inside, this loop can use all available HW (many, up to all clusters on the chip - ~60 today)
  • 47. 47 All Basic Parts of Computer Technology and Their Current Status
  • 48. 48 NArch/IA Architecture (IA compatible case study)  NArch/IA is x86 compatible new micro-architecture based on strands approach – NArch Strand – a sequence of (usually dependent, but can include control flow) operations with its own IP; strands are executed out-of-order, in parallel – BT parses IA binaries, extracts strands and provides them to HW for scheduling and execution – Multiple strands allow overlapping of memory accesses (thus improving memory latency)  A fairly wide CPU due to scalable clustering – One or two bi-clusters (up to 4 clusters and 24 instructions issue width – 16 strands per cluster) – Clusters are tightly-coupled (register-to-register communication and synchronization)  Very large sparse instruction window – Much larger than in conventional superscalar (~1K instructions) – Branch resolution in large window (no HW branch predictor) – Memory disambiguation in large window – Smart retirement in large window (no retirement for registers)  Binary Translation for IA compatibility and enabling NArch uarch – Dynamic and static BT for maximum ST/MT performance and efficiency  Highly parameterized architecture (scalability) – Variable number of clusters/strands per cluster – Dynamically reconfigurable machine (ST/MT) Result is higher Performance and lower power at the same time
  • 49. 49 Advantages of the New Architecture Compatible (constrained) case • This approach can ensure full compatibility with some of existing binaries (ARM, x86, POWER, RISC-V, etc.) or even with all of them on the same HW with the help of Binary Translation • Preliminary investigations allow us to do the following rather reliable predictions: – A compatible version (NArch) can reach the best possible, un-improvable performance restricted by binary semantics constraints (not by binary’s sequential presentation) and amount of resources available for specific model only – ~3x-4x ST performance @ unconstrained power vs. OOO Core – ~2x ST performance @ iso-power – Less than ~50% of power @ iso-performance – ~2x MT performance @ iso-power vs OOO Core
  • 50. 50 Advantages of the New Architecture Un-Compatible (unconstrained) case • If we release HW architecture from the requirement to maintain compatibility with old style programming, then: – We can significantly simplify the architecture (e.g. 70-75% of constrained architecture has the burden of maintaining compatibility with SS) – Introduce explicit parallelism in programming languages to expose the algorithm structure to HW more easily – Introduce security in HW (tagged architecture) and, eventually, get rid of viruses and make programming safe and reliable – Get rid of obsolete cache memory hierarchy (object oriented memory) – Eventually, increase significantly the performance (up to 5x-7x or even more) – Improve scalability and universality (new distributive, HW model-oriented compiler) – Build absolutely un-improvable computer architecture • As a result of high universality of this architecture we can hope that now all special applications like machine learning, computer vision, graphics will be supported well with high performance
  • 51. 51 T H A N K Y O U ! Q & A
  • 52. Intel Labs Joint Pathfinding: Backup Slides
  • 53. Object Oriented Memory: TLB Structure
 Each TLB entry, besides helping to translate a virtual address into the physical data location, can also include some documentation of the referenced object: its size, its user data type (Object Type Name - OTN), and possibly some other information
 It also includes references to more detailed tables of the physical locations of all elements of this object in the cache(s)
 An object does not necessarily have to be present in memory. Some objects can be generated, for example, in the DCU (Level 1 cache) only
[Figure: a Data Descriptor (access rights, object number, object size, object type) plus an index selects a TLB entry; the entry gives the physical location of the object or sub-object in the DCU, MLC, LLC, or physical memory]
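A schematic model of such a TLB entry (field names and the lookup API are hypothetical): the entry records the object's size and type and maps offsets within the object directly to a location in a specific cache level or, failing that, to memory.

```python
# Illustrative object-oriented TLB: object number -> documented entry whose
# sub-object table points straight into a cache level. Fields are hypothetical.

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TLBEntry:
    size: int
    object_type: str                                  # Object Type Name (OTN)
    # offset range (start, end) -> (cache level, location within that level)
    locations: Dict[Tuple[int, int], Tuple[str, int]] = field(default_factory=dict)

class ObjectTLB:
    def __init__(self):
        self.entries: Dict[int, TLBEntry] = {}

    def install(self, object_number: int, entry: TLBEntry):
        self.entries[object_number] = entry

    def lookup(self, object_number: int, offset: int):
        entry = self.entries[object_number]
        if not 0 <= offset < entry.size:
            raise IndexError("offset outside the object")
        for (start, end), (level, base) in entry.locations.items():
            if start <= offset < end:
                return level, base + (offset - start)   # direct cache-level location
        return "memory", offset                         # not cached (simplified fallback)

tlb = ObjectTLB()
tlb.install(7, TLBEntry(size=256, object_type="matrix_row",
                        locations={(0, 64): ("DCU", 0x40), (64, 256): ("MLC", 0x900)}))
print(tlb.lookup(7, 10))    # ('DCU', 74)
print(tlb.lookup(7, 100))   # ('MLC', 2340)
```

Because the translation lands directly in a cache level, no cache-line tags or associative search are needed on the access path, which is where the claimed efficiency comes from.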
  • 54. 54 Advantages of Object Oriented Memory System  Unlike superscalar, OOM memory/cache system is visible to compiler – no uncontrollable physical pages, lines, cache structure hidden from the compiler. This helps significantly to improve the efficiency  Explicit object oriented structure helps to increase efficiency of memory usage. All free memory is explicitly visible to the compiler and HW  Ability to access the first level cache using physical addresses directly from instructions promises a huge increase in efficiency  Inexpensive memory allocation (without OS and library calls) also helps to increase efficiency and makes it simple to design Operating System  Eviction process is explicitly controlled by the compiler  Compiler has full knowledge of cache structure and can make nearly all procedure-local data as resident in the first level cache and can make them accessible by physical addresses, this substantially will decrease cache misses.  Cache size will also be reduced  The compiler can control objects and sub-objects allocation and preloading
  • 55. STREAMs Oriented Architecture: Removing Drawbacks of the STRANDs Approach
 Get the maximum parallelism available in the Program Data Graph and execute the graph itself
 Chains of data-dependent operations are still presented to HW, but they are just hints - STREAMs, not a real resource
 New mechanism of STREAMs execution - WORKERs
 No deadlocks anymore, as streams are not a static scheduling resource in the compiler (any number of streams); HW "workers" dynamically choose operations from the ready streams and dispatch them to the Reservation Station for execution
 More details on next slides…
[Figure: the program data graph is split into streams; workers pick ready operations from the streams and feed them through the Reservation Station to execution]