Savitribai Phule PUNE UNIVERSITY
Reconfigurable Computing (504203) Module I Notes
INTRODUCTION
Computing architectures can be categorized into three main groups according to their degree of flexibility:
i. The general purpose computing group that is based on the Von Neumann (VN) computing paradigm;
ii. Domain-specific processors, tailored for a class of applications that share a wide range of common
characteristics;
iii. Application-specific processors tailored for only one application.
1. General Purpose Computing
General-purpose computers are based on the Von Neumann architecture, according to which a computer can have a
simple, fixed structure and still execute any kind of computation, given a properly programmed control, without the need
for hardware modification.
The general structure of a VN machine is as shown in figure below:
 It consists of:
 A memory for storing program and data. Harvard architectures contain two parallel accessible memories for
storing program and data separately.
 A control unit (also called control path) featuring a program counter that holds the address of the next instruction
to be executed.
 An arithmetic and logic unit (also called data path) in which instructions are executed.
 A program is coded as a set of instructions to be executed sequentially, instruction after instruction. At each step of the
program execution, the next instruction is fetched from the memory at the address specified in the program counter and
decoded. The required operands are then collected from the memory before the instruction is executed. After execution,
the result is written back into the memory.
 In this process, the control path is in charge of setting all signals necessary to read from and write to the memory, and to
allow the data path to perform the right computation. The data path is controlled by the control path, which interprets the
instructions and sets the data path’s signals accordingly to execute the desired operation.
 In general, the execution of an instruction on a VN computer can be done in five cycles: Instruction Read (IR) in which
an instruction is fetched from the memory; Decoding (D) in which the meaning of the instruction is determined and the
operands are localized; Read Operands (R) in which the operands are read from the memory; Execute (EX) in which the
instruction is executed with the read operands; Write Result (W) in which the result of the execution is stored back to the
memory.
 In each of those five cycles, only the part of the hardware involved in the computation is activated. The rest remains
idle.
For example, if the IR cycle is to be performed, the program counter will be activated to get the address of the instruction,
the memory will be addressed, and the instruction register that stores the instruction before decoding will also be activated.
Apart from those three units (program counter, memory and instruction register), all the other units remain idle.
Fortunately, the structure of instructions allows several of them to occupy the idle part of the processor, thus increasing
the computation throughput.
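To make this five-cycle flow concrete, the following C sketch simulates a tiny VN machine. This is a minimal sketch: the instruction format, opcodes and memory sizes are invented for illustration and are not part of these notes.

/* Minimal Von Neumann execution loop: one program counter, one data path,
 * everything serialized through memory. Hypothetical three-address ISA. */
#include <stdio.h>

enum { OP_ADD, OP_MUL, OP_HALT };
typedef struct { int op, src1, src2, dst; } Instr;

int main(void) {
    int data[16] = {2, 3};                 /* data memory: a = 2, b = 3     */
    Instr program[] = {                    /* program memory                */
        {OP_ADD, 0, 1, 2},                 /* data[2] = data[0] + data[1]   */
        {OP_MUL, 0, 1, 3},                 /* data[3] = data[0] * data[1]   */
        {OP_HALT, 0, 0, 0},
    };
    int pc = 0;                            /* program counter               */
    for (;;) {
        Instr ir = program[pc++];          /* IR: fetch into instr register */
        if (ir.op == OP_HALT) break;       /* D: decode                     */
        int a = data[ir.src1];             /* R: read operands from memory  */
        int b = data[ir.src2];
        int r = (ir.op == OP_ADD) ? a + b  /* EX: execute on the data path  */
                                  : a * b;
        data[ir.dst] = r;                  /* W: write result back          */
    }
    printf("sum=%d product=%d\n", data[2], data[3]);
    return 0;
}

Note how, at each step, only one of the five phases does useful work; pipelining (next section) overlaps these phases across different instructions.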
Instruction Level Parallelism
Pipelining is a transparent way to optimize hardware utilization as well as the performance of programs. Because the
execution of one instruction cycle affects only a part of the hardware, the idle parts can be activated by having many
different cycles executing together.
For one instruction, it is not possible to have many cycles executed together. For instance, any attempt to perform
the execute cycle (EX) together with the reading of an operand (R) for the same instruction will not work, because the
data needed for EX must first be provided by the R cycle.
Nevertheless, the two cycles EX and D can be performed in parallel for two different instructions. Once the data have
been collected for the first instruction, its execution can start while the data are being collected for the second
instruction. This overlapping in the execution of instructions is called pipelining or instruction level parallelism (ILP),
and it is aimed at increasing the throughput in the execution of instructions as well as the resource utilization. It should
be mentioned that ILP does not reduce the execution latency of a single instruction, but increases the throughput of a set
of instructions.
If tcycle is the time needed to execute one cycle, then the execution of one instruction will require 5 ∗ tcycle. If
three instructions have to be executed, then the time needed to execute those three instructions without
pipelining is 15 ∗ tcycle, as illustrated in the figure. Using pipelining, the ideal time needed to perform those three
instructions, when no hazards have to be dealt with, is 7 ∗ tcycle. In reality, we must take hazards into account,
which increases the overall computation time to 9 ∗ tcycle.
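The ideal pipelined time generalizes to (k + n − 1) ∗ tcycle for n instructions in a k-stage pipeline. The short C check below reproduces the figures above; the two hazard stalls are an assumption taken from this example, not a general rule.

/* Back-of-the-envelope pipeline timing for k stages and n instructions. */
#include <stdio.h>

int main(void) {
    const int k = 5, n = 3;       /* stages IR, D, R, EX, W; 3 instructions */
    int t_seq   = n * k;          /* no pipelining: 15 * tcycle             */
    int t_ideal = k + n - 1;      /* ideal pipeline:  7 * tcycle            */
    int stalls  = 2;              /* assumed hazard penalty from the text   */
    printf("sequential=%d ideal=%d with_hazards=%d (in units of tcycle)\n",
           t_seq, t_ideal, t_ideal + stalls);
    return 0;
}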
The main advantage of the VN computing paradigm is its flexibility, because it can be used to program almost all
existing algorithms. However, an algorithm can be implemented on a VN computer only if it is coded according to
the VN rules. We say in this case that 'the algorithm must adapt itself to the hardware'.
Disadvantage:
 Because all algorithms must be sequentially programmed to run on a VN computer, many algorithms
cannot be executed at their best potential performance. Algorithms that perform the same set of inherently
parallel operations on a huge set of data are not good candidates for implementation on a VN machine.
 If the class of algorithms to be executed is known in advance, then the processor can be modified to better match
the computation paradigm of that class of application. In this case, the data path will be tailored to always execute
the same set of operations, thus making the memory access for instruction fetching as well as the instruction
decoding redundant.
Also because of the temporal use of the same hardware for a wide variety of applications, VN computation is often
characterized as ‘temporal computation’.
2. Domain-Specific Processors
 A domain-specific processor is a processor tailored for a class of algorithms.
 The data path is tailored for an optimal execution of a common set of operations that mostly characterizes the
algorithm in the given class.
 Also, memory access is reduced as much as possible.
Digital signal processors (DSPs) are among the most widely used domain-specific processors.
 A DSP is a specialized processor used to speed-up computation of repetitive numerically intensive tasks in signal
processing areas such as telecommunication, multimedia, automobile, radar, sonar, seismic, image processing, etc.
 The most often cited feature of DSPs is their ability to perform one or more multiply-accumulate (MAC)
operations in a single cycle. Usually, MAC operations have to be performed on a huge set of data. In a MAC
operation, data are first multiplied and then added to an accumulated value (a C sketch of this pattern follows this list).
 A normal VN computer would perform a MAC in about ten steps: the first instruction (multiply) would be fetched, then
decoded, then the operands would be read and multiplied, and the result would be stored back; the next instruction
(accumulate) would then be read and decoded, the result stored in the previous step would be read again and added to the
accumulated value, and the final result would be stored back.
 DSPs avoid those steps by using specialized hardware that directly performs the addition after multiplication without
having to access the memory.
 Because many DSP algorithms involve performing repetitive computations, most DSP processors provide special
support for efficient looping. Often a special loop or repeat instruction is provided, which allows a loop
implementation without expending any instruction cycles for updating and testing the loop counter or branching
back to the top of the loop.
 DSPs are also customized for data of a given width, according to the application domain. For example, if a DSP is
to be used for image processing, then pixels have to be processed. If the pixels are represented in the Red Green Blue
(RGB) system, where each colour is represented by a byte, then an image-processing DSP needs no more than an 8-bit
data path. Obviously, such a DSP cannot be reused for applications requiring 32-bit
computation.
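The MAC pattern above is essentially the inner loop of a dot product or FIR filter. The C sketch below (the data and function name are illustrative, not from these notes) shows the loop that a DSP collapses to one MAC instruction per iteration, usually inside a zero-overhead hardware loop.

/* MAC kernel: a DSP executes each acc += x[i] * h[i] in a single cycle,
 * with dedicated hardware looping instead of explicit counter updates. */
#include <stdio.h>

long dot_product(const int *x, const int *h, int n) {
    long acc = 0;                       /* the accumulator register  */
    for (int i = 0; i < n; i++)
        acc += (long)x[i] * h[i];       /* multiply, then accumulate */
    return acc;
}

int main(void) {
    int x[4] = {1, 2, 3, 4}, h[4] = {4, 3, 2, 1};
    printf("acc = %ld\n", dot_product(x, h, 4));  /* 4+6+6+4 = 20 */
    return 0;
}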
Advantage and Disadvantage:
 This specialization of DSPs increases the performance of the processor and improves the device utilization.
However, the flexibility is reduced, because the processor can no longer be used for applications other than
those for which it was optimally designed.
3. Application-Specific Processors
Although DSPs incorporate a degree of application-specific features, such as MAC units and data-width optimization, they
still follow the VN approach and therefore remain sequential machines; their performance is limited.
If a processor has to be used for only one application, which is known and fixed in advance, then the processing unit
can be designed and optimized for that particular application. In this case, we say that 'the hardware adapts itself to the
application'.
For example, in multimedia processing, processors are usually designed to perform the compression of video frames
according to a video compression standard. Such processors cannot be used for anything other than compression. Even
in compression, the standard must exactly match the one implemented in the processor.
A processor designed for only one application is called an Application-Specific Processor (ASIP).
 In an ASIP, the instruction cycles (IR, D, EX, W) are eliminated.
 The instruction set of the application is directly implemented in hardware.
 Input data stream in the processor through its inputs, the processor performs the required computation and the
results can be collected at the outputs of the processor.
 ASIPs are usually implemented as single chips called Application-Specific Integrated Circuit (ASIC)
Example: If Algorithm 1 (reconstructed below) has to execute on a Von Neumann computer, then at least 3 instructions are required.
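The listing of Algorithm 1 was part of a figure that did not survive in these notes; from the operations discussed below, it can plausibly be reconstructed as the following C fragment (the variable roles are inferred, not original).

/* Plausible reconstruction of Algorithm 1, inferred from the text below. */
void algorithm1(int a, int b, int *c, int *d) {
    if (a < b) {
        *d = a + b;
        *c = a * b;
    } else {
        *d = b + 1;
        *c = a - 1;
    }
}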
With tcycle being the instruction cycle time, the program will be executed in 3 ∗ 5 ∗ tcycle = 15 ∗ tcycle without pipelining.
Let us now consider the implementation of the same algorithm in an ASIP. We can implement the instructions d = a +
b and c = a ∗ b in parallel. The same is also true for d = b + 1, c = a − 1 as illustrated in figure below:
Fig: ASIP implementation of Algorithm 1
The four instructions a+b, a∗b, b+1, a−1 as well as the comparison a < b will be executed in parallel in a first stage.
Depending on the value of the comparison a < b, the correct values of the previous stage computations will be assigned
to c and d as defined in the program.
Let tmax be the longest time needed by a signal to move from one point to another in the physical implementation of
the processor (this happens on the path input-multiply-multiplex). tmax is also called the cycle time of the ASIP
processor.
For two inputs a and b, the results c and d can be computed in time tmax. The VN processor can compete with this
ASIP only if 15 ∗ tcycle < tmax, i.e. tcycle < tmax/15: the VN processor's clock must be at least 15 times faster than the
ASIP's for it to be competitive.
ASIPs use a spatial approach to implement only one application. The functional units needed for the computation of all
parts of the application must be available on the surface of the final processor. This kind of computation is called
‘Spatial Computing’.
Once again, an ASIP built to perform a given computation cannot be used for tasks other than those for
which it was originally designed.
4. Reconfigurable Computing
We can identify two main means to characterize processors: flexibility and performance.
VN COMPUTERS
 The VN computers are very flexible because they are able to compute any kind of task. This is the reason why the
terminology GPP (General Purpose Processor) is used for the VN machine.
 They do not deliver much performance, because they cannot compute in parallel. Moreover, the five steps (IR, D,
R, EX, W) needed to perform one instruction become a major drawback, in particular if the same instruction has
to be executed on huge sets of data.
 Flexibility is possible because ‘the application must always adapt to the hardware’ in order to be executed.
ASIPs:
 ASIPs deliver high performance because they are optimized for a particular application. The instruction set
required for that application can then be built into a chip.
 Performance is possible because ‘the hardware is always adapted to the application’.
Ideally, we would like to have the flexibility of the GPP and the performance of the ASIP in the same device. We would
like a device able 'to adapt to the application' on the fly. We call such a hardware device reconfigurable
hardware, a reconfigurable device, or a reconfigurable processing unit (RPU).
Reconfigurable computing is defined as the study of computation using reconfigurable devices.
For a given application, at a given time, the spatial structure of the device will be modified such as to use the best
computing approach to speed up that application. If a new application has to be computed, the device structure will be
modified again to match the new application. Contrary to VN computers, which are programmed by a set of
instructions to be executed sequentially, the structure of a reconfigurable device is changed by modifying all or part of
the hardware at compile-time or at run-time, usually by downloading a so-called bitstream into the device.
Configuration is the process of changing the structure of a reconfigurable device at start-up time.
Reconfiguration is the process of changing the structure of a reconfigurable device at run-time.
5. Fields of Application
5.1 Rapid Prototyping
 The development of an ASIC, which can be seen as the physical implementation of an ASIP, is a cumbersome process
consisting of several steps, from the specification down to the layout of the chip and the final production. Because of
the different optimization goals in development, several teams of engineers are usually involved. This increases the
non-recurring engineering (NRE) cost, which can only be amortized if the final product is produced in very large
quantities.
 Contrary to software development, errors discovered after production mean enormous losses, because the produced
pieces become unusable.
 Rapid prototyping allows a device to be tested in real hardware before the final production. In this sense, errors can
be corrected without affecting the pieces already produced. A reconfigurable device is useful here, because it can be
used several times to implement different versions of the final product until an error-free state is reached.
 One of the concepts related to rapid prototyping is hardware emulation, in analogy to software simulation. With
hardware emulation, a hardware module is tested under real operating conditions in the environment where it will be
deployed later.
5.2 In-System Customization
 Time to market is one of the main challenges that electronics manufacturers face today. In order to secure market
segments, manufacturers must release their products as quickly as possible. In many cases, a working product
can be released with reduced functionality; the manufacturer can then upgrade the product in the field to incorporate
new functionalities.
 A reconfigurable device provides such capabilities of being upgraded in the field, by changing the configuration.
 In-system customization can also be used to upgrade systems deployed in inaccessible or very difficult to
access locations. One example is the Mars rover vehicles, which use FPGAs that can be modified from
Earth.
5.3 Multi-modal Computation
 The number of electronic devices we interact with is permanently increasing. Many people carry, besides a mobile
phone, other devices such as handhelds, portable MP3 players, portable video players, etc. Besides those mobile
devices, fixed devices such as navigation systems, music and video players, as well as TV sets, are available in
cars and at home.
 All those devices are equipped with electronic control units that run the desired application on the desired device.
Furthermore, many of the devices are used in a time-multiplexed fashion: it is difficult to imagine someone playing
MP3 songs while watching a video clip and taking a phone call.
 For a group of devices used exclusively in a time-multiplexed way, a single electronic control unit can serve all of
them. Whenever a service is needed, the control unit is connected to the corresponding device at the correct location and
reconfigured with the adequate configuration.
 For instance, a domestic MP3 player, a domestic DVD player, a car MP3 player, a car DVD player, as well as a mobile MP3 player
and a mobile video player can all share the same electronic unit, if they are always used by the same person.
 The user just needs to remove the control unit from the domestic devices and connect it to the car device when
going to work. The control unit can be removed from the car and connected to a mobile device if the user decides to
go for a walk. Coming back home, the electronic control unit is removed from the mobile device and used for
watching video.
5.4 Adaptive Computing Systems
 Advances in computation and communication are helping the development of ubiquitous and pervasive computing.
 The design of ubiquitous computing systems is a cumbersome task that cannot be dealt with only at compile time.
Because of the uncertainty and unpredictability of such systems, it is impossible at compile time to address all the
scenarios that can arise at run-time from unpredictable changes in the environment.
 We need computing systems that are able to adapt their behavior and structure to changing operating and
environmental conditions, to time-varying optimization objectives, and to physical constraints such as changing
protocols and new standards. We call such computing systems adaptive computing systems.
 Reconfiguration can provide a good foundation for the realization of adaptive systems, because it allows a system to
react quickly to changes by adopting the optimal behavior for a given run-time scenario.
6. Reconfigurable Devices
 Broadly considered, reconfigurable devices fill their silicon area with a large number of computing primitives,
interconnected via a configurable network.
 The operation of each primitive, as well as the interconnect pattern, can be programmed. Computational tasks can be
implemented spatially on the device, with intermediate values flowing directly from the producing function to the
receiving function.
 Reconfigurable computing generally provides spatially-oriented processing rather than the temporally-oriented
processing typical of programmable architectures such as microprocessors.
6.1 Characteristics of Reconfigurable Devices
The key differences between reconfigurable machines and conventional processors are:
Instruction Distribution – Rather than broadcasting a new instruction to the functional units on every cycle,
instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and
effectively deliver more instructions into active silicon on each cycle.
Spatial routing of intermediates – As space permits, intermediate values are routed in parallel from producing
function to consuming function rather than forcing all communication to take place in time through a central resource
bottleneck.
More, often finer-grained, separately programmable building blocks – Reconfigurable devices provide a large
number of separately programmable building blocks allowing a greater range of computations to occur per time step.
This effect is largely enabled by the compressed instruction distribution.
Distributed deployable resources, eliminating bottlenecks – Resources such as memory, interconnect, and functional
units are distributed and deployable based on need rather than being centralized in large pools. Independent, local
access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a
central resource bottleneck.
7. Programmable, Configurable and Fixed Function Devices
Programmable – we will use the term “programmable” to refer to architectures which heavily and rapidly reuse a single
piece of active circuitry for many different functions. The canonical example of a programmable device is a processor
which may perform a different instruction on its ALU on every cycle. All processors, be they microcoded, SIMD, Vector,
or VLIW are included in this category.
Configurable – we use the term "configurable" to refer to architectures where the active circuitry can perform any of a
number of different operations, but the function cannot be changed from cycle to cycle. FPGAs are our canonical example
of a configurable device.
Fixed Function, Limited Operation Diversity, High Throughput – When the function and data granularity to be
computed are well-understood and fixed, and when the function can be economically implemented in space, dedicated
hardware provides the most computational capacity per unit area to the application.
Variable Function, Low Diversity – If the required function is unknown or varying but the instruction or data diversity is
low, the task can be mapped directly onto a reconfigurable computing device, which can then extract high computational
density efficiently.
Space Limited, High Entropy – If we are limited spatially and the function to be computed has high operation and data
diversity, we are forced to reuse the limited active space heavily and to accept limited instruction and data bandwidth. In this
regime, conventional processor organizations are most effective, since they dedicate considerable space to on-chip
instruction storage in order to minimize off-chip instruction traffic while executing descriptively complex tasks.
8. Key Relations
An empirical review of conventional reconfigurable devices shows that 80-90% of their area is dedicated to the
switches and wires making up the reconfigurable interconnect. Most of the remaining area goes into configuration
memory for the network. The actual logic function accounts for only a few percent of the area in a reconfigurable
device.
This interconnect and configuration overhead is responsible for the roughly 100× density disadvantage which reconfigurable
devices suffer relative to hardwired logic.
To a first-order approximation, this gives us:

Arealogic << Areaconfig << Areanet (Relation 1.1)

It is this basic relationship which characterizes the RP design space.
 Since Areaconfig << Areanet, devices with a single on-chip configuration, such as most reconfigurable devices, can
afford to exert fine-grained control over their operations – any savings associated with sharing configuration
bits would be small compared to the network area.
 Since Areaconfig << Areanet, to pack the most functional diversity into a part, one can allocate multiple
configurations on chip. With the order-of-magnitude relative sizes given in Relation 1.1, up to a 10× increase in
the functional diversity per unit area is attainable in this manner.
 However, since Areanet is only about 10× Areaconfig, if the number of configurations is large, say 100 or more, the
configuration memory area will become the dominant size factor. Processors are essentially optimized into this
regime, and this accounts for their roughly 10× lower performance density compared to reconfigurable devices.
 Once we go to a large number of contexts, such that the total configuration memory space begins to dominate
interconnect area, fine granularity becomes costly. In this regime, wide-word operations allow us to reduce
instruction area across multiple bit-processing elements.
This simultaneously allows machines with wide (e.g. 32-bit) datapaths to hold 1000s of configurations on chip
while making them only about 10× less computationally dense than fine-grained, single-context devices (a toy
numerical illustration follows this list).
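To see these regimes numerically, the toy calculation below evaluates the area model implied above. The 1 : 10 : 100 logic : configuration : network ratios are assumptions read off the percentages quoted earlier, not measured values.

/* Toy area model per processing element:
 *   area = logic + contexts * config + network
 * Functional diversity = resident configurations per unit area. */
#include <stdio.h>

int main(void) {
    const double a_logic = 1.0, a_config = 10.0, a_net = 100.0; /* assumed */
    const int contexts[] = {1, 10, 100, 1000};
    for (int i = 0; i < 4; i++) {
        int n = contexts[i];
        double area = a_logic + n * a_config + a_net;
        printf("%4d contexts: area %7.0f, diversity %.4f\n", n, area, n / area);
    }
    /* Diversity climbs from ~0.009 at one context toward the 1/a_config = 0.1
     * ceiling (roughly the 10x gain quoted above); beyond a few tens of
     * contexts the configuration memory dominates the total area. */
    return 0;
}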
9. General-Purpose Computing
 General-purpose computing devices are specifically intended for those cases where we cannot or need not dedicate
sufficient spatial resources to support an entire computational task or where we do not know enough about the
required task or tasks prior to fabrication to hardwire the functionality.
 The key ideas behind general-purpose processing are:
i. Postpone binding of functionality until the device is employed – i.e. after fabrication
ii. Exploit temporal reuse of limited functional capacity
 Delayed binding and temporal reuse work closely together and occur at many scales to provide the characteristics
we now expect from general-purpose computing devices.
 Market Level – Rather than dedicating a machine design to a single application or application family, the
design effort may be utilized for many different applications.
 System Level – Rather than dedicating an expensive machine to a single application, the machine may
perform different applications at different times by running different sets of instructions.
 Application Level – Rather than spending precious real estate to build a separate computational unit for
each different function required, central resources may be employed to perform these functions in sequence
with an additional input, an instruction, telling it how to behave at each point in time.
 Algorithm Level – Rather than fixing the algorithms which an application uses, an existing general-purpose
machine can be reprogrammed with new techniques and algorithms as they are developed.
 User Level – Rather than fixing the function of the machine at the supplier, the instruction stream specifies
the function, allowing the end user to use the machine as best suits his needs. Machines may be used for
functions which the original designers did not conceive. Further, machine behavior may be upgraded in the
field without incurring any hardware or hardware handling costs.
10. General-Purpose Computing Issues
There are two key features associated with general-purpose computers which distinguish them from their specialized
counterparts:
Interconnect
 In general-purpose machines, the datapaths between functional units cannot be hardwired.
 Different tasks will require different patterns of interconnect between the functional units. Within a task
individual routines and operations may require different interconnectivity of functional units.
 General-purpose machines must provide the ability to direct data flow between units. In the extreme of a single
functional unit, memory locations are used to perform this routing function.
 As more functional units operate together on a task, spatial switching is required to move data among functional
units and memory.
 The flexibility and granularity of this interconnect is one of the big factors determining yielded capacity on a
given application.
Instructions
 Since general-purpose devices must provide different operations over time, either within a computational task or
between computational tasks, they require additional inputs, instructions, which tell the silicon how to behave at
any point in time.
 Each general-purpose processing element needs one instruction to tell it what operation to perform and where to
find its inputs.
 The handling of this additional input is one of the key distinguishing features between different kinds of general-
purpose computing structures.
 When the functional diversity is large and the required task throughput is low, it is not efficient to build up the
entire application dataflow spatially in the device. Rather, we can realize applications, or collections of
applications, by sharing and reusing limited hardware resources in time (see Figure 2.1), replicating only the
less expensive memory for instruction and intermediate data storage.
11. Regular and Irregular Computing Tasks
Computing tasks can be classified informally by their regularity:
 Regular tasks perform the same sequence of operations repeatedly. Regular tasks have few data dependent
conditionals such that all data processed requires essentially the same sequence of operations with highly
predictable flow of execution. Nested loops with static bounds and minimal conditionals are the typical example
of regular computational tasks.
 Irregular computing tasks, in contrast, perform highly data dependent operations. The operations required vary
widely according to the data processed, and the flow of operation control is difficult to predict.
12. Metrics: Density, Diversity, and Capacity
The goal of general-purpose computing devices is to provide a common, computational substrate which can be used to
perform any particular task. Metrics are tools to characterize the amount of computation provided by a given general-
purpose structure.
For example, if we were simply comparing tasks in terms of a fixed processor instruction set, we might measure this
computational work in terms of instruction evaluations. If we were comparing tasks to be implemented in a gate array,
we might compare the number of gates required to implement the task and the number of time units required to
complete it. Here, we want to consider the portion of the computational work done for the task on each cycle of
execution of diverse, general-purpose computing devices of a certain size.
Functional Density – Functional Capacity
Functional capacity is a space-time metric which tells us how much computational work a structure can do per unit
time. Correspondingly, our "gate-evaluation" metric is a unit of space-time capacity.
 That is, if a device can optimally provide a capacity of Dcap gate evaluations per second to the application, a task
requiring Nge gate evaluations can be completed in time:

T = Nge / Dcap
 In practice, limitations on the way the device's capacity may be employed will often cause the task to take longer and
result in a lower yielded capacity. If the task takes time TTask-actual to perform NTask_ge gate evaluations, then the yielded capacity is:

Dyielded = NTask_ge / TTask-actual
 The capacity which a particular structure can provide generally increases with area. To understand the relative benefits
of each computing structure or architecture independent of the area of a particular implementation, we can look at
the capacity provided per unit area. We normalize area in units of λ, half the minimum feature size, to make the results
independent of the particular process feature size.
 Our metric for functional density, then, is capacity per unit area, measured in gate-evaluations per unit
space-time, i.e. in units of gate-evaluations/(λ² · s). The general expression for the functional density, given an
operational cycle time tcycle for a unit of silicon of size A evaluating Nge gate evaluations per cycle, is:

Dfunctional = Nge / (A ∗ tcycle)
Concluding Remarks on Functional Density:
 This density metric tells us the relative merits of dedicating a portion of our silicon budget to a particular computing
architecture.
 The density focus makes the implicit assumption that capacity can be composed and scaled to provide the aggregate
computational throughput required for a task.
 To the extent this is true, architectures with the highest capacity density that can actually handle a problem should
require the least size and cost.
 Remember also that resource capacity is primarily interesting when the resource is limited. We look at computational
capacity to the extent that we are compute limited. When we are I/O limited, performance may be controlled much more
by I/O bandwidth.
Functional Diversity – Instruction Density
Functional density alone tells us only how much raw throughput we get. It says nothing about how many different
functions can be performed, and on what time scale. For this reason, it is also interesting to look at the functional
diversity, or instruction density. With Ninstr distinct instructions resident in an area A, we define functional diversity as:

Dfd = Ninstr / A
Concluding Remarks on Functional Diversity:
 We use functional diversity to indicate how many distinct function descriptions are resident per unit of area.
 This tells us how many different operations can be performed within part or all of an IC without going outside of the
region or chip for additional instructions.
Data Density
Space must also be allocated to hold data for the computation. This area competes with logic, interconnect, and
instructions for silicon area. Thus, we will also look at data density when examining the capacity of an architecture. If
we put Nbits of data for an application into a space A, we get a data density:

Ddata = Nbits / A
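As a worked illustration, the snippet below plugs invented numbers (none of them come from these notes) into the three metric definitions above.

/* Worked example of the three capacity metrics; all inputs are made up. */
#include <stdio.h>

int main(void) {
    double A       = 1.0e9;    /* device area in lambda^2 (assumed)       */
    double t_cycle = 10.0e-9;  /* operational cycle time in seconds       */
    double N_ge    = 5.0e4;    /* gate evaluations per cycle              */
    double N_instr = 2.0e3;    /* distinct instructions resident on chip  */
    double N_bits  = 1.0e6;    /* data bits resident on chip              */

    printf("functional density   = %.3e gate-evals/(lambda^2*s)\n",
           N_ge / (A * t_cycle));
    printf("functional diversity = %.3e instructions/lambda^2\n", N_instr / A);
    printf("data density         = %.3e bits/lambda^2\n", N_bits / A);
    return 0;
}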
13. Fine-Grained and Coarse-Grained Reconfigurable Architectures
 Reconfigurable devices are divided into two groups according to the level of hardware abstraction, or
granularity, they use:
i. Fine-grained reconfigurable architectures
ii. Coarse-grained reconfigurable architectures
Fine-grained reconfigurable devices
 These are hardware devices whose functionality can be modified at a very low level of granularity. For
instance, the device can be modified so as to add or remove a single inverter or a single two-input NAND
gate.
 Fine-grained reconfigurable devices are mostly represented by programmable logic devices (PLDs). These include
programmable logic arrays (PLAs), programmable array logics (PALs), complex programmable logic devices
(CPLDs) and field programmable gate arrays (FPGAs).
 On fine-grained reconfigurable devices, the number of gates is often used as the unit of measure for the capacity of
an RPU.
 Example: FPGA
 Issues:
 Bit-level operation => many processing units => large routing overhead => low silicon-area
efficiency => higher power consumption
 High volume of configuration data needed => large configuration memory => more power dissipation
and long configuration times
Coarse-grained reconfigurable devices
 Also called pseudo-reconfigurable devices.
 These devices usually consist of a set of coarse-grained elements, such as ALUs, that allow for the efficient
implementation of a dataflow function.
 Hardware abstraction is based on a set of fixed blocks (such as functional units, processor cores and memory tiles).
 Reconfiguration consists merely of reprogramming the interconnections.
 Coarse-grained reconfigurable devices do not have any established measure of comparison. However, the
number of PEs on a device, as well as their characteristics such as granularity, the amount of on-device memory
and the variety of peripherals, may give an indication of the quality of the device.
 Coarse-grained architectures overcome the disadvantages of fine-grained architectures:
 They provide multiple-bit-wide datapaths and complex operators
 They avoid much of the routing overhead
 They need fewer configuration lines => lower area use

More Related Content

What's hot (20)

Reconfigurable Computing
Reconfigurable ComputingReconfigurable Computing
Reconfigurable Computing
 
fpga programming
fpga programmingfpga programming
fpga programming
 
Twin well process
Twin well processTwin well process
Twin well process
 
Embedded system
Embedded systemEmbedded system
Embedded system
 
Embedded System Presentation
Embedded System PresentationEmbedded System Presentation
Embedded System Presentation
 
Embedded System Basics
Embedded System BasicsEmbedded System Basics
Embedded System Basics
 
ASIC vs FPGA
ASIC vs FPGAASIC vs FPGA
ASIC vs FPGA
 
Fpga architectures and applications
Fpga architectures and applicationsFpga architectures and applications
Fpga architectures and applications
 
SHORT CHANNEL EFFECTS IN MOSFETS- VLSI DESIGN
SHORT CHANNEL EFFECTS IN MOSFETS- VLSI DESIGNSHORT CHANNEL EFFECTS IN MOSFETS- VLSI DESIGN
SHORT CHANNEL EFFECTS IN MOSFETS- VLSI DESIGN
 
FPGA Design Challenges
FPGA Design ChallengesFPGA Design Challenges
FPGA Design Challenges
 
ARM7-ARCHITECTURE
ARM7-ARCHITECTURE ARM7-ARCHITECTURE
ARM7-ARCHITECTURE
 
Fabrication steps of IC
Fabrication steps of ICFabrication steps of IC
Fabrication steps of IC
 
Delays in verilog
Delays in verilogDelays in verilog
Delays in verilog
 
Power dissipation cmos
Power dissipation cmosPower dissipation cmos
Power dissipation cmos
 
Pass Transistor Logic
Pass Transistor LogicPass Transistor Logic
Pass Transistor Logic
 
Short Channel Effect In MOSFET
Short Channel Effect In MOSFETShort Channel Effect In MOSFET
Short Channel Effect In MOSFET
 
Adiabatic logic or clock powered logic
Adiabatic logic or clock powered logicAdiabatic logic or clock powered logic
Adiabatic logic or clock powered logic
 
Unit4.tms320c54x
Unit4.tms320c54xUnit4.tms320c54x
Unit4.tms320c54x
 
Vlsi design flow
Vlsi design flowVlsi design flow
Vlsi design flow
 
Field-programmable gate array
Field-programmable gate arrayField-programmable gate array
Field-programmable gate array
 

Similar to Reconfigurable computing

computer architecture and the fetch execute cycle By ZAK
computer architecture and the fetch execute cycle By ZAKcomputer architecture and the fetch execute cycle By ZAK
computer architecture and the fetch execute cycle By ZAKTabsheer Hasan
 
1.3.2 computer architecture and the fetch execute cycle By ZAK
1.3.2 computer architecture and the fetch execute cycle By ZAK1.3.2 computer architecture and the fetch execute cycle By ZAK
1.3.2 computer architecture and the fetch execute cycle By ZAKTabsheer Hasan
 
Multilevel arch & str org.& mips, 8086, memory
Multilevel arch & str org.& mips, 8086, memoryMultilevel arch & str org.& mips, 8086, memory
Multilevel arch & str org.& mips, 8086, memoryMahesh Kumar Attri
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreadingFraboni Ec
 
Computer Arithmetic and Processor Basics
Computer Arithmetic and Processor BasicsComputer Arithmetic and Processor Basics
Computer Arithmetic and Processor BasicsShinuMMAEI
 
16-bit Microprocessor Design (2005)
16-bit Microprocessor Design (2005)16-bit Microprocessor Design (2005)
16-bit Microprocessor Design (2005)Susam Pal
 
Bindura university of science education
Bindura university of science educationBindura university of science education
Bindura university of science educationInnocent Tauzeni
 
Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-iAnkit Shah
 
03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnection03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnectionnoman yasin
 

Similar to Reconfigurable computing (20)

Co notes3 sem
Co notes3 semCo notes3 sem
Co notes3 sem
 
computer architecture and the fetch execute cycle By ZAK
computer architecture and the fetch execute cycle By ZAKcomputer architecture and the fetch execute cycle By ZAK
computer architecture and the fetch execute cycle By ZAK
 
1.3.2 computer architecture and the fetch execute cycle By ZAK
1.3.2 computer architecture and the fetch execute cycle By ZAK1.3.2 computer architecture and the fetch execute cycle By ZAK
1.3.2 computer architecture and the fetch execute cycle By ZAK
 
Multilevel arch & str org.& mips, 8086, memory
Multilevel arch & str org.& mips, 8086, memoryMultilevel arch & str org.& mips, 8086, memory
Multilevel arch & str org.& mips, 8086, memory
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreading
 
Co question 2008
Co question 2008Co question 2008
Co question 2008
 
Co question 2006
Co question 2006Co question 2006
Co question 2006
 
Computer Arithmetic and Processor Basics
Computer Arithmetic and Processor BasicsComputer Arithmetic and Processor Basics
Computer Arithmetic and Processor Basics
 
Introduction to Microcontrollers
Introduction to MicrocontrollersIntroduction to Microcontrollers
Introduction to Microcontrollers
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Oversimplified CA
Oversimplified CAOversimplified CA
Oversimplified CA
 
3.3
3.33.3
3.3
 
Pipeline Computing by S. M. Risalat Hasan Chowdhury
Pipeline Computing by S. M. Risalat Hasan ChowdhuryPipeline Computing by S. M. Risalat Hasan Chowdhury
Pipeline Computing by S. M. Risalat Hasan Chowdhury
 
Spr ch-01
Spr ch-01Spr ch-01
Spr ch-01
 
16-bit Microprocessor Design (2005)
16-bit Microprocessor Design (2005)16-bit Microprocessor Design (2005)
16-bit Microprocessor Design (2005)
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
Bindura university of science education
Bindura university of science educationBindura university of science education
Bindura university of science education
 
Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-i
 
03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnection03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnection
 

More from Sudhanshu Janwadkar

Keypad Interfacing with 8051 Microcontroller
Keypad Interfacing with 8051 MicrocontrollerKeypad Interfacing with 8051 Microcontroller
Keypad Interfacing with 8051 MicrocontrollerSudhanshu Janwadkar
 
ASIC design Flow (Digital Design)
ASIC design Flow (Digital Design)ASIC design Flow (Digital Design)
ASIC design Flow (Digital Design)Sudhanshu Janwadkar
 
Introduction to 8051 Timer/Counter
Introduction to 8051 Timer/CounterIntroduction to 8051 Timer/Counter
Introduction to 8051 Timer/CounterSudhanshu Janwadkar
 
Architecture of the Intel 8051 Microcontroller
Architecture of the Intel 8051 MicrocontrollerArchitecture of the Intel 8051 Microcontroller
Architecture of the Intel 8051 MicrocontrollerSudhanshu Janwadkar
 
Introduction to Embedded Systems
Introduction to Embedded SystemsIntroduction to Embedded Systems
Introduction to Embedded SystemsSudhanshu Janwadkar
 
Interconnects in Reconfigurable Architectures
Interconnects in Reconfigurable ArchitecturesInterconnects in Reconfigurable Architectures
Interconnects in Reconfigurable ArchitecturesSudhanshu Janwadkar
 
Design and Implementation of a GPS based Personal Tracking System
Design and Implementation of a GPS based Personal Tracking SystemDesign and Implementation of a GPS based Personal Tracking System
Design and Implementation of a GPS based Personal Tracking SystemSudhanshu Janwadkar
 
Embedded Logic Flip-Flops: A Conceptual Review
Embedded Logic Flip-Flops: A Conceptual ReviewEmbedded Logic Flip-Flops: A Conceptual Review
Embedded Logic Flip-Flops: A Conceptual ReviewSudhanshu Janwadkar
 

More from Sudhanshu Janwadkar (20)

DSP Processors versus ASICs
DSP Processors versus ASICsDSP Processors versus ASICs
DSP Processors versus ASICs
 
Keypad Interfacing with 8051 Microcontroller
Keypad Interfacing with 8051 MicrocontrollerKeypad Interfacing with 8051 Microcontroller
Keypad Interfacing with 8051 Microcontroller
 
ASIC design Flow (Digital Design)
ASIC design Flow (Digital Design)ASIC design Flow (Digital Design)
ASIC design Flow (Digital Design)
 
LCD Interacing with 8051
LCD Interacing with 8051LCD Interacing with 8051
LCD Interacing with 8051
 
Interrupts in 8051
Interrupts in 8051Interrupts in 8051
Interrupts in 8051
 
Serial Communication in 8051
Serial Communication in 8051Serial Communication in 8051
Serial Communication in 8051
 
SPI Bus Protocol
SPI Bus ProtocolSPI Bus Protocol
SPI Bus Protocol
 
I2C Protocol
I2C ProtocolI2C Protocol
I2C Protocol
 
Introduction to 8051 Timer/Counter
Introduction to 8051 Timer/CounterIntroduction to 8051 Timer/Counter
Introduction to 8051 Timer/Counter
 
Intel 8051 Programming in C
Intel 8051 Programming in CIntel 8051 Programming in C
Intel 8051 Programming in C
 
Hardware View of Intel 8051
Hardware View of Intel 8051Hardware View of Intel 8051
Hardware View of Intel 8051
 
Architecture of the Intel 8051 Microcontroller
Architecture of the Intel 8051 MicrocontrollerArchitecture of the Intel 8051 Microcontroller
Architecture of the Intel 8051 Microcontroller
 
Introduction to Embedded Systems
Introduction to Embedded SystemsIntroduction to Embedded Systems
Introduction to Embedded Systems
 
CMOS Logic
CMOS LogicCMOS Logic
CMOS Logic
 
Interconnects in Reconfigurable Architectures
Interconnects in Reconfigurable ArchitecturesInterconnects in Reconfigurable Architectures
Interconnects in Reconfigurable Architectures
 
Introduction to FPGAs
Introduction to FPGAsIntroduction to FPGAs
Introduction to FPGAs
 
Design and Implementation of a GPS based Personal Tracking System
Design and Implementation of a GPS based Personal Tracking SystemDesign and Implementation of a GPS based Personal Tracking System
Design and Implementation of a GPS based Personal Tracking System
 
Embedded Logic Flip-Flops: A Conceptual Review
Embedded Logic Flip-Flops: A Conceptual ReviewEmbedded Logic Flip-Flops: A Conceptual Review
Embedded Logic Flip-Flops: A Conceptual Review
 
Memory and Processor Testing
Memory and Processor TestingMemory and Processor Testing
Memory and Processor Testing
 
Pass Transistor Logic
Pass Transistor LogicPass Transistor Logic
Pass Transistor Logic
 

Recently uploaded

The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfPipe Restoration Solutions
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopEmre Günaydın
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Krakówbim.edu.pl
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxViniHema
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringC Sai Kiran
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdfKamal Acharya
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdfKamal Acharya
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwoodseandesed
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationRobbie Edward Sayers
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxwendy cai
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdfKamal Acharya
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdfKamal Acharya
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Electivekarthi keyan
 
Toll tax management system project report..pdf
Toll tax management system project report..pdfToll tax management system project report..pdf
Toll tax management system project report..pdfKamal Acharya
 
shape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxshape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxVishalDeshpande27
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdfKamal Acharya
 

Recently uploaded (20)

The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in Kraków
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-4 Notes for II-II Mechanical Engineering
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Toll tax management system project report..pdf
Toll tax management system project report..pdfToll tax management system project report..pdf
Toll tax management system project report..pdf
 
shape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxshape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptx
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 

Reconfigurable computing

  • 1. Savitribai Phule PUNE UNIVERSITY Reconfigurable Computing (504203) Module I Notes INTRODUCTION Those architectures can be categorized in three main groups according to their degree of flexibility: i. The general purpose computing group that is based on the Von Neumann (VN) computing paradigm; ii. Domain-specific processors, tailored for a class of applications having in common a great range of characteristics; iii. Application-specific processors tailored for only one application. 1. General Purpose Computing General Purpose Computers are based on Von-Neumann Architecture. According to which, a computer could have a simple, fixed structure, able to execute any kind of computation, given a properly programmed control, without the need for hardware modification. The general structure of a VN machine is as shown in figure below:  It consists of:  A memory for storing program and data. Harvard architectures contain two parallel accessible memories for storing program and data separately.  A control unit (also called control path) featuring a program counter that holds the address of the next instruction to be executed.  An arithmetic and logic unit (also called data path) in which instructions are executed.  A program is coded as a set of instructions to be executed sequentially, instruction after instruction. At each step of the program execution, the next instruction is fetched from the memory at the address specified in the program counter and decoded. The required operands are then collected from the memory before the instruction is executed. After execution, the result is written back into the memory.  In this process, the control path is in charge of setting all signals necessary to read from and write to the memory, and to allow the data path to perform the right computation. The data path is controlled by the control path, which interprets the instructions and sets the data path’s signals accordingly to execute the desired operation.  In general, the execution of an instruction on a VN computer can be done in five cycles: Instruction Read (IR) in which an instruction is fetched from the memory; Decoding (D) in which the meaning of the instruction is determined and the operands are localized; Read Operands (R) in which the operands are read from the memory; Execute (EX) in which the instruction is executed with the read operands; Write Result (W) in which the result of the execution is stored back to the memory.  In each of those five cycles, only the part of the hardware involved in the computation is activated. The rest remains idle. For example, if the IR cycle is to be performed, the program counter will be activated to get the address of the instruction, the memory will be addressed and the instruction register to store the instruction before decoding will be also activated. Apart from those three units (program counter, memory and instruction register), all the other units remain idle. Fortunately, the structure of instructions allows several of them to occupy the idle part of the processor, thus increasing the computation throughput.
the execute cycle (EX) together with the reading of the operands (R) for the same instruction will not work, because the data needed by EX must first be provided by the R cycle. Nevertheless, two cycles such as EX and D can be performed in parallel for two different instructions. Once the operands have been collected for the first instruction, its execution can start while the operands are being collected for the second instruction. This overlapping in the execution of instructions is called pipelining or instruction level parallelism (ILP), and it is aimed at increasing the throughput in the execution of instructions as well as the resource utilization. It should be mentioned that ILP does not reduce the execution latency of a single instruction, but increases the throughput of a set of instructions.

If tcycle is the time needed to execute one cycle, then the execution of one instruction requires 5 ∗ tcycle. If three instructions have to be executed, the time needed to execute them without pipelining is 15 ∗ tcycle, as illustrated in the figure. Using pipelining, the ideal time needed for those three instructions, when no hazards have to be dealt with, is 7 ∗ tcycle. In reality, hazards must be taken into account, which increases the overall computation time to 9 ∗ tcycle (a small timing sketch follows at the end of this section).

The main advantage of the VN computing paradigm is its flexibility: it can be used to program almost any existing algorithm. However, an algorithm can be implemented on a VN computer only if it is coded according to the VN rules. We say in this case that 'the algorithm must adapt itself to the hardware'.

Disadvantages:
• Because all algorithms must be sequentially programmed to run on a VN computer, many algorithms cannot be executed at their best potential performance. Algorithms that perform the same set of inherently parallel operations on a huge set of data are poor candidates for implementation on a VN machine.
• If the class of algorithms to be executed is known in advance, the processor can be modified to better match the computation paradigm of that class of applications. In that case, the data path is tailored to always execute the same set of operations, making the memory access for instruction fetching as well as the instruction decoding redundant.

Because the same hardware is reused in time for a wide variety of applications, VN computation is often characterized as 'temporal computation'.
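The cycle counts quoted above can be verified with a few lines of arithmetic. A minimal C sketch, assuming an idealized in-order five-stage pipeline; the stage count, instruction count and the two hazard stall cycles are the figures used in these notes, and the formulas (n ∗ s cycles sequentially, s + n − 1 cycles pipelined) are the standard idealized ones:

    #include <stdio.h>

    /* Idealized instruction timing on a 5-stage VN pipeline.
     * Sequential: each instruction passes through all 5 stages
     * before the next one starts. Pipelined: once the pipe is
     * full, one instruction completes per cycle. */
    int main(void) {
        const int stages = 5;        /* IR, D, R, EX, W            */
        const int instructions = 3;  /* example used in the notes  */
        const int hazard_stalls = 2; /* stall cycles from the text */

        int t_sequential = instructions * stages;       /* 15 * tcycle */
        int t_pipelined = stages + (instructions - 1);  /*  7 * tcycle */
        int t_hazards = t_pipelined + hazard_stalls;    /*  9 * tcycle */

        printf("sequential : %d cycles\n", t_sequential);
        printf("ideal pipe : %d cycles\n", t_pipelined);
        printf("with stalls: %d cycles\n", t_hazards);
        return 0;
    }

Running it prints 15, 7 and 9 cycles, matching the three cases discussed above.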
2. Domain-Specific Processors

• A domain-specific processor is a processor tailored for a class of algorithms.
• The data path is tailored for an optimal execution of the common set of operations that mostly characterizes the algorithms in the given class.
• Memory access is also reduced as much as possible. Digital Signal Processors (DSPs) belong to the most widely used domain-specific processors.
• A DSP is a specialized processor used to speed up the computation of repetitive, numerically intensive tasks in signal processing areas such as telecommunication, multimedia, automobile, radar, sonar, seismic and image processing.
• The most often cited feature of DSPs is their ability to perform one or more multiply-accumulate (MAC) operations in a single cycle. Usually, MAC operations have to be performed on a huge set of data. In a MAC operation, data are first multiplied and then added to an accumulated value (see the sketch at the end of this section).
• A normal VN computer would perform a MAC in 10 steps (two instructions of five cycles each): the first instruction (multiply) would be fetched, decoded, its operands read and multiplied, and the result stored back; then the second instruction (accumulate) would be fetched, the result stored in the previous step would be read again and added to the accumulated value, and the result would be stored back.
• DSPs avoid those steps by using specialized hardware that directly performs the addition after the multiplication, without having to access the memory.
• Because many DSP algorithms involve repetitive computations, most DSP processors provide special support for efficient looping. Often a special loop or repeat instruction is provided, which allows a loop implementation without expending any instruction cycles on updating and testing the loop counter or branching back to the top of the loop.
• DSPs are also customized for data of a given width according to the application domain. For example, if a DSP is to be used for image processing, then pixels have to be processed. If the pixels are represented in the Red Green Blue (RGB) system, where each colour is represented by a byte, then an image processing DSP needs no more than an 8-bit data path. Obviously, such a DSP cannot be reused for applications requiring 32-bit computation.

Advantage and Disadvantage:
• This specialization increases the performance of the processor and improves the device utilization. However, the flexibility is reduced, because the processor can no longer be used to implement applications other than those for which it was optimally designed.
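The MAC pattern just described is the inner loop below; a minimal C sketch (the function name, operand types and FIR-style framing are illustrative, not from the notes). On a DSP, the multiply, the accumulation and the loop bookkeeping of each iteration collapse into a single-cycle repeat instruction, whereas a plain VN machine issues separate instructions for every step:

    #include <stdint.h>

    /* Multiply-accumulate kernel, e.g. the tap sum of an FIR filter.
     * A DSP executes each iteration's multiply + add in one cycle,
     * with zero-overhead looping; a generic VN processor spends
     * separate instruction cycles on the multiply, the add, the
     * loads and the loop branch. */
    int32_t mac(const int16_t *x, const int16_t *h, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (int32_t)x[i] * h[i];  /* the MAC operation */
        }
        return acc;
    }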
3. Application-Specific Processors

Although DSPs incorporate a certain degree of application-specific features, such as MAC units and data width optimization, they still follow the VN approach and therefore remain sequential machines; their performance is limited. If a processor is to be used for only one application, which is known and fixed in advance, then the processing unit can be designed and optimized for that particular application. In this case, we say that 'the hardware adapts itself to the application'. For example, in multimedia processing, processors are usually designed to perform the compression of video frames according to a given video compression standard. Such processors cannot be used for anything other than compression, and even for compression, the standard must exactly match the one implemented in the processor.

A processor designed for only one application is called an Application-Specific Processor (ASIP).
• In an ASIP, the instruction cycles (IR, D, R, EX, W) are eliminated.
• The instruction set of the application is directly implemented in hardware.
• Input data stream into the processor through its inputs, the processor performs the required computation, and the results can be collected at its outputs.
• ASIPs are usually implemented as single chips called Application-Specific Integrated Circuits (ASICs).

Example: If Algorithm 1 has to be executed on a Von Neumann computer, then at least 3 instructions are required.
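The notes give Algorithm 1 only as a figure; from the operations named in the next paragraph (a + b, a ∗ b, b + 1, a − 1 and the comparison a < b) it can plausibly be reconstructed as follows. A hedged C rendering of the temporal (VN) execution, which makes the minimum instruction count visible:

    /* Algorithm 1 executed temporally, one operation after another:
     * instruction 1 compares a < b (and branches); instructions 2
     * and 3 are the two operations of the branch actually taken. */
    void algorithm1_vn(int a, int b, int *c, int *d) {
        if (a < b) {        /* instruction 1: compare/branch */
            *d = a + b;     /* instruction 2 */
            *c = a * b;     /* instruction 3 */
        } else {
            *d = b + 1;     /* instruction 2 */
            *c = a - 1;     /* instruction 3 */
        }
    }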
With tcycle being the instruction cycle, the program will be executed in 3 ∗ 5 ∗ tcycle = 15 ∗ tcycle without pipelining. Let us now consider the implementation of the same algorithm in an ASIP. We can implement the instructions d = a + b and c = a ∗ b in parallel. The same is also true for d = b + 1 and c = a − 1, as illustrated in the figure below:

Fig: ASIP implementation of Algorithm 1

The four operations a + b, a ∗ b, b + 1 and a − 1, as well as the comparison a < b, are executed in parallel in a first stage. Depending on the value of the comparison a < b, the correct values of the first stage's computations are assigned to c and d as defined in the program. Let tmax be the longest time needed by a signal to move from one point to another in the physical implementation of the processor (this happens on the path Input-multiply-multiplex); tmax is also called the cycle time of the ASIP. For two inputs a and b, the results c and d can be computed in time tmax. The VN processor can compete with this ASIP only if 15 ∗ tcycle < tmax, i.e. tcycle < tmax/15: the VN cycle time must be at least 15 times shorter than that of the ASIP for the VN machine to be competitive.

ASIPs use a spatial approach to implement only one application: the functional units needed for the computation of all parts of the application must be available on the surface of the final processor. This kind of computation is called 'spatial computing'. Once again, an ASIP built to perform a given computation cannot be used for tasks other than those for which it was originally designed.
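For contrast with the temporal version above, here is a minimal C model of the ASIP datapath just described. C executes sequentially, so this only mimics the behaviour; in the hardware, all five first-stage results settle concurrently within tmax:

    /* Spatial model of the Algorithm 1 ASIP datapath: all functional
     * units evaluate in parallel, then multiplexers driven by the
     * comparison (a < b) select which results reach the outputs. */
    void algorithm1_asip(int a, int b, int *c, int *d) {
        /* stage 1: all functional units compute concurrently */
        int sum  = a + b;
        int prod = a * b;
        int inc  = b + 1;
        int dec  = a - 1;
        int sel  = (a < b);

        /* stage 2: output multiplexers */
        *d = sel ? sum : inc;
        *c = sel ? prod : dec;
    }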
4. Reconfigurable Computing

We can identify two main axes along which to characterize processors: flexibility and performance.

VN computers:
• VN computers are very flexible, because they are able to compute any kind of task; this is why the terminology GPP (General Purpose Processor) is used for the VN machine.
• They do not deliver high performance, because they cannot compute in parallel. Moreover, the five steps (IR, D, R, EX, W) needed to perform one instruction become a major drawback, in particular if the same instruction has to be executed on huge sets of data.
• Flexibility is possible because 'the application must always adapt to the hardware' in order to be executed.

ASIPs:
• ASIPs deliver high performance because they are optimized for a particular application; the instruction set required for that application can be built into a chip.
• Performance is possible because 'the hardware is always adapted to the application'.

Ideally, we would like to have the flexibility of the GPP and the performance of the ASIP in the same device. We would like a device able 'to adapt to the application' on the fly. We call such a hardware device a reconfigurable hardware, reconfigurable device or reconfigurable processing unit (RPU). Reconfigurable computing is defined as the study of computation using reconfigurable devices. For a given application, at a given time, the spatial structure of the device is modified so as to use the best computing approach to speed up that application. If a new application has to be computed, the device structure is modified again to match the new application. Contrary to VN computers, which are programmed by a set of instructions to be executed sequentially, the structure of reconfigurable devices is changed by modifying all or part of the hardware at compile-time or at run-time, usually by downloading a so-called bitstream into the device. Configuration is the process of changing the structure of a reconfigurable device at start-up time; reconfiguration is the process of changing the structure of the device at run-time.

5. Fields of Application

5.1 Rapid Prototyping
• The development of an ASIC, which can be seen as the physical implementation of an ASIP, is a cumbersome process consisting of several steps, from the specification down to the layout of the chip and the final production. Because of the different optimization goals in development, several teams of engineers are usually involved. This increases the non-recurring engineering (NRE) cost, which can only be amortized if the final product is produced in very large quantities.
• Contrary to software development, errors discovered after production mean an enormous loss, because the produced pieces become unusable.
• Rapid prototyping allows a device to be tested in real hardware before the final production. In this way, errors can be corrected without affecting the pieces already produced. A reconfigurable device is useful here, because it can be used several times to implement different versions of the final product until an error-free state is reached.
• One of the concepts related to rapid prototyping is hardware emulation, in analogy to software simulation. With hardware emulation, a hardware module is tested under real operating conditions in the environment where it will be deployed later.

5.2 In-System Customization
• Time to market is one of the main challenges electronic manufacturers face today. In order to secure market segments, manufacturers must release their products as quickly as possible. In many cases, a well-working product can be released with reduced functionality; the manufacturer can then upgrade the product in the field to incorporate new functionality.
• A reconfigurable device provides such a capability of being upgraded in the field, simply by changing the configuration.
• In-system customization can also be used to upgrade systems deployed in inaccessible or hard-to-access locations. One example is the Mars rover vehicles, which use FPGAs that can be reconfigured from Earth.

5.3 Multi-modal Computation
• The number of electronic devices we interact with is permanently increasing. Besides a mobile phone, many people carry other devices such as handhelds, portable MP3 players, portable video players, etc.
• Besides those mobile devices, fixed devices such as navigation systems, music and video players as well as TV sets are available in cars and at home.
• All those devices are equipped with electronic control units that run the desired application on the desired device. Furthermore, many of the devices are used in a time-multiplexed fashion: it is difficult to imagine someone playing MP3 songs while watching a video clip and taking a phone call at the same time.
• For a group of devices used exclusively in a time-multiplexed way, a single electronic control unit can therefore suffice. Whenever a service is needed, the control unit is connected to the corresponding device at the correct location and reconfigured with the adequate configuration.
• For instance, a domestic MP3 player, a domestic DVD player, a car MP3 player, a car DVD player as well as a mobile MP3 player and a mobile video player can all share the same electronic unit, if they are always used by the same person.
• The user just needs to remove the control unit from the domestic devices and connect it to the car device when going to work. The control unit can be removed from the car and connected to a mobile device if the user decides to go for a walk. Coming back home, the electronic control unit is removed from the mobile device and used for watching video.
5.4 Adaptive Computing Systems
• Advances in computation and communication are driving the development of ubiquitous and pervasive computing.
• The design of ubiquitous computing systems is a cumbersome task that cannot be dealt with at compile time alone. Because of the uncertainty and unpredictability of such systems, it is impossible at compile time to address all scenarios that may arise at run-time from unpredictable changes in the environment.
• We need computing systems that are able to adapt their behaviour and structure to changing operating and environmental conditions, to time-varying optimization objectives, and to physical constraints such as changing protocols and new standards. We call such computing systems adaptive computing systems.
• Reconfiguration can provide a good foundation for the realization of adaptive systems, because it allows a system to react quickly to changes by adopting the optimal behaviour for a given run-time scenario.

6. Reconfigurable Devices
• Broadly considered, reconfigurable devices fill their silicon area with a large number of computing primitives, interconnected via a configurable network.
• The operation of each primitive can be programmed, as can the interconnect pattern. Computational tasks can be implemented spatially on the device, with intermediates flowing directly from the producing function to the receiving function.
• Reconfigurable computing thus generally provides spatially-oriented processing, rather than the temporally-oriented processing typical of programmable architectures such as microprocessors.

6.1 Characteristics of Reconfigurable Devices

The key differences between reconfigurable machines and conventional processors are:

Instruction distribution – Rather than broadcasting a new instruction to the functional units on every cycle, instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and effectively deliver more instructions into active silicon on each cycle.

Spatial routing of intermediates – As space permits, intermediate values are routed in parallel from producing function to consuming function, rather than forcing all communication to take place in time through a central resource bottleneck.

More, often finer-grained, separately programmable building blocks – Reconfigurable devices provide a large number of separately programmable building blocks, allowing a greater range of computations to occur per time step. This effect is largely enabled by the compressed instruction distribution.

Distributed, deployable resources, eliminating bottlenecks – Resources such as memory, interconnect, and functional units are distributed and deployed based on need, rather than being centralized in large pools. Independent, local access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a central resource bottleneck.

7. Programmable, Configurable and Fixed-Function Devices

Programmable – We use the term 'programmable' to refer to architectures that heavily and rapidly reuse a single piece of active circuitry for many different functions. The canonical example of a programmable device is a processor, which may perform a different instruction on its ALU on every cycle. All processors, be they microcoded, SIMD, vector, or VLIW, are included in this category.
Configurable – We use the term 'configurable' to refer to architectures where the active circuitry can perform any of a number of different operations, but the function cannot be changed from cycle to cycle. FPGAs are our canonical example of a configurable device.

Fixed function, limited operation diversity, high throughput – When the function and data granularity to be computed are well understood and fixed, and when the function can be economically implemented in space, dedicated hardware provides the most computational capacity per unit area for the application.

Variable function, low diversity – If the required function is unknown or varying, but the instruction or data diversity is low, the task can be mapped directly to a reconfigurable computing device, which extracts high computational density efficiently.

Space limited, high entropy – If we are limited spatially and the function to be computed has a high operation and data diversity, we are forced to reuse the limited active space heavily and accept limited instruction and data bandwidth. In this regime, conventional processor organizations are most effective, since they dedicate considerable space to on-chip instruction storage in order to minimize off-chip instruction traffic while executing descriptively complex tasks.
8. Key Relations

From an empirical review of conventional reconfigurable devices, we see that 80-90% of the area is dedicated to the switches and wires making up the reconfigurable interconnect. Most of the remaining area goes into configuration memory for the network; the actual logic accounts for only a few percent of the area in a reconfigurable device. This interconnect and configuration overhead is responsible for the roughly 100X density disadvantage which reconfigurable devices suffer relative to hardwired logic. To a first-order approximation, this gives us:

Anet ≈ 10 ∗ Aconfig ≈ 100 ∗ Alogic (Relation 1.1)

It is this basic relationship which characterizes the RP design space (a worked instance follows the bullets below).
• Since Aconfig << Anet, devices with a single on-chip configuration, such as most reconfigurable devices, can afford to exert fine-grained control over their operations: any savings associated with sharing configuration bits would be small compared to the network area.
• Since Aconfig << Anet, to pack the most functional diversity into a part, one can allocate multiple configurations on chip. With the order-of-magnitude relative sizes given in Relation 1.1, up to a 10X increase in the functional diversity per unit area is attainable in this manner.
• However, since Anet is only 10X Aconfig, if the number of configurations is large, say 100 or more, the configuration memory area becomes the dominant size factor. Processors are essentially optimized into this regime, and this accounts for their roughly 10 times lower performance density compared to reconfigurable devices.
• Once we go to a large number of contexts, such that the total configuration memory space begins to dominate the interconnect area, fine granularity becomes costly. In this regime, wide-word operations allow us to share instruction area across multiple bit-processing elements. This simultaneously allows machines with wide (e.g. 32-bit) datapaths to hold thousands of configurations on chip while making them only about 10 times less computationally dense than fine-grained, single-context devices.
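As a worked instance of Relation 1.1 (the unit sizes are illustrative, chosen only to respect the order-of-magnitude ratios above): let Alogic = 1, Aconfig = 10 and Anet = 100 area units per processing element. A single-context element then occupies 1 + 10 + 100 = 111 units, for a diversity of 1/111 configurations per unit area. With k on-chip contexts it occupies 1 + 10k + 100 units and holds k configurations, i.e. k / (101 + 10k) configurations per unit area. This ratio approaches 1/10 as k grows; at k = 100 it is 100/1101 ≈ 1/11, roughly ten times the single-context value, and beyond that point the configuration memory (10k >> Anet) dominates the area, which is exactly the regime described in the last two bullets.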
9. General-Purpose Computing
• General-purpose computing devices are intended for those cases where we cannot, or need not, dedicate sufficient spatial resources to support an entire computational task, or where we do not know enough about the required task or tasks prior to fabrication to hardwire the functionality.
• The key ideas behind general-purpose processing are:
i. Postpone binding of functionality until the device is employed, i.e. after fabrication;
ii. Exploit temporal reuse of limited functional capacity.
• Delayed binding and temporal reuse work closely together and occur at many scales to provide the characteristics we now expect from general-purpose computing devices:

Market Level – Rather than dedicating a machine design to a single application or application family, the design effort may be utilized for many different applications.

System Level – Rather than dedicating an expensive machine to a single application, the machine may perform different applications at different times by running different sets of instructions.

Application Level – Rather than spending precious real estate to build a separate computational unit for each different function required, central resources may be employed to perform these functions in sequence, with an additional input, an instruction, telling them how to behave at each point in time.

Algorithm Level – Rather than fixing the algorithms which an application uses, an existing general-purpose machine can be reprogrammed with new techniques and algorithms as they are developed.

User Level – Rather than fixing the function of the machine at the supplier, the instruction stream specifies the function, allowing the end user to use the machine as best suits his needs.
Machines may be used for functions which the original designers did not conceive. Further, machine behaviour may be upgraded in the field without incurring any hardware or hardware-handling costs.

10. General-Purpose Computing Issues

There are two key features associated with general-purpose computers which distinguish them from their specialized counterparts: interconnect and instructions.

Interconnect
• In general-purpose machines, the datapaths between functional units cannot be hardwired.
• Different tasks require different patterns of interconnect between the functional units. Within a task, individual routines and operations may require different interconnectivity of functional units.
• General-purpose machines must therefore provide the ability to direct data flow between units. In the extreme case of a single functional unit, memory locations are used to perform this routing function.
• As more functional units operate together on a task, spatial switching is required to move data among functional units and memory.
• The flexibility and granularity of this interconnect is one of the big factors determining the yielded capacity on a given application.

Instructions
• Since general-purpose devices must provide different operations over time, either within a computational task or between computational tasks, they require additional inputs, instructions, which tell the silicon how to behave at any point in time.
• Each general-purpose processing element needs one instruction to tell it what operation to perform and where to find its inputs.
• The handling of this additional input is one of the key distinguishing features between different kinds of general-purpose computing structures.
• When the functional diversity is large and the required task throughput is low, it is not efficient to build the entire application dataflow spatially in the device. Rather, we can realize applications, or collections of applications, by sharing and reusing limited hardware resources in time (see Figure 2.1) and only replicating the less expensive memory for instruction and intermediate data storage.

11. Regular and Irregular Computing Tasks

Computing tasks can be classified informally by their regularity:
• Regular tasks perform the same sequence of operations repeatedly. Regular tasks have few data-dependent conditionals, such that all data processed requires essentially the same sequence of operations with a highly predictable flow of execution. Nested loops with static bounds and minimal conditionals are the typical example of regular computational tasks.
• Irregular computing tasks, in contrast, perform highly data-dependent operations. The operations required vary widely according to the data processed, and the flow of operation control is difficult to predict.

12. Metrics: Density, Diversity, and Capacity

The goal of general-purpose computing devices is to provide a common computational substrate which can be used to perform any particular task. Metrics are tools to characterize the amount of computation provided by a given general-purpose structure. For example, if we were simply comparing tasks in terms of a fixed processor instruction set, we might measure this computational work in terms of instruction evaluations. If we were comparing tasks to be implemented in a gate array, we might compare the number of gates required to implement the task and the number of time units required to complete it.
Here, we want to consider the portion of the computational work done for the task on each cycle of execution by diverse, general-purpose computing devices of a certain size.

Functional Density and Functional Capacity

Functional capacity is a space-time metric which tells us how much computational work a structure can do per unit time. Correspondingly, our 'gate-evaluation' metric is a unit of space-time capacity.
• That is, if a device can optimally provide a capacity of Dcap gate evaluations per second to the application, a task requiring Nge gate evaluations can be completed in time:

Ttask = Nge / Dcap
• In practice, limitations on the way the device's capacity may be employed will often cause the task to take longer, resulting in a lower yielded capacity. If the task takes time TTask_actual to perform NTask_ge gate evaluations, then the yielded capacity is:

Dyield = NTask_ge / TTask_actual

• The capacity which a particular structure can provide generally increases with area. To understand the relative benefits of each computing structure or architecture independently of the area of a particular implementation, we can look at the capacity provided per unit area. We normalize area in units of λ, half the minimum feature size, to make the results independent of the particular process feature size.
• Our metric for functional density, then, is capacity per unit area, measured in gate-evaluations per unit space-time, i.e. in gate-evaluations/(λ² ∗ s). The general expression for the functional density, given an operational cycle time tcycle for a unit of silicon of size A evaluating Nge gate evaluations per cycle, is:

D = Nge / (A ∗ tcycle)

Concluding remarks on functional density:
• This density metric tells us the relative merits of dedicating a portion of our silicon budget to a particular computing architecture.
• The density focus makes the implicit assumption that capacity can be composed and scaled to provide the aggregate computational throughput required for a task.
• To the extent this is true, architectures with the highest capacity density that can actually handle a problem should require the least size and cost.
• Remember also that resource capacity is primarily interesting when the resource is limited. We look at computational capacity to the extent that we are compute limited. When we are I/O limited, performance may be controlled much more by I/O bandwidth.

Functional Diversity – Instruction Density

Functional density alone only tells us how much raw throughput we get. It says nothing about how many different functions can be performed, and on what time scale. For this reason, it is also interesting to look at the functional diversity, or instruction density. If Ninstr distinct instructions are resident in an area A, we define the functional diversity as:

Dinstr = Ninstr / A

Concluding remarks on functional diversity:
• We use functional diversity to indicate how many distinct function descriptions are resident per unit of area.
• This tells us how many different operations can be performed within part or all of an IC without going outside of the region or chip for additional instructions.

Data Density

Space must also be allocated to hold data for the computation. This area competes with logic, interconnect, and instructions for silicon area. Thus, we will also look at data density when examining the capacity of an architecture. If we put Nbits of data for an application into a space A, we get a data density:

Ddata = Nbits / A
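A brief numeric illustration of the three metrics (all device parameters here are invented for the example, not taken from any real part): consider a device of area A = 10^8 λ² that evaluates Nge = 1000 gate evaluations per cycle with tcycle = 10 ns. Its peak capacity is Dcap = Nge / tcycle = 10^11 gate-evaluations/s, and its functional density is D = Nge / (A ∗ tcycle) = 1000 / (10^8 λ² ∗ 10^-8 s) = 1000 gate-evaluations/(λ² ∗ s). A task requiring NTask_ge = 10^9 gate evaluations then needs at least Ttask = 10^9 / 10^11 = 10^-2 s; if mapping limitations stretch the actual run-time to 4 ∗ 10^-2 s, the yielded capacity drops to Dyield = 10^9 / (4 ∗ 10^-2 s) = 2.5 ∗ 10^10 gate-evaluations/s, a quarter of the peak. If the same device holds 100 distinct instructions and 10^5 data bits, its functional diversity is 100 / 10^8 = 10^-6 instructions/λ² and its data density is 10^5 / 10^8 = 10^-3 bits/λ².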
13. Fine-Grained and Coarse-Grained Reconfigurable Architectures

Reconfigurable computing devices are divided into two groups, according to the granularity (level of abstraction) of the underlying hardware:
i. Fine-grained reconfigurable architectures;
ii. Coarse-grained reconfigurable architectures.

Fine-grained reconfigurable devices
• These are hardware devices whose functionality can be modified at a very low level of granularity. For instance, the device can be modified so as to add or remove a single inverter or a single two-input NAND gate (a small LUT model at the end of this section makes this concrete).
• Fine-grained reconfigurable devices are mostly represented by programmable logic devices (PLDs). These comprise programmable logic arrays (PLAs), programmable array logic devices (PALs), complex programmable logic devices (CPLDs) and field programmable gate arrays (FPGAs).
• On fine-grained reconfigurable devices, the number of gates is often used as the unit of measure for the capacity of an RPU.
• Example: FPGA
• Issues:
• Bit-level operation => many processing units => large routing overhead => low silicon area efficiency => higher power consumption.
• High volume of configuration data needed => large configuration memory => more power dissipation and long configuration times.

Coarse-grained reconfigurable devices
• Also called pseudo-reconfigurable devices.
• These devices usually consist of a set of coarse-grained elements, such as ALUs, that allow for the efficient implementation of a dataflow function.
• Hardware abstraction is based on a set of fixed blocks (such as functional units, processor cores and memory tiles).
• Reconfiguration is merely the reprogramming of the interconnections.
• Coarse-grained reconfigurable devices do not have any established measure of comparison. However, the number of PEs on a device as well as their characteristics, such as their granularity, the amount of memory on the device and the variety of peripherals, may provide an indication of the quality of the device.
• Coarse-grained architectures overcome the disadvantages of fine-grained architectures:
• They provide multiple-bit-wide datapaths and complex operators;
• They avoid much of the routing overhead;
• They need fewer configuration lines => lower area use.
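To make 'fine-grained' concrete, the sketch promised above: FPGAs typically build their logic from small look-up tables (LUTs), where a k-input LUT stores 2^k configuration bits and can realize any k-input Boolean function. A minimal C model (the type and function names are illustrative):

    #include <stdint.h>

    /* A 4-input LUT: the 16-bit configuration word is the truth
     * table. Reconfiguring the device rewrites 'config', so the
     * same silicon implements a different Boolean function. */
    typedef struct { uint16_t config; } lut4;

    int lut4_eval(const lut4 *l, int a, int b, int c, int d) {
        int row = (d << 3) | (c << 2) | (b << 1) | a;  /* truth-table row */
        return (l->config >> row) & 1;
    }

    /* Example: config = 0x8000 implements a 4-input AND, since only
     * the row a = b = c = d = 1 (row 15) stores a 1. */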