GCD FPGA-Based Design
Ibrahim Hazmi - V00835716

Design and Implementation of the
Euclidean Algorithm for Computing
the Greatest Common Divisor using
ELEC569A Project Final Report (Fall, 2014)
Supervised by Dr. Mihai Sima
ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014
Contents
GCD FPGA-Based Design
Contents
List of Figures
List of Tables
Visual Executive Summary
Introduction
Background of GCD and Euclidean Algorithm
Overview of Spartan-6 FPGA
Project Description and Milestones
Detailed Description of the Design
Behavioural Level: WHILE/FOR LOOP
Behavioural Level: From ASM to FSM (ASM2FSM)
Structural level: GCD Data-Path and Control Units (GCD2SUB)
Structural level: GCD with Sum of Absolute Difference (GCDSAD)
Overview of the Results
Summary of the results for the different architectures
The results in a Chart:
The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)
Summary and Conclusion
Final Thoughts and Suggestions
Bibliography
IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 ii
List of Figures
Fig.1 Prime Factorization method for finding the GCD of two integers
Fig.2 Euclidean Algorithm
Fig.3 Simplified Euclidean GCD Algorithm
Fig.4 XC6SLX25 Floor-plan View in PlanAhead
Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead
Fig.6 The Design Strategy Window from Xilinx Project Navigator
Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm
Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP
Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model
Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
Fig.11 From ASM GCD to Finite State Diagram
Fig.12 The Reduced Finite State Diagram with VHDL Code
Fig.13 The Behavioural Simulation of ASM2FSM Model
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
Fig.15 Block Diagram of the “Original” GCD Data-Path
Fig.16 Block Diagram of the Modified GCD Data-Path
Fig.17 The Control Unit (FSM) with VHDL Code
Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)
Fig.18.b Primitives: CARRY4 Fast Carry-Chain
Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
Fig.20 GCD with Sum of Absolute Difference (GCDSAD)
Fig.21 Carry-out Generation Functions for SAD
Fig.22 Results in a Chart (FOR_LOOP dominated)
Fig.23 The Area-Delay Product
List of Tables
Table 1: Xilinx Spartan-6 FPGA Feature Summary [4], [5]
Table 2: Slice features of Spartan-6 Family (including XC6SLX25) [5]
Table 3: Project Milestones
Table 4: Mapping Report of the FOR_LOOP GCD
Table 5: Synthesis and Timing Report of the FOR_LOOP GCD
Table 6: Mapping Report: ASM2FSM GCD vs. FOR_LOOP GCD
Table 7: Synthesis and Timing Report - ASM2FSM GCD vs. FOR_LOOP GCD
Table 9: Mapping Report: Optimized vs. Simple GCD2SUB
Table 10: Synthesis and Timing Report - Optimized vs. Simple GCD2SUB
Table 11: Mapping Report: GCDSAD
Table 12: Synthesis and Timing Report - GCDSAD
Table 13: Overview of The results (O; Optimized with Primitives)
Table 14: Comparison Summary between Architectures
Visual Executive Summary

The main idea of this project is to design a digital circuit that calculates the GCD of two
16-bit unsigned integers using the Euclidean Algorithm, and to implement it on a Xilinx
Spartan-6 FPGA using different techniques/architectures. The first attempt was to see how far
the compiler goes with the behavioural loop that represents the Euclidean Algorithm. Because
the tools replicated the hardware inside the loop once per iteration, a massive area of the FPGA
was occupied and the number of iterations was limited. Thus, an RTL behavioural architecture
was implemented, in which one iteration runs per clock cycle. The compiler still has the
freedom for placement and routing with the aid of “Design Goals and Strategies”.
Then, the design was built structurally, by port-mapping all functions of the previous design
as components, to see how the compiler would utilize the FPGA differently from the
behavioural one. The structural model consists of two parts: a GCD data-path unit and a GCD
control unit (FSM). Another version of the structural design was created as an attempt to
adapt the idea of the “Sum of Absolute Difference (SAD)” in order to have only one
subtraction instead of two. Finally, Spartan-6 Primitives and Macros were utilized to reduce
the Area-Delay product of the design, and the optimized GCD with two subtractors was
shown to give the minimum Area-Delay product among all the design architectures.
Introduction
Implementing mathematical calculations on hardware platforms such as FPGAs is considerably
more challenging than performing them in a software environment, where the hardware itself
is already equipped with the data-path and control unit for an almost unlimited number
of algorithms and arithmetic operations. The pain of hardware implementation brings a
valuable gain in performance: there is an opportunity to use a smaller area, obtain higher
speed, consume less power, or reach a reasonable combination of all of these.
Calculating the greatest common divisor (GCD) is one of the problems that needs a number of
steps in order to be solved correctly. These steps can be transformed into an iterative
algorithm such as the Euclidean algorithm, which makes the computation understandable and
traceable. This section is divided into three parts: a brief mathematical background on
GCD computation, an overview of the Xilinx Spartan-6 FPGA, and an outline of the project
description highlighting its objective and milestones.
Background of GCD and Euclidean Algorithm
The greatest common divisor (GCD) of two positive integers is the largest integer that
divides both numbers without a remainder [2]. It is also known as the Greatest Common Factor
(GCF), Greatest Common Measure (GCM), Highest Common Divisor (HCD), or Highest
Common Factor (HCF) [1]. The GCD can be computed by determining the prime factors of both
numbers, then multiplying the common prime factors. In practice, this method is not feasible
for large numbers. (Fig. 1) shows an example of how the prime factorization method works.
Fig.1 Prime Factorization method for finding the GCD of two integers
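The prime factorization method can be sketched in software as follows. This is a minimal Python illustration, not part of the report's VHDL design; the function names and the trial-division factorizer are assumptions chosen for brevity.

```python
from collections import Counter

def prime_factors(n: int) -> Counter:
    """Return the prime factorization of n (n >= 1) as {prime: exponent}."""
    factors = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] += 1
            n //= p
        p += 1
    if n > 1:
        factors[n] += 1  # whatever is left is itself prime
    return factors

def gcd_by_factorization(a: int, b: int) -> int:
    """Multiply the primes common to a and b, each to its smaller exponent."""
    common = prime_factors(a) & prime_factors(b)  # Counter '&' keeps minimum counts
    result = 1
    for prime, exponent in common.items():
        result *= prime ** exponent
    return result

# 48 = 2^4 * 3 and 180 = 2^2 * 3^2 * 5 share 2^2 * 3
print(gcd_by_factorization(48, 180))  # 12
```

Trial division needs on the order of the square root of the operand in steps, which is one way to see why the method does not scale to large numbers.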
An efficient method for solving GCD problems is the Euclidean algorithm, which is based
on the fact that the GCD of two numbers also divides the remainder of the division between them:

\[ \gcd(a,b) = \gcd(b,r), \quad \text{where } a = qb + r \]

It is an iterative process (Fig. 2) that takes a number of cycles to compute the GCD.
Divisions are repeated until a remainder \(r_n = 0\) is obtained; then GCD \(= r_{n-1}\):

\[ \gcd(a,b) = \gcd(b,r_1), \quad \gcd(b,r_1) = \gcd(r_1,r_2), \quad \ldots \]

Fig.2 Euclidean Algorithm
As division can be carried out by repeated subtraction, and the GCD of two numbers also
divides their difference [1], the design and implementation of the circuit become easier:

\[ \gcd(a,b) = \gcd(b,(a-b)) = \gcd(a,(b-a)) \]

where the difference \(a - b\) is used when \(a > b\) and \(b - a\) otherwise.
The flowchart in (Fig. 3) illustrates this simple computation process of the GCD. The
circuit should therefore include subtraction and comparison units, in addition to registers
and multiplexers for updating the data in each iteration cycle.
Fig.3 Simplified Euclidean GCD Algorithm
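The two forms of the algorithm described above can be compared with a short Python sketch. It is a software illustration only, with function names chosen for this example; the subtraction variant assumes positive operands, as in the report.

```python
def gcd_remainder(a: int, b: int) -> int:
    """Euclid with division: gcd(a, b) = gcd(b, r) where a = q*b + r."""
    while b != 0:
        a, b = b, a % b
    return a

def gcd_subtract(a: int, b: int):
    """Subtraction-only variant for positive integers: returns (gcd, iterations)."""
    steps = 0
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
        steps += 1
    return a, steps

print(gcd_remainder(1071, 462))  # 21
print(gcd_subtract(1071, 462))   # (21, 11)
```

The subtraction form needs more iterations than the division form, but each iteration only requires a comparator and a subtractor, which is what makes it attractive for a small circuit.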
Overview of Spartan-6 FPGA
Given the operations identified in the previous part, the chosen FPGA should have properties
that accommodate the design units efficiently. For example, subtraction/addition may take
advantage of dedicated components in the FPGA slices, such as the carry chain or the DSP
slices. The Spartan-6 FPGA family from Xilinx provides designers with such components, which
help a great deal in designing the GCD circuit at different levels. “The thirteen-member family delivers
expanded densities ranging from 3,840 to 147,443 logic cells, with half the power consumption
of previous Spartan families, and faster, more comprehensive connectivity,” [4]. (TABLE 1)
shows a feature summary of three devices from this family: the smallest (XC6SLX4), the
largest (XC6SLX150T), and the choice of this project (XC6SLX25), which was the smallest
member of the family able to accommodate the first design, i.e., the reference one. More about
this device and the reference design is presented later in this report.
TABLE 1: XILINX SPARTAN-6 FPGA FEATURE SUMMARY [4], [5]
Device     | Logic Cells | CLB Slices | CLB FFs | CLB RAM (kb) | CLB LUT6 | DSP Slices | RAM Blocks | User I/O
XC6SLX4    | 3,840       | 600        | 4,800   | 75           | 2,400    | 8          | 12         | 132
XC6SLX25   | 24,051      | 3,758      | 30,064  | 229          | 15,032   | 38         | 52         | 226
XC6SLX150T | 147,443     | 23,038     | 184,304 | 1,355        | 92,152   | 180        | 268        | 540
The floor-plan views of the XC6SLX25 in PlanAhead (Fig. 4 and Fig. 5) help to illustrate
the internal construction of the chosen device. (Fig. 4) is a full-scale floor-plan view
that shows the device layout, indicating important elements such as IOB cells, CLBs, DSP,
and block RAM columns.
Fig.4 XC6SLX25 Floor-plan View in PlanAhead
[Fig. 4 labels: Memory Controller Block, Block RAM Column, Clock Management Tile Column, DSP Column, CLB Cell, IOB Cells]
In (Fig. 5), a closer view of the layout reveals the three different slice types inside the
Configurable Logic Blocks (CLBs) surrounding a DSP block. The figure shows that each
CLB contains two slices: one is a SLICEX and the other is either a SLICEL or a SLICEM.
(TABLE 2) presents important slice features of the XC6SLX25 device.
Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead
[Fig. 5 labels: DSP, Flip-Flops, Carry-Chain, Storage LUTs, LUT6]
TABLE 2: SLICE FEATURES OF SPARTAN-6 FAMILY (INCLUDING XC6SLX25) [5]
Slice feature     | SLICEX | SLICEL | SLICEM
6-Input LUTs      | √      | √      | √
8 Flip-flops      | √      | √      | √
Wide Multiplexers |        | √      | √
Carry Logic       |        | √      | √
Distributed RAM   |        |        | √
Shift Registers   |        |        | √
Project Description and Milestones
The objective of this project is to design a digital circuit that calculates the GCD of two
16-bit unsigned integers using the Euclidean Algorithm and to implement it on a Xilinx
Spartan-6 FPGA. The Euclidean GCD circuit was implemented using different architectures
in order to examine the trade-off between area and speed, i.e., the Area-Delay product, and
to decide which design makes better use of the dedicated configurable components inside
the FPGA. The first step was to implement a simple behavioural loop, i.e., a direct
interpretation of the Euclidean GCD Algorithm, using FOR_LOOP to see how the compiler
would represent a large number of iterations. Taking this design as the reference, the next
step was to implement the Euclidean GCD circuit at the following levels:
A. RTL behavioural level, where the design is simply a Finite State Machine (FSM) that
performs the GCD calculation sequentially, as a lower-level interpretation of the Algorithm.
In this case, the compiler was free to translate the operations into different
units/components, place all these components, and route all the connections with the aid of
“Design Goals and Strategies” optimization.
B. Structural level, where the data flow of the GCD Algorithm is transformed into an
arithmetic circuit, i.e., a data-path unit (DP), and the iteration process is handled by a simple
control unit (CU), basically an FSM. The design was built by transforming all
functions, such as comparison, subtraction, and data transfer, into components and
port-mapping them in a top-level entity, to see how the compiler would utilize the FPGA
differently from the behavioural implementation. At this level, a “Sum of Absolute
Difference” (SAD) circuit was introduced in order to replace the two subtraction units
with one computation unit. Finally, some functions were designed utilizing Primitives,
e.g., Look-Up-Tables (LUTs), Carry-Chains, Flip-Flops, Exclusive-ORs, and/or DSPs, and
Macros, e.g., the ADDSUB macro, with which the calculation unit was optimized in terms
of occupied area inside the FPGA.
(TABLE 3) outlines the project milestones and approximately how much of each was
achieved during the ELEC569A course.
TABLE 3: PROJECT MILESTONES
Description | Done (%)
Design the Simple Behavioural Loop and examine its aspects and limitations | 100
Design the Behavioural FSM Model and test its features and margins | 100
Design the direct Structural Model (DP+CU) and compare it with the behavioural | 100
Design the Optimized Structural Model (SAD) and compare it with the direct structural | 100
Get into Primitive Level and utilize the dedicated elements for faster computation | 100
Report the Area/Delay Comparison between all the architectures and propose suggestions | 100
In the next section, all of the above stages are presented and discussed in sequence. It
is helpful at this point to mention that the target is to obtain the minimum Area-Delay product,
which can be achieved by reducing the area and/or the time delay of the circuit. By
setting the optimization goal to area (Fig. 6), a smaller area of the FPGA is utilized, which
reduces Area*Delay. At the same time, it might also lead to higher speed, on the assumption
that the smaller the area, the fewer jumps through interconnections are needed.
Fig.6 The Design Strategy Window from Xilinx Project Navigator
Detailed Description of the Design
As mentioned in the previous section, a simple behavioural loop was implemented first,
in order to see how the compiler understands loops and how it deals with a large number of
iterations. The results of this reference design then motivated trying different
architectures in order to obtain a smaller area and fewer jumps through the
interconnections.
Behavioural Level: WHILE/FOR LOOP
Starting with the direct WHILE_LOOP that represents the Euclidean GCD Algorithm
(Fig. 7), the result tells a lot about how the tools treat loops. The compiler simply
replicates the corresponding circuit for every iteration until the loop ends. It was not too
surprising that the compiler did not synthesize the WHILE loop: the number of iterations is
not known at compile time, so the loop is potentially unbounded and would require an
unbounded number of circuit copies. In hardware, the number of copies has to be finite.
Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm
    While (A /= B) Loop
        If (A > B) Then
            A := A - B;
        Else
            B := B - A;
        End If;
    End Loop;
    GCD <= B;
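To make the synthesis problem concrete, the following Python sketch counts how many times the body of the WHILE loop runs for a given pair of inputs. It is a software model only; the function name is illustrative, and positive operands are assumed.

```python
def subtractive_iterations(a: int, b: int) -> int:
    """Count passes through the WHILE loop body for positive inputs a and b."""
    count = 0
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
        count += 1
    return count

WIDTH = 16                    # operand width used in the report
MAX_VALUE = 2 ** WIDTH - 1    # largest 16-bit unsigned value, 65535

print(subtractive_iterations(12, 8))         # 2
print(subtractive_iterations(MAX_VALUE, 1))  # 65534
```

The iteration count depends entirely on the data, and for 16-bit operands it can reach 65534, so a fully unrolled circuit that covers every input pair is not practical.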
Thus, the transition to the finite FOR_LOOP (Fig. 8) was the obvious next step, where the
maximum number of iterations must be defined in advance. However, fixing the number of
iterations before computing the GCD limits the design: every input pair that needs more
iterations than the chosen bound returns zero.
Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP
Before going through the design and implementation results of this model and
proceeding to the other levels, it is essential to point out that the WHILE_LOOP
Model is a direct transcription of the Euclidean GCD Algorithm. Therefore, all the
following design effort can be considered as an attempt to produce a synthesizable version
of the WHILE_LOOP Model.
The number of iterations in the FOR_LOOP Model was set to 100, which means
that for any two numbers that require more than a hundred iterations to compute their GCD
(e.g., 511 and 2), the result will be zero. The behavioural and Post-Route simulations of this
design are shown in (Fig. 9), including the delay for some input examples.
FOR_LOOP listing (Fig. 8):
    For i in 1 to 100 Loop
        If (A /= B) Then
            If (A > B) Then
                A := A - B;
            Else
                B := B - A;
            End If;
        Else
            GCD <= B;
        End If;
    End Loop;
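The effect of the fixed bound can be reproduced with a small Python model of the loop. This is a sketch under one assumption that the listing does not show: the output is taken to default to zero when the loop never reaches A = B, which matches the behaviour described in the text.

```python
MAX_ITERATIONS = 100  # loop bound used by the FOR_LOOP model in the report

def gcd_for_loop(a: int, b: int, max_iterations: int = MAX_ITERATIONS) -> int:
    """Mimic the unrolled FOR loop: the output is only driven once A equals B."""
    gcd = 0  # assumed default output when the loop never reaches A = B
    for _ in range(max_iterations):
        if a != b:
            if a > b:
                a -= b
            else:
                b -= a
        else:
            gcd = b
    return gcd

print(gcd_for_loop(12, 8))                       # 4
print(gcd_for_loop(511, 2))                      # 0, needs 256 subtractions
print(gcd_for_loop(511, 2, max_iterations=300))  # 1
```

Raising the bound fixes the second case, but in hardware every extra iteration is another copy of the comparator, subtractor, and multiplexer stage.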
Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model
Furthermore, the complexity of the generated circuit was very high, as the tools
converted the loop into a massive number of components: the comparators, subtractors, and
multiplexers were simply replicated about a hundred times, with no registers at all. (Fig. 10)
shows the RTL, Technology Schematic, and Floor-plan (from PlanAhead) of the FOR_LOOP Model.
Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
(TABLE 4) highlights the huge number of units that are mapped to satisfy the FOR_LOOP
design requirements, whereas (TABLE 5) summarizes the Synthesis report, including Timing.
TABLE 4: MAPPING REPORT OF THE FOR_LOOP GCD
Hardware Statistics | XC6SLX25 Total | Used  | %
# Slices            | 3,758          | 2,697 | 71.77%
# LUTs              | 15,032         | 8,776 | 58.38%
# MUXCYs            | 7,516          | 4,760 | 63.33%
# Registers         | 30,064         | 0     | 0.00%
# DSP               | 38             | 0     | 0.00%
# IOBs              | 226            | 48    | 21.24%
TABLE 5: SYNTHESIS AND TIMING REPORT OF THE FOR_LOOP GCD
Macros Statistics            | Count
# 16-bit Add/Sub/Acc         | 198
# Registers                  | 0
# 16-bit Comparators (=,>)   | 199
# 2-1 Multiplexers (16-bit)  | 298
# XOR                        | 0
# DSP                        | 0
# FSM                        | 0
Time Element                 | ns
Register to Register Paths   | 0
Input to Register Paths      | 0
Register to Out-pad Paths    | 0
In-pad to Out-pad Paths      | 478
Total Time Delay             | 478
In this model, nothing further can be done except changing the maximum number of
iterations, which affects the performance (i.e., a higher maximum gives a longer delay).
In principle the design should be fast (see the behavioural simulation), as it is purely
combinational. Yet the huge circuitry forces signals to jump through many
interconnections. Finally, this model works faster in larger devices such as the XC6SLX150.
Behavioural Level: From ASM to FSM (ASM2FSM)
Recalling the “Simplified Euclidean GCD Algorithm” in (Fig. 3), it can be
considered as an Algorithmic State Machine (ASM) chart that describes the behaviour of the
GCD circuit. The three-state FSM is then an RTL implementation of the ASM (Fig. 11).
Fig.11 From ASM GCD to Finite State Diagram
Applying the basic “State Reduction” rule (S1 => S2) gives the new FSM, which is shown
with a sample of the code in (Fig. 12).
Fig.12 The Reduced Finite State Diagram with VHDL Code
Reduced FSM listing (Fig. 12):
    WHEN S0 =>
        IF (Start = '1') THEN
            NextState <= S1;
        Else
            NextState <= S0;
        End If;
    WHEN S1 =>
        AM <= AS; BM <= BS;
        IF (AR = BR) THEN
            GCD <= BR;
            NextState <= S0;
        ELSIF (AR > BR) THEN
            EnA <= '1';
            EnB <= '0';
            NextState <= S1;
        Else
            EnA <= '0';
            EnB <= '1';
            NextState <= S1;
        End If;
    -- Concurrent subtractions feeding the register inputs
    AS <= AR - BR;
    BS <= BR - AR;
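The behaviour of the reduced FSM can be checked with a cycle-level Python model. This is a sketch, not the report's VHDL: it assumes that the operands are loaded into the AR and BR registers in S0 when Start is asserted, and it counts one loop pass as one clock cycle.

```python
def gcd_fsm(a: int, b: int):
    """Two-state model: S0 loads the operands, S1 does one subtraction per clock.

    Returns (gcd, clock_cycles). Operands must be positive.
    """
    state = "S0"
    ar = br = 0                 # AR and BR registers
    start, cycles, result = True, 0, None
    while result is None:
        cycles += 1             # one pass of the loop models one clock cycle
        if state == "S0":
            if start:
                ar, br = a, b   # assumed: operands are loaded when Start = '1'
                start = False
                state = "S1"
        else:                   # state S1
            if ar == br:
                result = br     # GCD <= BR, then back to idle
                state = "S0"
            elif ar > br:
                ar -= br        # EnA = '1': AR takes AR - BR
            else:
                br -= ar        # EnB = '1': BR takes BR - AR
    return result, cycles

print(gcd_fsm(12, 8))   # (4, 4)
print(gcd_fsm(511, 2))  # (1, 258)
```

Unlike the FOR_LOOP model, the pair (511, 2) now produces the right answer; the cost is latency in clock cycles rather than replicated hardware.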
(Fig. 13) highlights the Behavioural Simulation results of the ASM2FSM Model, while
(Fig. 14) shows RTL, Technology Schematic, and Floor-plan (from PlanAhead).
Fig.13 The Behavioural Simulation of ASM2FSM Model
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
Again, (TABLE 6 & 7) highlight the Mapping and Synthesis reports, including Timing,
and indicate a dramatic improvement in terms of both area and speed.
TABLE 6: MAPPING REPORT: ASM2FSM GCD VS. FOR_LOOP GCD
HW Statistics | Total  | FOR_LOOP Used | FOR_LOOP % | ASM2FSM Used | ASM2FSM %
# Slices      | 3,758  | 2,697         | 71.77%     | 21           | 0.56%
# LUTs        | 15,032 | 8,776         | 58.38%     | 58           | 0.39%
# MUXCYs      | 7,516  | 4,760         | 63.33%     | 48           | 0.64%
# Registers   | 30,064 | 0             | 0.00%      | 33           | 0.11%
# DSP         | 38     | 0             | 0.00%      | 0            | 0.00%
# IOBs        | 226    | 48            | 21.24%     | 52           | 23.01%
TABLE 7: SYNTHESIS AND TIMING REPORT - ASM2FSM GCD VS. FOR_LOOP GCD
Macros Statistics               | FOR | ASM
# 16-bit Add/Sub/Acc            | 198 | 2
# Registers                     | 0   | 33
# 16-bit Comparators (=,>)      | 199 | 2
# 2-1 Multiplexers (1, 16-bit)  | 298 | 8
# XOR                           | 0   | 0
# DSP                           | 0   | 0
# FSM                           | 0   | 1
Time Element (ns)               | FOR | ASM
Register to Register Paths      | 0   | 5.07
Input to Register Paths         | 0   | 2.96
Register to Out-pad Paths       | 0   | 6.59
In-pad to Out-pad Paths         | 478 | 0
Total Time Delay                | 478 | 14.62
The ASM2FSM results are far better than the FOR_LOOP ones, considering the huge area
saving and the ability to compute the GCD for inputs that need a very large number of
iterations.
Structural level: GCD Data-Path and Control Units (GCD2SUB)
The next step was to build a Data-Path for the computation unit, which could be as
shown in the block diagram in (Fig. 15).
Fig.15 Block Diagram of the “Original” GCD Data-Path
Because a comparison can be implemented as a subtraction, it was natural to use the
CARRY_OUT signals of the subtractors in (Fig. 15) as the AGB and ALB signals (Fig. 16).
Fig.16 Block Diagram of the Modified GCD Data-Path
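The idea of reusing the subtractors as comparators can be sketched in Python. The polarity is an assumption, since the report does not state it: here a missing carry-out (a borrow) of B - A is read as A > B, and a missing carry-out of A - B as A < B. The function names are illustrative.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1

def subtract_with_carry(x: int, y: int):
    """Return (difference, carry_out) of x - y computed as x + ~y + 1 on WIDTH bits."""
    total = x + ((~y) & MASK) + 1
    return total & MASK, (total >> WIDTH) & 1

def compare(a: int, b: int):
    """Derive AGB, ALB and AEB from the carry-outs of the two subtractors."""
    _, carry_a_minus_b = subtract_with_carry(a, b)
    _, carry_b_minus_a = subtract_with_carry(b, a)
    agb = 1 - carry_b_minus_a   # B - A borrows exactly when A > B
    alb = 1 - carry_a_minus_b   # A - B borrows exactly when A < B
    aeb = 1 - (agb | alb)       # AEB = AGB NOR ALB
    return agb, alb, aeb

print(compare(5, 3))  # (1, 0, 0)
print(compare(3, 5))  # (0, 1, 0)
print(compare(7, 7))  # (0, 0, 1)
```

With this arrangement the dedicated comparators disappear from the data-path, at the price of placing the subtractors on the critical path.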
The “FSM block” in (Fig. 16) refers to the Control Unit (Fig. 17) that drives the Control
signals of the GCD data-path (i.e., Registers’ Enable signals). It is important to note that the
MUXs’ select signals are driven by the signals AGB and AEB directly, whereas for the REGs’
enable signals, smaller MUXs (i.e., 1-bit) were built by the control unit.
Fig.17 The Control Unit (FSM) with VHDL Code
In this model, the subtractors are the bottleneck of the design, as they perform the
subtraction and the comparison at the same time. They need to be fast enough that their
results are ready before the next clock edge. Therefore, the fast CARRY4 primitive (Fig. 18.b),
which utilizes the dedicated carry chain in SLICEL and SLICEM inside the Spartan-6 FPGA, was
adopted in the design to perform faster subtraction. Furthermore, LUT2 and LUT3 macros were
used to implement logic functions such as AND, XOR, and the multiplexer (Fig. 18.a).
Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)
Control unit listing (Fig. 17):
    AEB <= AGB NOR ALB;
    Finish <= AEB;
    WHEN S0 =>
        IF (Start = '1') THEN
            NextState <= S1;
        Else
            NextState <= S0;
        End If;
    WHEN S1 =>
        EnA <= AGB;
        EnB <= ALB;
        IF (AEB = '1') THEN
            NextState <= S0;  -- A = B: finished, return to idle
        Else
            NextState <= S1;  -- keep subtracting
        End If;
Primitive and macro instantiations (Fig. 18.a):
    MUX2x1_inst : LUT3 -- MUX 2x1
        GENERIC MAP (INIT => X"AC")
        PORT MAP (O => O, I2 => S, I1 => A, I0 => B);
    XOR_inst : LUT2 -- A XOR B
        GENERIC MAP (INIT => X"6")
        PORT MAP (O => P, I1 => A, I0 => B);
    AND_inst : LUT2 -- A AND B
        GENERIC MAP (INIT => X"8")
        PORT MAP (O => P, I1 => A, I0 => B);
    FF_inst : FDCE -- Flip-Flop
        GENERIC MAP (INIT => '0')
        PORT MAP (Q => Q, C => C, CE => CE, CLR => CLR, D => D);
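The INIT constants can be checked by treating each LUT as a truth table indexed by its inputs. The Python sketch below is a software illustration; the helper name lut and the most-significant-input-first argument order are choices made for this example.

```python
def lut(init: int, *inputs: int) -> int:
    """Evaluate a LUT: inputs are given most significant first (e.g. I2, I1, I0)."""
    index = 0
    for bit in inputs:
        index = (index << 1) | (bit & 1)
    return (init >> index) & 1   # the selected INIT bit is the LUT output

MUX_INIT, XOR_INIT, AND_INIT = 0xAC, 0x6, 0x8

# LUT3 with I2 = S, I1 = A, I0 = B: passes A when S = 0 and B when S = 1
mux_table = {(s, a, b): lut(MUX_INIT, s, a, b)
             for s in (0, 1) for a in (0, 1) for b in (0, 1)}
# LUT2 with I1 = A, I0 = B
xor_table = {(a, b): lut(XOR_INIT, a, b) for a in (0, 1) for b in (0, 1)}
and_table = {(a, b): lut(AND_INIT, a, b) for a in (0, 1) for b in (0, 1)}

print(mux_table[(0, 1, 0)], mux_table[(1, 1, 0)])  # 1 0
print(xor_table[(1, 0)], and_table[(1, 1)])        # 1 1
```

Reading the constants this way shows that X"AC" selects between its two data inputs, X"6" is the XOR pattern 0110, and X"8" is the AND pattern 1000.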
Fig.18.b Primitives: CARRY4 Fast Carry-Chain
(Fig. 19) shows RTL, Technology Schematic, and Floor-plan of GCD2SUB.
Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
CARRY4 instantiation (Fig. 18.b):
    CARRY4_inst : CARRY4 PORT MAP (
        CO     => CO,   -- 4-bit carry out
        O      => Sub,  -- 4-bit carry chain XOR data out
        CI     => '1',  -- 1-bit carry cascade input
        CYINIT => '1',  -- 1-bit carry initialization
        DI     => A,    -- 4-bit carry-MUX data in
        S      => P);   -- 4-bit carry-MUX select input
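How a carry chain produces a subtraction can be illustrated bit by bit in Python. This is a behavioural sketch under stated assumptions, not a model of the Xilinx primitive itself: the select input is taken to be the propagate signal P = A xor (not B), the data input is A, and the chain is seeded with a carry-in of 1 so that it computes A + not(B) + 1.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1

def carry_chain_subtract(a: int, b: int):
    """Compute a - b the way a chain of carry multiplexers and XOR gates would.

    Returns (difference, carry_out) on WIDTH bits.
    """
    not_b = (~b) & MASK
    carry = 1                                # initial carry supplies the "+1"
    difference = 0
    for i in range(WIDTH):
        a_bit = (a >> i) & 1
        p_bit = a_bit ^ ((not_b >> i) & 1)   # propagate: drives the select input
        difference |= (p_bit ^ carry) << i   # sum bit = propagate xor carry-in
        carry = carry if p_bit else a_bit    # carry mux: pass carry or take DI
    return difference, carry

print(carry_chain_subtract(5, 3))  # (2, 1)
print(carry_chain_subtract(3, 5))  # (65534, 0)
```

Four 4-bit blocks cascaded through their carry signals would cover the 16-bit operands, and the final carry-out doubles as the comparison result.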
The Mapping and Synthesis reports (TABLE 9 & 10) confirm the expected results: the
optimized design is clearly both faster and smaller.
TABLE 9: MAPPING REPORT: OPTIMIZED VS. SIMPLE GCD2SUB
HW Statistics | Total  | Simple Used | Simple % | Optimized Used | Optimized %
# Slices      | 3,758  | 16          | 0.43%    | 15             | 0.40%
# LUTs        | 15,032 | 53          | 0.35%    | 51             | 0.34%
# MUXCYs      | 7,516  | 52          | 0.69%    | 32             | 0.43%
# Registers   | 30,064 | 33          | 0.11%    | 51             | 0.17%
# DSP         | 38     | 0           | 0.00%    | 0              | 0.00%
# IOBs        | 226    | 52          | 23.01%   | 52             | 23.01%
TABLE 10: SYNTHESIS AND TIMING REPORT - OPTIMIZED VS. SIMPLE GCD2SUB
Macros Statistics               | S2S   | O2S
# 16-bit Add/Sub/Acc            | 2     | 0
# Registers                     | 33    | 33
# 16-bit Comparators (=,>)      | 2     | 0
# 2-1 Multiplexers (1, 16-bit)  | 3     | 3
# XOR                           | 0     | 0
# DSP                           | 0     | 0
# FSM                           | 1     | 1
Time Element (ns)               | S2S   | O2S
Register to Register Paths      | 5.14  | 3.31
Input to Register Paths         | 3.07  | 2.96
Register to Out-pad Paths       | 6.72  | 3.67
In-pad to Out-pad Paths         | 0     | 0
Total Time Delay                | 14.93 | 9.94
The comparison is between two versions of the GCD2SUB: the Optimized version (O2S),
using primitives and macros, and a Simple version (S2S) with high-level components (i.e.,
"-" for subtraction, "Select" for the multiplexer, etc.; even the comparator was defined in
this version). Although the tool is capable of optimizing macros well, the designer can
utilize the dedicated Primitives and Macros for more efficient optimization.
Structural level: GCD with Sum of Absolute Difference (GCDSAD)
The Sum of Absolute Difference (SAD) circuit replaces the two subtractors using a Carry-Out
Generation Function (Fig. 20 & 21). It was expected to give a better result than GCD2SUB, as
it uses fewer components and produces fewer outputs.
Fig.20 GCD with Sum of Absolute Difference (GCDSAD)
Fig.21 Carry-out Generation Functions for SAD
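At the algorithm level, the single-unit idea can be sketched in Python as below. This is an illustration of the absolute-difference update only, with an illustrative function name; it does not model the carry-out generation logic of the SAD circuit, and positive operands are assumed.

```python
def gcd_abs_difference(a: int, b: int):
    """Iterate with a single |A - B| unit; returns (gcd, iterations)."""
    iterations = 0
    while a != b:
        difference = abs(a - b)   # the one shared computation unit
        if a > b:
            a = difference        # the larger operand is replaced
        else:
            b = difference
        iterations += 1
    return a, iterations

print(gcd_abs_difference(12, 8))      # (4, 2)
print(gcd_abs_difference(1071, 462))  # (21, 11)
```

The iteration count is the same as with two subtractors; only the amount of arithmetic hardware per iteration changes.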
Carry-out generation listing (Fig. 21):
    GPBLOCK: FOR i IN 1 TO (N/4) generate
        PB4(i) <= P(4*i-1) AND P(4*i-2) AND P(4*i-3) AND P(4*i-4);
        GB4(i) <= G(4*i-1) OR (G(4*i-2) AND P(4*i-1))
                  OR (G(4*i-3) AND P(4*i-1) AND P(4*i-2))
                  OR (G(4*i-4) AND P(4*i-1) AND P(4*i-2) AND P(4*i-3));
    END Generate GPBLOCK;
    PB8 <= PB4(1) AND PB4(2);
    GB8 <= GB4(2) OR (GB4(1) AND PB4(2));
    GN  <= G(3) OR (G(2) AND P(3))
           OR (G(1) AND P(3) AND P(2))
           OR (G(0) AND P(3) AND P(2) AND P(1));
    CO  <= GB4(4) OR (PB4(4) AND C12);
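The block generate and propagate equations can be exercised in Python to confirm that they yield the true carry-out of an addition. This is a sketch with illustrative names; the blocks are chained one after another here, which is a simplification of the grouping used in the listing.

```python
WIDTH, BLOCK = 16, 4

def lookahead_carry_out(x: int, y: int, carry_in: int = 0) -> int:
    """Carry-out of x + y + carry_in from 4-bit block generate/propagate terms."""
    g = [(x >> i) & (y >> i) & 1 for i in range(WIDTH)]    # bit generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(WIDTH)]  # bit propagate
    carry = carry_in
    for block in range(WIDTH // BLOCK):
        lo = BLOCK * block
        p0, p1, p2, p3 = p[lo:lo + BLOCK]
        g0, g1, g2, g3 = g[lo:lo + BLOCK]
        block_p = p3 & p2 & p1 & p0                                      # PB4
        block_g = g3 | (g2 & p3) | (g1 & p3 & p2) | (g0 & p3 & p2 & p1)  # GB4
        carry = block_g | (block_p & carry)   # same form as CO = GB4 or (PB4 and C12)
    return carry

print(lookahead_carry_out(0xFFFF, 0x0001))  # 1
print(lookahead_carry_out(0x7FFF, 0x0001))  # 0
```

Because only the carry-out is needed for the comparison, the sum bits never have to be formed along this path.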
Before implementing the primitive-based GCDSAD circuit (Optimized GCDSAD), a version using
the built-in (ABS) function, which does the same job as the SAD unit, was tried in order
to see how the compiler maps such a function to hardware. Also, a simple
GCDSAD was designed using high-level component definitions. ABS_GCD gave a
notably better result in terms of speed, while the simple GCDSAD was slightly better in
terms of area. (TABLE 11 & 12) compare ABSGCD, Simple GCDSAD (SGCDSAD), and Optimized
GCDSAD (OGCDSAD).
TABLE 11: MAPPING REPORT: GCDSAD
HW Stat.    | Total  | ABSGCD Used | ABSGCD % | SGCDSAD Used | SGCDSAD % | OGCDSAD Used | OGCDSAD %
# Slices    | 3,758  | 22          | 0.59%    | 20           | 0.53%     | 18           | 0.48%
# LUTs      | 15,032 | 73          | 0.49%    | 62           | 0.41%     | 59           | 0.39%
# MUXCYs    | 7,516  | 52          | 0.69%    | 16           | 0.21%     | 16           | 0.21%
# Registers | 30,064 | 34          | 0.11%    | 33           | 0.11%     | 41           | 0.14%
# DSP       | 38     | 0           | 0.00%    | 0            | 0.00%     | 0            | 0.00%
# IOBs      | 226    | 52          | 23.01%   | 52           | 23.01%    | 52           | 23.01%
TABLE 12: SYNTHESIS AND TIMING REPORT - GCDSAD
Macros Statistics       | ABS   | SSAD  | OSAD
# 16-bit Add/Sub/Acc    | 2     | 1     | 0
# Registers             | 33    | 33    | 33
# 2-1 MUX (1, 16-bit)   | 5     | 5     | 3
# XOR                   | 15    | 33    | 0
# DSP                   | 0     | 0     | 0
# FSM                   | 1     | 1     | 0
Time Element (ns)       | ABS   | SSAD  | OSAD
R to R Paths            | 5.40  | 10.92 | 8.40
In to R Paths           | 3.19  | 3.19  | 2.96
R to Out Paths          | 5.80  | 13.11 | 3.63
In to Out Paths         | 0     | 0     | 0
Total Time Delay        | 14.39 | 27.22 | 14.99
Overview of the Results
Recalling all the design architectures and their area/time figures, this section presents
the conclusions in numbers and charts (TABLE 13, Fig. 22, and Fig. 23).
Summary of the results for the different architectures
TABLE 13: OVERVIEW OF THE RESULTS (O; OPTIMIZED WITH PRIMITIVES)
HW Stat.    | Total  | FOR Used | FOR %  | ASM Used | ASM %  | OGCD2SUB Used | OGCD2SUB % | OGCDSAD Used | OGCDSAD %
# Slices    | 3,758  | 2,697    | 71.77% | 21       | 0.56%  | 15            | 0.40%      | 18           | 0.48%
# LUTs      | 15,032 | 8,776    | 58.38% | 58       | 0.39%  | 51            | 0.34%      | 59           | 0.39%
# MUXCYs    | 7,516  | 4,760    | 63.33% | 48       | 0.64%  | 32            | 0.43%      | 16           | 0.21%
# Registers | 30,064 | 0        | 0.00%  | 33       | 0.11%  | 51            | 0.17%      | 41           | 0.14%
# IOBs      | 226    | 48       | 21.24% | 52       | 23.01% | 52            | 23.01%     | 52           | 23.01%
Macros Statistics           | FOR    | ASM   | OGCD2SUB | OGCDSAD
# 16-bit Add/Sub/Acc        | 198    | 2     | 0        | 0
# Registers                 | 0      | 33    | 33       | 33
# 16-bit Comparators (=,>)  | 199    | 2     | 0        | 0
# 2-1 MUX (1, 16-bit)       | 298    | 8     | 3        | 3
Time Element (ns)           | FOR    | ASM   | OGCD2SUB | OGCDSAD
Register to Register Paths  | 0.00   | 5.07  | 3.31     | 8.40
Input to Register Paths     | 0.00   | 2.96  | 2.96     | 2.96
Register to Out-pad Paths   | 0.00   | 6.59  | 3.67     | 3.63
In-pad to Out-pad Paths     | 478.00 | 0.00  | 0.00     | 0.00
Total Time Delay            | 478.00 | 14.62 | 9.94     | 14.99
The results in a Chart:
Fig.22 Results in a Chart (FOR_LOOP dominated)
The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)
Fig.23 The Area-Delay Product
[Chart series in Fig. 22 and Fig. 23: #Slices, Time Delay, Area*Delay]
Summary and Conclusion
(TABLE 14) summarizes the work that has been done and compares all the
versions of the Euclidean GCD design and its implementation on the Xilinx Spartan-6 FPGA.
TABLE 14: COMPARISON SUMMARY BETWEEN ARCHITECTURES
Level | Architecture | Rank | Area*Delay | Macros & Primitives | Design features & Synthesizability
Behavioural | For Loop | 7 | 1289166 | 198 16-bit Sub, 199 16-bit Comp, 298 16-bit MUX-2x1 | Depends on max. #iterations; works faster in larger devices
Behavioural | ASM2FSM | 4 | 307.02 | 2 16-bit Loadable Accumulators, 33 Registers, 2 16-bit Comp, 2 16-bit and 6 1-bit MUX-2x1 | FSM is the Top Entity; depends on compiler Macros; still no control over placement
Simple Structural | GCD2Sub | 2 | 238.88 | 2 16-bit Loadable Accumulators, 33 Registers, 2 16-bit Comp, 3 1-bit MUX-2x1 | Datapath & FSM Control Units; depends on components def.; still no control over placement
Simple Structural | GCDSAD | 6 | 544.4 | 16-bit Add (w Cin), 33 Registers, 2 16-bit and 3 1-bit MUX-2x1, 1 16-bit and 32 1-bit XOR | DP & CU; depends on components; still no control over placement; utilizes SAD circuits (1 ADD)
Simple Structural | GCDABS | 5 | 316.58 | 16-bit Sub, 16-bit Add, 15 XOR, 33 Registers, 2 16-bit Comp, 2 16-bit and 3 1-bit MUX-2x1 | DP & CU; depends on components and the compiler Macros; still no control over placement
Optimized Structural | GCD2Sub | 1 | 149.1 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU; utilizes Primitives (LUT2,3 & CARRY4); fast; full control over placement
Optimized Structural | GCDSAD | 3 | 269.82 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU; utilizes Primitives (LUT2,3 & CARRY4); smart; full control over placement
The overall results show that the optimized GCD2SUB design has the lowest Area-Delay
product among the models in this project, which means it provides fast computation of
the Euclidean GCD Algorithm while saving a great deal of area. Apart from the slow, limited,
and area-consuming FOR_LOOP GCD model, the other architectures were not too far from the
optimized GCD2SUB model, especially the Simple GCD2Sub and the Optimized GCDSAD. However,
the Optimized GCDSAD could become better than the Simple GCD2Sub because of its full control
over placement, which might make its Area-Delay product significantly better. Furthermore,
some components, such as the FSM Flip-Flops and MUXs, could be implemented using primitives
and placed efficiently in order to reduce the Area-Delay product further.
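The Area*Delay figures in TABLE 14 are consistent with multiplying the number of occupied slices by the total time delay in nanoseconds. The Python sketch below recomputes them from the numbers reported in the mapping and timing tables; the definition is inferred from those numbers rather than stated in the report, and the dictionary layout is only for illustration.

```python
# (occupied slices, total time delay in ns), copied from the report's tables
RESULTS = {
    "FOR_LOOP": (2697, 478.00),
    "ASM2FSM": (21, 14.62),
    "Simple GCD2SUB": (16, 14.93),
    "Simple GCDSAD": (20, 27.22),
    "GCDABS": (22, 14.39),
    "Optimized GCD2SUB": (15, 9.94),
    "Optimized GCDSAD": (18, 14.99),
}

def area_delay(slices: int, delay_ns: float) -> float:
    """Area*Delay taken as slices times total delay (inferred definition)."""
    return round(slices * delay_ns, 2)

# Rank the architectures from the smallest to the largest product
ranking = sorted(RESULTS, key=lambda name: area_delay(*RESULTS[name]))
for rank, name in enumerate(ranking, start=1):
    print(rank, name, area_delay(*RESULTS[name]))
```

The resulting order matches the rank column of the table, with the optimized GCD2SUB first and the FOR_LOOP model last.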
Final Thoughts and Suggestions
The Euclidean GCD Algorithm design journey has brought great experience, from
infinite loop to loop limitations, then thought RTL behavioural architecture to the structural
architecture, to the optimized design, where, Primitives and Macros were utilized to reduce
the Area-Delay product of the design. Pro's & Con's of the main architectures can be:
Behavioural
- Apart from “Design Strategies & Goals,” there is no control at any level over the implemented
circuit or the placement and routing of the design.
✤ It is a high-level coding approach, which is easier to write and manage.
Structural
- The design can be much more complex than the behavioural one, especially with Primitives.
✤ Utilizing Primitives & Macros efficiently gives full control over the placement.
✤ Separating the data-path and control units allows for better optimization.
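The separation of data-path and control unit listed above can be illustrated with a small software model. The Python sketch below is only an assumption-laden illustration, not the report's VHDL: the state names `IDLE`, `RUN`, `DONE` and the function names `datapath`, `control`, and `run_gcd` are hypothetical, and one loop pass stands in for one clock cycle.

```python
from enum import Enum


class State(Enum):  # hypothetical state names, not taken from the report
    IDLE = 0
    RUN = 1
    DONE = 2


def datapath(a, b, load, a_in, b_in):
    """One clock edge of the data-path: load the inputs, or subtract the smaller from the larger."""
    if load:
        return a_in, b_in
    if a > b:
        return a - b, b
    if b > a:
        return a, b - a
    return a, b


def control(state, start, equal):
    """Next-state logic of the control unit (a small FSM)."""
    if state is State.IDLE:
        return State.RUN if start else State.IDLE
    if state is State.RUN:
        return State.DONE if equal else State.RUN
    return State.IDLE


def run_gcd(a_in, b_in):
    """Clock both units together; return (gcd, number of subtract cycles).

    Inputs are assumed to be positive integers.
    """
    if a_in <= 0 or b_in <= 0:
        raise ValueError("inputs must be positive")
    state, a, b, subtractions = State.IDLE, 0, 0, 0
    while state is not State.DONE:
        equal = a == b               # status signal: data-path -> control
        load = state is State.IDLE   # control signal: control -> data-path
        next_state = control(state, start=True, equal=equal)
        if load or not equal:
            if not load:
                subtractions += 1
            a, b = datapath(a, b, load, a_in, b_in)
        state = next_state
    return a, subtractions
```

For example, `run_gcd(48, 18)` returns `(6, 4)`. Because only one subtraction happens per modelled clock cycle, the hardware cost stays fixed while the cycle count grows with the inputs: `run_gcd(511, 2)` needs 256 subtract cycles.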
It is important to note that utilizing primitives and macros efficiently helps to reduce the
jumps through interconnections and to maintain a logical, consistent data flow in the design.
For instance, a 16-bit Carry-Look-Ahead Subtractor (CLASub) is assumed to be faster than a
ripple-carry subtractor. However, utilizing the CARRY4 primitive to benefit from the dedicated
carry chain, with the help of the propagation function (i.e., the Half-Adder SUM, an XOR), gives an
Area-Delay product about 10 times better than a full CLASub built with primitives.
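The idea of feeding a carry chain with an XOR propagate function can be checked with a bit-level software model. The Python sketch below models the general LUT-plus-carry-chain scheme for subtraction (propagate signal from an XOR, a carry multiplexer, and a sum XOR per bit); it is an assumption about the scheme, not the report's VHDL, and the name `carry_chain_sub` is hypothetical. A CARRY4 block covers four bits at a time, whereas this model simply loops over single bits.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def carry_chain_sub(a, b, width=WIDTH):
    """Compute a - b bit by bit in the style of a LUT feeding a carry chain.

    Per bit:
      * the LUT produces the propagate signal p = a_i XOR (NOT b_i);
      * the carry multiplexer forwards the carry-in when p is 1, otherwise a_i;
      * the sum XOR gate outputs p XOR carry-in.
    Returns (difference modulo 2**width, carry_out); carry_out == 1 means a >= b.
    """
    carry = 1  # a carry-in of 1 completes the two's complement of b
    diff = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p = a_i ^ (b_i ^ 1)            # propagate (half-adder sum of a_i and NOT b_i)
        diff |= (p ^ carry) << i       # sum bit
        carry = carry if p else a_i    # carry multiplexer
    return diff, carry
```

For example, `carry_chain_sub(5, 3)` returns `(2, 1)` and `carry_chain_sub(3, 5)` returns `(65534, 0)`. The carry-out doubles as an `a >= b` flag, which is why a subtractor built this way can also serve the comparison step of the GCD loop.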
Finally, it is fair to mention that ASM2FSM, Simple GCD2SUB, and Simple GCDSAD were all also
implemented with the DSP enforced as a primitive. The time delay in all cases was not promising,
and the occupied area inside the FPGA was greater than when using the CLB slices. However, it
seems possible to utilize the DSP itself to perform the whole computation of the Euclidean GCD
Algorithm and so benefit from its features. This might be a reasonable suggestion for future work
on GCD design on FPGA, in addition to learning more about the tools and their helpful features.

Bibliography
1. Wikipedia, (2014). Greatest common divisor. [online] Available at:
http://en.wikipedia.org/wiki/Greatest_common_divisor [Accessed 15 Dec. 2014].

2. EE254L – GCD (University of Southern California): Subject Lab Manual.
http://www-classes.usc.edu/engr/ee-s/254/ee254l_lab_manual/.

3. Lesson 93 - Example 63: GCD Algorithm - VHDL while Statement [A tutorial on datapaths and state
machines for computing the GCD / While Loops; accompanies the book Digital Design Using Digilent FPGA
Boards]. (Nov 2012). LBEbooks. Retrieved from https://www.youtube.com/watch?v=DMSaYhD1GkM.

4. “Spartan-6 Family Overview (v2.0),” Xilinx, 2011.
http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf.

5. “Spartan-6 FPGA Configurable Logic Block, UG384 (v1.1),” Xilinx, 2010.
http://www.xilinx.com/support/documentation/user_guides/ug384.pdf.

6. “Spartan-6 Libraries Guide for HDL Designs, UG615 (v 14.1),” Xilinx, 2012.
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.

7. “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, UG687 (v 13.4),” Xilinx, 2012.
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/xst_v6s6.pdf.

8. “ISE In-Depth Tutorial, UG695 (v 12.1),” Xilinx, 2009.
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.

9. Sima, M. (2014). ELEC669 Reconfigurable Computing. [Lecture Notes]

10. Devi, R., Singh, J. and Singh, M. (2011). VHDL Implementation of GCD Processor with Built in Self Test
Feature. International Journal of Computer Applications, 25(2), pp.50-54.

11. C.P, N. and M. Ravi Kumar, K. (2014). Efficient Comparator based Sum of Absolute Differences Architecture
for Digital Image Processing Applications. International Journal of Computer Applications, 96(4), pp.17-24.

12. TechOnlineIndia, (2014). An introduction to FPGA timing analysis. [online] Available at:
http://www.techonlineindia.com/techonline/news_and_analysis/170126/introduction-fpga-timing-analysis.

GCD-FPGA-Based-DesignE

  • 1. GCD FPGA-Based Design Ibrahim Hazmi - V00835716
 Design and Implementation of the Euclidean Algorithm for Computing the Greatest Common Divisor using ELEC569A Project Final Report (Fall, 2014) Supervised by Dr. Mihai Sima
  • 2. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Contents GCD FPGA-Based Design i Contents ii List of Figures iii List of Tables iv Visual Executive Summary 1 Introduction 2 Background of GCD and Euclidean Algorithm 2 Overview of Spartan-6 FPGA 4 Project Description and Milestones 7 Detailed Description of the Design 9 Behavioural Level: WHILE/FOR LOOP 9 Behavioural Level: From AMS to FSM (ASM2FSM) 13 Structural level: GCD Data-Path and Control Units (GCD2SUB) 16 Structural level: GCD with Sum of Absolute Difference (GCDSAD) 20 Overview of the Results 22 Summary of the results for the different architectures 22 The results in a Chart: 23 The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD) 23 Summary and Conclusion 24 Final Thoughts and Suggestions 25 Bibliography 26 IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 ii
  • 3. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 List of Figures Fig.1 Prime Factorization method for finding the GCD of two integers 2 Fig.2 Euclidean Algorithm 3 Fig.3 Simplified Euclidean GCD Algorithm 3 Fig.4 XC6SLX25 Floor-plan View in PlanAhead 5 Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead 6 Fig.6 The Design Strategy Window from Xilinx Project Navigator 8 Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm 9 Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP 10 Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model 11 Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model 11 Fig.11 From ASM GCD to Finite State Diagram 13 Fig.12 The Reduced Finite State Diagram with VHDL Code 13 Fig.13 The Behavioural Simulation of ASM2FSM Model 14 Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model 14 Fig.15 Block Diagram of the “Original” GCD Data-Path 16 Fig.16 Block Diagram of the Modified GCD Data-Path 16 Fig.17 The Control Unit (FSM) with VHDL Code 17 Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3) 17 Fig.18.b Primitives: CARRY4 Fast Carry-Chain 18 Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model 18 Fig.20 GCD with Sum of Absolute Difference (GCDSAD) 20 Fig.21 Carry-out Generation Functions for SAD 20 Fig.22 Results in a Chart (FOR_LOOP dominated) 23 Fig.23 The Area-Delay Product 23 IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 iii
  • 4. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 List of Tables Table 1: Xilinx spartan-6 FPGA Feature Summary [4], [5] 4 Table 2: Slice features of spartan-6 Family (including XC6SLX25) [5] 6 Table 3: Project Milestones 8 Table 4: Mapping Report of the FOR_LOOP GCD 12 Table 5: Synthesis and Timing Report of the FOR_LOOP GCD 12 Table 6: Mapping Report: ASM2FSM GCD vs. FOR_LOOP GCD 15 Table 7: Synthesis and Timing Report - ASM2FSM GCD vs. FOR_LOOP GCD 15 Table 9: Mapping Report: Optimized vs. Simple GCD2SUB 19 Table 10: Synthesis and Timing Report - Optimized vs. Simple GCD2SUB 19 Table 11: Mapping Report: GCDSAD 21 Table 12: Synthesis and Timing Report - GCDSAD 21 Table 13: Overview of The results (O; Optimized with Primitives) 22 Table 14: Comparison Summary between Architectures 24 IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 iv
  • 5. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Visual Executive Summary
 IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 1 The main idea of this project is to design a Digital Circuit that calculates the GCD of two 16- bit unsigned integer numbers using Euclidean Algorithm and Implement it on Xilinx Spartan6 FPGA using different techniques/architectures. The first attempt was to see how far the compiler goes with the behavioural loop that represents Euclidean Algorithm. Because the tools kept copying the hardware inside the loop all the time, a massive area of the FPGA was occupied and the number of iterations was limited. Thus, an RTL behavioural architecture was implemented, in which only one iteration can run per each clock cycle. The compiler still have the freedom for placement and routing with the aid of “Design and Goals Strategies”. Then, the design was built structurally, by port-mapping all functions of the previous design as components, to see how the compiler is going to utilize the FPGA differently from the behavioural one. The structural model consists of two parts: GCD data-path unit and GCD control unit (FSM). Another version of the structural design was created as an attempt to adapt the idea of the “Sum of Absolute Difference (SAD)” in order to have only one subtraction instead of two. Finally, Spartan-6 Primitives and Macros were utilized to reduce the Area-Delay product of the design, and the optimized GCD with two subtractors has been proved to give the minimum Area-Delay product among all other design architectures.
  • 6. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Introduction Implementing mathematical calculations on hardware platforms such as FPGA is quite more challenging than performing them in a software environment, where the hardware itself is already equipped with the calculation data-path and control unit for almost infinite number of algorithms and arithmetic operations. Behind this pain of the hardware implementation is a priceless gain in terms of performance as there is a great opportunity to utilize smaller area, obtain higher speed, consume less power, or get a reasonable combination of all of these. Calculating The greatest common divisor (GCD), is one of the problems that need number of steps in order to be solved correctly. These steps can be transformed into an iterative algorithm such as Euclidean algorithm, which makes the computation understandable and traceable. This section is divided into three parts; a brief mathematical background about the GCD computation, an overview of Xilinx Spartan-6 FPGA, and an outline of the project description highlighting its objective and milestones. Background of GCD and Euclidean Algorithm The greatest common divisor (GCD) of two positive integers is the largest integer that divides both numbers without a remainder [2]. It is also know as Greatest Common Factors (GCF), Greatest Common Measure (GCM), Highest Common Divisor (HCD), or Highest Common Factor (HCF) [1]. GCD can be computed by determining the prime factors of both numbers, then multiplying the common prime factors. Practically, this method is not feasible for great numbers. (Fig. 1) shows an example of how prime factorization method works. Fig.1 Prime Factorization method for finding the GCD of two integers IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 2
  • 7. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 An efficient method for solving GCD problems is Euclidean algorithm, which is based on the fact that the GCD of two numbers divides the remainder of the division between them: It is an iterative process (Fig. 2), that takes a number of cycles to compute the GCD. Divisions are done iteratively until rn = 0, is obtained, then, the GCD = rn-1. Fig.2 Euclidean Algorithm As division is simply a subtraction, it was observed that the GCD of two numbers also divides their difference [1], in which the design and implementation of the circuit gets easier. The flowchart in (Fig. 3) illustrates this simple computation process of the GCD. So, it is obvious that the circuit should include subtraction and comparison units, in addition to registers and multiplexers for data update in each iteration cycle. Fig.3 Simplified Euclidean GCD Algorithm IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 3 gcd(a,b) = gcd(b,r) where, a = qb + r gcd(a,b) = gcd(b,(a − b)) = gcd(a,(b − a)) gcd(a,b) = gcd(b,r1) gcd(b,r1) = gcd(r1,r2 ) !
  • 8. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Overview of Spartan-6 FPGA From the previous part, the chosen FPGA should have some properties to accommodate the design units efficiently. For example, subtraction/addition may take advantage of some dedicated components in the FPGA slices such as ripple carry-chain or DSP. Spartan-6 FPGA family from Xilinx provides the designers with such components, which would help a great deal in designing GCD circuit in different levels. “The thirteen-member family delivers expanded densities ranging from 3,840 to 147,443 logic cells, with half the power consumption of previous Spartan families, and faster, more comprehensive connectivity,” [4]. (TABLE 1) shows a feature summary of some devices from this family; the smallest (XC6SLX4), the largest (XC6SLX150T), and the choice of this project (XC6SLX25), which was the smallest member of the family to accommodate the first design, i.e., the reference one. More about this device and the reference design is presented later in this report. TABLE 1: XILINX SPARTAN-6 FPGA FEATURE SUMMARY [4], [5] In the next two pages, the Floor-plan views in PlanAhead for the device XC6SLX25 help to illustrate the internal construction of the chosen device. (Fig. 4) is a full-scale floor-plan view that shows the device layout indicating some important elements such as IOB cells, CLBs, DSP, and block RAM columns. Device Logic
 Cells CLB DSP Slices RAM Blocks User I/OSlices FFs RAM (kb) LUT6 XC6SLX4 3,840 600 4,800 75 2,400 8 12 132 XC6SLX25 24,051 3,758 30,046 229 15,032 38 52 226 XC6SLX150T 147,443 23,038 184,304 1,355 92,152 180 268 540 IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 4
  • 9. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Fig.4 XC6SLX25 Floor-plan View in PlanAhead In (Fig. 5), a closer view of the layout reveals the three different slices inside the Configurable Logic Block (CLB) surrounding a DSP block. It is clear from the figure that Each CLB contains two slices, one of them is SLICEX and the other one is either SLICEL or SLICEM. (TABLE 2) presents important slice features of the XC6SLX25 device. IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 5 ` Memory Controller Block Block RAM Column Clock Management Tile Column DSP Column CLB Cell IOB Cells
  • 10. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead TABLE 2: SLICE FEATURES OF SPARTAN-6 FAMILY (INCLUDING XC6SLX25) [5] IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 6 DSP Flip- Flops Carry- Chain Storage LUTs LUT6 Slices SLICEX SLICEL SLICEM 6-Input LUTs √ √ √ 8 Flip-flops √ √ √ Wide Multiplexers √ √ Carry Logic √ √ Distributed RAM √ Shift Registers √
  • 11. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Project Description and Milestones The objective of this project is to design a Digital Circuit that calculates the GCD of two 16-bit unsigned integer numbers using Euclidean Algorithm and Implement it on Xilinx Spartan6 FPGA. In this project, the Euclidean GCD circuit was implemented using different architectures in order to examine the tradeoff between area and speed, i.e., Area-Delay product, and decide which design is more implementable in terms of dedicated configuration components inside the FPGA. The first step was to implement a simple behavioural loop, i. e., a direct interpretation of the Euclidean GCD Algorithm, using FOR_LOOP to see how the compiler would represent a large number of iterations. Considering this design reference, the next step was to implement the Euclidean GCD circuit in the following levels: A. RTL Behavioural level, where the design is simply a Finite State Machine (FSM) that performs the GCD calculation sequentially as a lower level of interpreting the Algorithm. In this case, the compiler was free to translate the operations into different units/ components and place all these components and rout all the connections with the aid of “Design and Goals Strategies” optimization. B. Structural level, where the data flow of the GCD Algorithm is transformed into an arithmetic circuit, i.e., data-path unit (DP), and the iteration process is attained by a simple control unit (CU), FSM basically. The design was built abstractly transforming all functions, such as comparison, subtraction, and data transfer, to components and port- mapping them in a top level entity, to see how the compiler would utilize the FPGA differently from the behavioural implementation. In this level, “Sum of Absolute Difference,” (SAD) circuit has been introduced in order to replace the two subtraction units with one computation unit. 
Finally, some functions were designed utilizing Primitives, e.g., Look-Up-Tables “LUTs,” Carry-Chains, Flip-Flops, Exclusive-Ores, and/or DSPs, and Macros, e.g., ADDSUB macro, with which the calculation unit has been optimized in terms of occupied area inside the FPGA. IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 7
  • 12. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 (TABLE 3) is an outline of the project milestones, and how much approximately was achieved during the journey of the ELEC569A course. TABLE 3: PROJECT MILESTONES In the next section all of the above stages will be presented and discussed in sequence. It is helpful, by this point, to mention that the target is to obtain minimum Area-Delay product, which could be achieved by reducing the area and/or the time delay of the circuit. By determining optimization gaol to be area (Fig. 6), smaller area of the FPGA will be utilized in order to reduce Area*Delay. At the same time, it might also lead to higher speed, assuming that the smaller area is obtained, the fewer jumps through interconnections is needed. Fig.6 The Design Strategy Window from Xilinx Project Navigator
 IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 8 Description Done Design the Simple Behavioural Loop and examine its aspects and limitations 100 Design the Behavioural FSM Model and test its features and margins 100 Design the direct Structural Model (DP+CU) and compare it with the behavioural 100 Design the Optimized Structural Model (SAD) and compare it with the direct structural 100 Get into Primitive Level and utilize the dedicated elements for faster computation 100 Report the Area/Delay Comparison between all the architectures and propose suggestions 100
  • 13. ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014 Detailed Description of the Design As mentioned in the previous section, a simple behavioural loop was implemented first, in order to see how the compiler understands loops and how it deals with a large number of iterations. Then, the results of this design, as a reference, kept pushing towards trying different architectures in order to obtain smaller area and less jump through the interconnection hoops. Behavioural Level: WHILE/FOR LOOP Starting with the direct WHILE_LOOP, that represents the Euclidean GCD Algorithm (Fig. 7), the result tells a lot about how the system treats loops. It was clear that the compiler has just copied the corresponding circuit along the way until the loop ends. It was not too surprising that the compiler did not synthesize the (While) function, simply, because it generates an infinite loop, which means the number of the circuit copies is infinity. In hardware world, infinity does not usually exist, it needs to be a finite number. Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 9 While (A /= B) Then If (A > B) Then A := A - B; Else B := B - A; End If; End Loop; GCD <= B;
  • 14. Thus, the transition to the finite FOR_LOOP (Fig. 8) was the natural next step, where the maximum number of iterations must be defined from the beginning. Fixing the number of iterations before the GCD computation even starts limits the design: every input pair that needs more iterations than the bound returns zero.

Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP

    For i in 1 to 100 Loop
        If (A /= B) Then
            If (A > B) Then
                A := A - B;
            Else
                B := B - A;
            End If;
        Else
            GCD <= B;
        End If;
    End Loop;

Before going through the design and implementation results of this model and proceeding to the other levels, it is essential to point out that the WHILE_LOOP Model is a direct translation of the Euclidean GCD Algorithm. Therefore, all the following design effort can be considered an attempt to produce a synthesizable version of the WHILE_LOOP Model. The number of iterations in the FOR_LOOP Model was set to 100, which means that for any two numbers that require more than a hundred iterations to compute their GCD (e.g., 511 and 2), the result will be zero. The behavioural and Post-Route simulations of this design are shown in Fig. 9, including the delay for some example inputs.
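The limitation of the bounded loop can be reproduced with a short software model. This is a sketch in Python rather than the report's VHDL; the name `gcd_for_loop` is hypothetical, and the model assumes the GCD output stays at zero when it is never assigned, which is the behaviour the text describes.

```python
def gcd_for_loop(a: int, b: int, max_iter: int = 100) -> int:
    """Model of the FOR_LOOP in Fig. 8 with a fixed iteration bound."""
    gcd = 0  # assumption: the output defaults to zero when never assigned
    for _ in range(max_iter):
        if a != b:
            if a > b:
                a -= b
            else:
                b -= a
        else:
            gcd = b  # only reached once A equals B inside the bound
    return gcd


print(gcd_for_loop(48, 18))  # converges well inside the bound
print(gcd_for_loop(511, 2))  # needs more than 100 iterations, stays at zero
```

Note that in this model the equality has to be observed in a later iteration than the last subtraction, so a pair that needs exactly 100 subtractions also returns zero.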
  • 15. Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model (panels: Behavioural Simulation, Post-Route Simulation)

Furthermore, the generated circuit was very complex because the system converted the loop into a massive number of components. It simply copied the comparators, subtractors, and multiplexers about a hundred times, with no registers at all. Fig. 10 shows the RTL, Technology Schematic, and Floor-plan (from PlanAhead) of the FOR_LOOP Model.

Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
  • 16. TABLE 4 highlights the huge number of units that are mapped to satisfy the FOR_LOOP design requirements, whereas TABLE 5 summarizes the Synthesis report, including Timing.

TABLE 4: MAPPING REPORT OF THE FOR_LOOP GCD

| Hardware Statistics | XC6SLX25 Total | Used | % |
|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% |
| # LUTs | 15,032 | 8,776 | 58.38% |
| # MUXCYs | 7,516 | 4,760 | 63.33% |
| # Registers | 30,064 | 0 | 0.00% |
| # DSP | 38 | 0 | 0.00% |
| # IOBs | 226 | 48 | 21.24% |

TABLE 5: SYNTHESIS AND TIMING REPORT OF THE FOR_LOOP GCD

| Macros Statistics | Count |
|---|---|
| # 16-bit Add/Sub/Acc | 198 |
| # Registers | 0 |
| # 16-bit Comparators (=,>) | 199 |
| # 2-1 Multiplexers (16-bit) | 298 |
| # XOR | 0 |
| # DSP | 0 |
| # FSM | 0 |

| Time Element | ns |
|---|---|
| Register to Register Paths | 0 |
| Input to Register Paths | 0 |
| Register to Out-pad Paths | 0 |
| In-pad to Out-pad Paths | 478 |
| Total Time Delay | 478 |

In this model, nothing more can be done except changing the maximum number of iterations, which affects the performance (i.e., a higher maximum number of iterations gives poorer latency). In principle the design should be fast (see the behavioural simulation), since it is a purely parallel design. Yet the huge circuitry forces the signals to jump through many interconnections. Finally, this model works faster in larger devices such as the XC6SLX150.
  • 17. Behavioural Level: From ASM to FSM (ASM2FSM)

Recalling the “Simplified Euclidean GCD Algorithm” in Fig. 3, it can be considered an Algorithmic State Machine (ASM) that describes the behaviour of the GCD circuit. The three-state FSM is then an RTL implementation of the ASM circuit (Fig. 11).

Fig.11 From ASM GCD to Finite State Diagram

The basic “States Reduction” rule is then applied (S1 => S2). The new FSM and a sample of the code are shown in Fig. 12.

Fig.12 The Reduced Finite State Diagram with VHDL Code

    WHEN S0 =>
        IF (Start = '1') THEN
            NextState <= S1;
        ELSE
            NextState <= S0;
        END IF;
    WHEN S1 =>
        AM <= AS; BM <= BS;
        IF (AR = BR) THEN
            GCD <= BR;
            NextState <= S0;
        ELSIF (AR > BR) THEN
            EnA <= '1';
            EnB <= '0';
            NextState <= S1;
        ELSE
            EnA <= '0';
            EnB <= '1';
            NextState <= S1;
        END IF;

    AS <= AR - BR;
    BS <= BR - AR;
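The behaviour of the reduced FSM can be sketched as a cycle-by-cycle software model. The Python below is an illustration only: the function name `gcd_fsm`, the assumption that the operands are loaded into AR and BR when Start is seen in S0, and the way clock cycles are counted are all assumptions, not details taken from the report.

```python
def gcd_fsm(a: int, b: int, width: int = 16) -> tuple[int, int]:
    """Two-state model (S0 idle/load, S1 compute); returns (gcd, clock cycles)."""
    if a <= 0 or b <= 0:
        raise ValueError("operands must be positive")  # assumption: zero never converges
    mask = (1 << width) - 1
    state, ar, br = "S0", 0, 0
    gcd, cycles = None, 0
    start = True  # Start is held high in this model
    while gcd is None:
        cycles += 1
        if state == "S0":
            if start:
                ar, br = a & mask, b & mask  # assumed load of both registers
                state = "S1"
        else:  # state S1
            if ar == br:
                gcd = br
                state = "S0"
            elif ar > br:
                ar = (ar - br) & mask  # EnA = '1', EnB = '0'
            else:
                br = (br - ar) & mask  # EnA = '0', EnB = '1'
    return gcd, cycles


print(gcd_fsm(48, 18))
print(gcd_fsm(511, 2))
```

Unlike the unrolled FOR_LOOP, one subtractor pair is reused every clock cycle, so the iteration count affects only the latency and not the area.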
  • 18. Fig. 13 highlights the Behavioural Simulation results of the ASM2FSM Model, while Fig. 14 shows the RTL, Technology Schematic, and Floor-plan (from PlanAhead).

Fig.13 The Behavioural Simulation of ASM2FSM Model

Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
  • 19. Again, TABLE 6 and TABLE 7 present the Mapping and Synthesis reports, including Timing, and indicate a dramatic improvement in terms of both area and speed.

TABLE 6: MAPPING REPORT: ASM2FSM GCD VS. FOR_LOOP GCD

| HW Statistics | Total | FOR_LOOP | FOR_LOOP % | ASM2FSM | ASM2FSM % |
|---|---|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% | 21 | 0.56% |
| # LUTs | 15,032 | 8,776 | 58.38% | 58 | 0.39% |
| # MUXCYs | 7,516 | 4,760 | 63.33% | 48 | 0.64% |
| # Registers | 30,064 | 0 | 0.00% | 33 | 0.11% |
| # DSP | 38 | 0 | 0.00% | 0 | 0.00% |
| # IOBs | 226 | 48 | 21.24% | 52 | 23.01% |

TABLE 7: SYNTHESIS AND TIMING REPORT - ASM2FSM GCD VS. FOR_LOOP GCD

| Macros Statistics | FOR | ASM |
|---|---|---|
| # 16-bit Add/Sub/Acc | 198 | 2 |
| # Registers | 0 | 33 |
| # 16-bit Comparators (=,>) | 199 | 2 |
| # 2-1 Multiplexers (1, 16-bit) | 298 | 8 |
| # XOR | 0 | 0 |
| # DSP | 0 | 0 |
| # FSM | 0 | 1 |

| Time Element (ns) | FOR | ASM |
|---|---|---|
| Register to Register Paths | 0 | 5.07 |
| Input to Register Paths | 0 | 2.96 |
| Register to Out-pad Paths | 0 | 6.59 |
| In-pad to Out-pad Paths | 478 | 0 |
| Total Time Delay | 478 | 14.62 |

The results obtained with the ASM2FSM model are far better than those of the FOR_LOOP model, considering the huge area saving and the ability to compute the GCD with a very large number of iterations.
  • 20. Structural level: GCD Data-Path and Control Units (GCD2SUB)

The next step was to build a data-path for the computation unit, as shown in the block diagram in Fig. 15.

Fig.15 Block Diagram of the “Original” GCD Data-Path

Because the comparison unit can be implemented as a subtractor, the natural choice was to use the CARRY_OUT signals of the subtractors in Fig. 15 as the AGB and ALB signals (Fig. 16).

Fig.16 Block Diagram of the Modified GCD Data-Path
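The idea of reading the comparison result from the subtractors' carry-out can be checked with a small bit-level model. The sketch below is Python, not the report's VHDL, and it assumes two's-complement subtraction (A + NOT B + 1) and a particular polarity for deriving AGB and ALB from the carry-outs; the report does not spell out the polarity, so treat the mapping as an assumption. The helper names are hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def sub_with_carry(x: int, y: int) -> tuple[int, int]:
    """Return (x - y mod 2**WIDTH, carry_out); carry_out is 1 exactly when x >= y."""
    total = x + ((~y) & MASK) + 1
    return total & MASK, total >> WIDTH


def datapath_flags(a: int, b: int) -> dict:
    """Derive the comparison flags from the two subtractors, with no comparator."""
    a_minus_b, co_ab = sub_with_carry(a, b)
    b_minus_a, co_ba = sub_with_carry(b, a)
    agb = 1 - co_ba          # assumed polarity: B - A borrows, so A > B
    alb = 1 - co_ab          # assumed polarity: A - B borrows, so A < B
    aeb = 1 - (agb | alb)    # AEB = AGB NOR ALB, as in the control unit
    return {"A_minus_B": a_minus_b, "B_minus_A": b_minus_a,
            "AGB": agb, "ALB": alb, "AEB": aeb}


print(datapath_flags(48, 18))
```

With this arrangement the two differences and all three flags come out of the same pair of subtractors, which is why no separate comparator macro is needed in the structural design.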
  • 21. The “FSM block” in Fig. 16 refers to the Control Unit (Fig. 17) that drives the control signals of the GCD data-path (i.e., the registers' enable signals). It is important to note that the MUXs' select signals are driven directly by the signals AGB and AEB, whereas for the REGs' enable signals, smaller (1-bit) MUXs were built by the control unit.

Fig.17 The Control Unit (FSM) with VHDL Code

    AEB <= AGB NOR ALB;
    Finish <= AEB;

    WHEN S0 =>
        IF (Start = '1') THEN
            NextState <= S1;
        ELSE
            NextState <= S0;
        END IF;
    WHEN S1 =>
        EnA <= AGB;
        EnB <= ALB;
        IF (AEB = '1') THEN
            NextState <= S1;
        ELSE
            NextState <= S0;
        END IF;

In this model, the subtractors are the bottleneck of the design because they perform the subtraction and the comparison at the same time. They must be fast enough that their results are ready before the next clock edge. Therefore, the fast CARRY4 primitive (Fig. 18.b), which utilizes the dedicated carry chain in SliceL and SliceM inside the Spartan-6 FPGA, was adopted in the design to perform faster subtraction. Furthermore, LUT2 and LUT3 macros were used to implement some logic functions such as AND, XOR, and the multiplexer (Fig. 18.a).

Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)

    MUX2x1_inst : LUT3  -- MUX 2x1
    GENERIC MAP (INIT => X"AC")
    PORT MAP (O => O, I2 => S, I1 => A, I0 => B);

    XOR_inst : LUT2  -- A XOR B
    GENERIC MAP (INIT => X"6")
    PORT MAP (O => P, I1 => A, I0 => B);

    FF_inst : FDCE  -- Flip-Flop
    GENERIC MAP (INIT => '0')
    PORT MAP (Q => Q, C => C, CE => CE, CLR => CLR, D => D);

    AND_inst : LUT2  -- A AND B
    GENERIC MAP (INIT => X"8")
    PORT MAP (O => P, I1 => A, I0 => B);
  • 22. Fig.18.b Primitives: CARRY4 Fast Carry-Chain

    CARRY4_inst : CARRY4
    PORT MAP (
        CO     => CO,   -- 4-bit carry out
        O      => Sub,  -- 4-bit carry chain XOR data out
        CI     => '1',  -- 1-bit carry cascade input
        CYINIT => '1',  -- 1-bit carry initialization
        DI     => A,    -- 4-bit carry-MUX data in
        S      => P);   -- 4-bit carry-MUX select input

Fig. 19 shows the RTL, Technology Schematic, and Floor-plan of GCD2SUB.

Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
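How a carry chain of this kind produces a subtraction can be illustrated with a bit-level software model. The Python sketch below models each chain stage as a carry multiplexer (carry-out is the carry-in when the select bit is 1, otherwise the data-in bit) and an XOR for the sum bit. It assumes the select input is the propagate signal A XOR (NOT B), the data input is A, and the initial carry is 1; four 4-bit blocks are cascaded for 16 bits. The function names are hypothetical and the model is not a substitute for the Xilinx primitive.

```python
def carry4(ci: int, di: int, s: int) -> tuple[int, int]:
    """4-bit carry block model: returns (O, CO), each a 4-bit value."""
    o, co, c = 0, 0, ci
    for i in range(4):
        s_i = (s >> i) & 1
        d_i = (di >> i) & 1
        o |= (s_i ^ c) << i        # sum bit: select XOR incoming carry
        c = c if s_i else d_i      # carry mux: propagate carry or take data-in
        co |= c << i
    return o, co


def subtract16(a: int, b: int) -> tuple[int, int]:
    """A - B over 16 bits using four cascaded blocks; returns (difference, carry_out)."""
    p = (a ^ ~b) & 0xFFFF          # assumed select: propagate = A XOR (NOT B)
    c, result = 1, 0               # initial carry of 1 completes the two's complement
    for k in range(4):
        o, co = carry4(c, (a >> (4 * k)) & 0xF, (p >> (4 * k)) & 0xF)
        result |= o << (4 * k)
        c = (co >> 3) & 1          # top carry of this block feeds the next block
    return result, c


print(subtract16(48, 18))
print(subtract16(18, 48))
```

The final carry-out is 1 exactly when A is greater than or equal to B, which is the signal the data-path reuses for comparison.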
  • 23. The Mapping and Synthesis reports (TABLE 9 & 10) confirm the expected results: the optimized design is clearly both faster and smaller.

TABLE 9: MAPPING REPORT: OPTIMIZED VS. SIMPLE GCD2SUB

| HW Statistics | Total | Simple GCD2SUB | Simple % | Optimized GCD2SUB | Optimized % |
|---|---|---|---|---|---|
| # Slices | 3,758 | 16 | 0.43% | 15 | 0.40% |
| # LUTs | 15,032 | 53 | 0.35% | 51 | 0.34% |
| # MUXCYs | 7,516 | 52 | 0.69% | 32 | 0.43% |
| # Registers | 30,064 | 33 | 0.11% | 51 | 0.17% |
| # DSP | 38 | 0 | 0.00% | 0 | 0.00% |
| # IOBs | 226 | 52 | 23.01% | 52 | 23.01% |

TABLE 10: SYNTHESIS AND TIMING REPORT - OPTIMIZED VS. SIMPLE GCD2SUB

| Macros Statistics | S2S | O2S |
|---|---|---|
| # 16-bit Add/Sub/Acc | 2 | 0 |
| # Registers | 33 | 33 |
| # 16-bit Comparators (=,>) | 2 | 0 |
| # 2-1 Multiplexers (1, 16-bit) | 3 | 3 |
| # XOR | 0 | 0 |
| # DSP | 0 | 0 |
| # FSM | 1 | 1 |

| Time Element (ns) | S2S | O2S |
|---|---|---|
| Register to Register Paths | 5.14 | 3.31 |
| Input to Register Paths | 3.07 | 2.96 |
| Register to Out-pad Paths | 6.72 | 3.67 |
| In-pad to Out-pad Paths | 0 | 0 |
| Total Time Delay | 14.93 | 9.94 |

The comparison is between two versions of GCD2SUB: the Optimized version (O2S), which uses primitives and macros, and a Simple version (S2S) built from high-level components (i.e., “-” for subtraction, “Select” for the multiplexer, etc.; even the comparator was defined in this version). Although the tool is capable of optimizing macros well, the designer can utilize the dedicated primitives and macros for more efficient optimization.
  • 24. Structural level: GCD with Sum of Absolute Difference (GCDSAD)

The Sum of Absolute Difference (SAD) circuit replaces the two subtractors by using the Carry-Out Generation Function (Fig. 20 & 21). It was expected to give a better result than GCD2SUB because it uses fewer components and produces fewer outputs.

Fig.20 GCD with Sum of Absolute Difference (GCDSAD)

Fig.21 Carry-out Generation Functions for SAD

    GPBLOCK: FOR i IN 1 TO (N/4) GENERATE
        PB4(i) <= P(4*i-1) AND P(4*i-2) AND P(4*i-3) AND P(4*i-4);
        GB4(i) <= G(4*i-1) OR (G(4*i-2) AND P(4*i-1))
                  OR (G(4*i-3) AND P(4*i-1) AND P(4*i-2))
                  OR (G(4*i-4) AND P(4*i-1) AND P(4*i-2) AND P(4*i-3));
    END GENERATE GPBLOCK;

    PB8 <= PB4(1) AND PB4(2);
    GB8 <= GB4(2) OR (GB4(1) AND PB4(2));
    GN  <= G(3) OR (G(2) AND P(3))
           OR (G(1) AND P(3) AND P(2))
           OR (G(0) AND P(3) AND P(2) AND P(1));
    CO  <= GB4(4) OR (PB4(4) AND C12);
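The block generate/propagate equations of Fig. 21 can be exercised in software to see how a single carry-out decides the sign of the difference. The Python sketch below is a model under stated assumptions: G is taken as A AND (NOT B), P as A XOR (NOT B), the carry into the lowest block is 1, and the block carries are rippled from one 4-bit block to the next (the role C12 plays for the last block). How the report's SAD circuit forms the final absolute value is not described, so the `abs_diff` step is illustrative, and both function names are hypothetical.

```python
N = 16


def carry_out_lookahead(a: int, b: int) -> int:
    """Carry-out of A - B using 4-bit block generate (GB4) and propagate (PB4)."""
    nb = ~b & 0xFFFF
    g_vec = a & nb                 # assumed bit generate
    p_vec = a ^ nb                 # assumed bit propagate

    def G(k: int) -> int:
        return (g_vec >> k) & 1

    def P(k: int) -> int:
        return (p_vec >> k) & 1

    c = 1                          # carry-in of 1 for two's-complement subtraction
    for i in range(1, N // 4 + 1):
        pb4 = P(4*i-1) & P(4*i-2) & P(4*i-3) & P(4*i-4)
        gb4 = (G(4*i-1) | (G(4*i-2) & P(4*i-1))
               | (G(4*i-3) & P(4*i-1) & P(4*i-2))
               | (G(4*i-4) & P(4*i-1) & P(4*i-2) & P(4*i-3)))
        c = gb4 | (pb4 & c)        # same form as CO <= GB4(4) OR (PB4(4) AND C12)
    return c


def abs_diff(a: int, b: int) -> int:
    """|A - B| selected by the carry-out, so only one subtraction result is needed."""
    return (a - b) & 0xFFFF if carry_out_lookahead(a, b) else (b - a) & 0xFFFF


print(carry_out_lookahead(48, 18), abs_diff(48, 18))
print(carry_out_lookahead(18, 48), abs_diff(18, 48))
```

Because the carry-out alone tells which operand is larger, the data-path can update whichever register holds the larger value with the single absolute difference.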
  • 25. Before implementing the primitive version of the GCDSAD circuit (Optimized GCDSAD), the built-in function ABS, which does the same job as GCDSAD, was tried in order to see how the compiler maps such a function to hardware. In addition, a simple GCDSAD was designed using high-level component definitions. ABSGCD gave a significant result in terms of speed, while the simple GCDSAD was slightly better in terms of area. TABLE 11 and TABLE 12 compare ABSGCD, the Simple GCDSAD (SGCDSAD), and the Optimized GCDSAD (OGCDSAD).

TABLE 11: MAPPING REPORT: GCDSAD

| HW Stat. | Total | ABSGCD | ABSGCD % | SGCDSAD | SGCDSAD % | OGCDSAD | OGCDSAD % |
|---|---|---|---|---|---|---|---|
| # Slices | 3,758 | 22 | 0.59% | 20 | 0.53% | 18 | 0.48% |
| # LUTs | 15,032 | 73 | 0.49% | 62 | 0.41% | 59 | 0.39% |
| # MUXCYs | 7,516 | 52 | 0.69% | 16 | 0.21% | 16 | 0.21% |
| # Registers | 30,064 | 34 | 0.11% | 33 | 0.11% | 41 | 0.14% |
| # DSP | 38 | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% |
| # IOBs | 226 | 52 | 23.01% | 52 | 23.01% | 52 | 23.01% |

TABLE 12: SYNTHESIS AND TIMING REPORT - GCDSAD

| Macros Statistics | ABS | SSAD | OSAD |
|---|---|---|---|
| # 16-bit Add/Sub/Acc | 2 | 1 | 0 |
| # Registers | 33 | 33 | 33 |
| # 2-1 MUX (1, 16-bit) | 5 | 5 | 3 |
| # XOR | 15 | 33 | 0 |
| # DSP | 0 | 0 | 0 |
| # FSM | 1 | 1 | 0 |

| Time Element (ns) | ABS | SSAD | OSAD |
|---|---|---|---|
| R to R Paths | 5.40 | 10.92 | 8.40 |
| In to R Paths | 3.19 | 3.19 | 2.96 |
| R to Out Paths | 5.80 | 13.11 | 3.63 |
| In to Out Paths | 0 | 0 | 0 |
| Total Time Delay | 14.39 | 27.22 | 14.99 |
  • 26. Overview of the Results

Recalling all the design architectures and their area/time figures, this section presents the conclusion in numbers and charts (TABLE 13, Fig. 22, and Fig. 23).

Summary of the results for the different architectures

TABLE 13: OVERVIEW OF THE RESULTS (O: OPTIMIZED WITH PRIMITIVES)

| HW Stat. | Total | FOR | FOR % | ASM | ASM % | OGCD2SUB | OGCD2SUB % | OGCDSAD | OGCDSAD % |
|---|---|---|---|---|---|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% | 21 | 0.56% | 15 | 0.40% | 18 | 0.48% |
| # LUTs | 15,032 | 8,776 | 58.38% | 58 | 0.39% | 51 | 0.34% | 59 | 0.39% |
| # MUXCYs | 7,516 | 4,760 | 63.33% | 48 | 0.64% | 32 | 0.43% | 16 | 0.21% |
| # Registers | 30,064 | 0 | 0.00% | 33 | 0.11% | 51 | 0.17% | 41 | 0.14% |
| # IOBs | 226 | 48 | 21.24% | 52 | 23.01% | 52 | 23.01% | 52 | 23.01% |

| Macros Statistics | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|
| # 16-bit Add/Sub/Acc | 198 | 2 | 0 | 0 |
| # Registers | 0 | 33 | 33 | 33 |
| # 16-bit Comparators (=,>) | 199 | 2 | 0 | 0 |
| # 2-1 MUX (1, 16-bit) | 298 | 8 | 3 | 3 |

| Time Element (ns) | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|
| Register to Register Paths | 0.00 | 5.07 | 3.31 | 8.40 |
| Input to Register Paths | 0.00 | 2.96 | 2.96 | 2.96 |
| Register to Out-pad Paths | 0.00 | 6.59 | 3.67 | 3.63 |
| In-pad to Out-pad Paths | 478.00 | 0.00 | 0.00 | 0.00 |
| Total Time Delay | 478.00 | 14.62 | 9.94 | 14.99 |
  • 27. The results in a Chart:

Fig.22 Results in a Chart (FOR_LOOP dominated). Plotted series: #Slices, Time Delay, Area*Delay.

The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)

Fig.23 The Area-Delay Product. Plotted series: #Slices, Time Delay, Area*Delay.
  • 28. Summary and Conclusion

TABLE 14 summarizes the work that has been done and compares all the versions of the Euclidean GCD design and their implementation on the Xilinx Spartan-6 FPGA.

TABLE 14: COMPARISON SUMMARY BETWEEN ARCHITECTURES

| Level | Architecture | Rank | Area*Delay | Macros & Primitives | Design features & Synthesizability |
|---|---|---|---|---|---|
| Behavioural | For Loop | 7 | 1289166 | 198 16-bit Sub, 199 16-bit Comp, 298 16-bit MUX-2x1 | Depends on max. #iterations; works faster in larger devices |
| Behavioural | ASM2FSM | 4 | 307.02 | 2 16-bit Loadable Accumulators, 33 Registers, 2 16-bit Comp, 2 16-bit and 6 1-bit MUX-2x1 | FSM is the Top Entity; depends on compiler Macros; still no control over placement |
| Simple Structural | GCD2Sub | 2 | 238.88 | 2 16-bit Loadable Accumulators, 33 Registers, 2 16-bit Comp, 3 1-bit MUX-2x1 | Datapath & FSM Control Units; depends on components def.; still no control over placement |
| Simple Structural | GCDSAD | 6 | 544.4 | 16-bit Add (w Cin), 33 Registers, 2 16-bit and 3 1-bit MUX-2x1, 1 16-bit and 32 1-bit XOR | DP & CU; depends on components; still no control over placement; utilizes SAD circuits (1 ADD) |
| Simple Structural | GCDABS | 5 | 316.58 | 16-bit Sub, 16-bit Add, 15 XOR, 33 Registers, 2 16-bit Comp, 2 16-bit and 3 1-bit MUX-2x1 | DP & CU; depends on components and the compiler Macros; still no control over placement |
| Optimized Structural | GCD2Sub | 1 | 149.1 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU; utilizes Primitives (LUT2,3 & CARRY4); Fast; full control over placement |
| Optimized Structural | GCDSAD | 3 | 269.82 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU; utilizes Primitives (LUT2,3 & CARRY4); Smart; full control over placement |

The overall results show that the optimized GCD2SUB design has the lowest Area-Delay product of all the models in this project, which means it provides fast computation of the Euclidean GCD Algorithm while saving a great deal of area. Apart from the slow, limited, and area-consuming FOR_LOOP GCD model, the other architectures were not too far from the GCD2SUB model, especially the Simple GCD2Sub and the Optimized GCDSAD. However, the Optimized GCDSAD could become better than the Simple GCD2Sub because of the full control over placement, which might make its Area-Delay product significantly better. Furthermore, some components, such as the FSM flip-flops and MUXs, could be implemented using primitives and placed efficiently in order to reduce the Area-Delay product further.
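The Area*Delay figures in TABLE 14 are consistent with multiplying the slice count by the total time delay reported for each design. The short Python sketch below recomputes them from the numbers in the mapping and timing tables; the dictionary layout and names are only for illustration.

```python
# (slices, total time delay in ns) taken from the mapping and timing tables
designs = {
    "FOR_LOOP": (2697, 478.00),
    "ASM2FSM": (21, 14.62),
    "Simple GCD2SUB": (16, 14.93),
    "Simple GCDSAD": (20, 27.22),
    "ABSGCD": (22, 14.39),
    "Optimized GCD2SUB": (15, 9.94),
    "Optimized GCDSAD": (18, 14.99),
}

# Area-Delay product = number of slices x total time delay
adp = {name: round(slices * delay, 2) for name, (slices, delay) in designs.items()}

# Rank from best (smallest product) to worst
ranking = sorted(adp, key=adp.get)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, adp[name])
```

Sorting by the product reproduces the ranking column of the table, with the Optimized GCD2SUB first and the FOR_LOOP model last.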
  • 29. Final Thoughts and Suggestions

The Euclidean GCD Algorithm design journey has brought great experience: from the infinite loop to the loop limitations, then through the RTL behavioural architecture to the structural architecture, and finally to the optimized design, where primitives and macros were utilized to reduce the Area-Delay product of the design. The pros (✤) and cons (-) of the main architectures are:

Behavioural
- Apart from “Design Strategies & Goals,” there is no control at any level over the implemented circuit or the placement and routing of the design.
✤ It is a high-level coding approach, which is easier to write and manage.

Structural
- The design can be much more complex than the behavioural one, especially with primitives.
✤ Utilizing primitives and macros efficiently gives full control over the placement.
✤ Keeping the data-path and control units separate allows for better optimization.

It is important to note that utilizing primitives and macros efficiently helps to reduce the jumps through interconnections and to maintain a logical and consistent data flow in the design. For instance, a 16-bit Carry-Look-Ahead Subtractor (CLASub) is assumed to be faster than the ripple-carry one. However, utilizing the CARRY4 primitive to benefit from the dedicated carry chain, with the help of the propagation function (i.e., the Half-Adder SUM, an XOR), gives an Area-Delay product about 10 times better than using a full CLASub with primitives.

Finally, it is fair to mention that ASM2FSM, Simple GCD2SUB, and Simple GCDSAD were all also implemented with the use of the DSP enforced as a primitive. The time delay in all cases was not promising, and the occupied area inside the FPGA was greater than when using the CLB slices. However, it seems possible to utilize the DSP itself in order to benefit from its features and perform the whole computation of the Euclidean GCD Algorithm. This might be a reasonable suggestion for future work related to GCD design on FPGA, in addition to learning more about the tools and their helpful features.
  • 30. Bibliography

1. Wikipedia, (2014). Greatest common divisor. [online] Available at: http://en.wikipedia.org/wiki/Greatest_common_divisor [Accessed 15 Dec. 2014].
2. EE254L - GCD (University of Southern California): Subject Lab Manual: http://www-classes.usc.edu/engr/ee-s/254/ee254l_lab_manual/.
3. Lesson 93 - Example 63: GCD Algorithm - VHDL while Statement [A tutorial on datapaths and state machines for computing the GCD / While Loops; accompanies the book Digital Design Using Digilent FPGA Boards]. (Nov 2012). LBEbooks. Retrieved from https://www.youtube.com/watch?v=DMSaYhD1GkM.
4. “Spartan-6 Family Overview (v2.0),” Xilinx, 2011. http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf.
5. “Spartan-6 FPGA Configurable Logic Block, UG384 (v1.1),” Xilinx, 2010. http://www.xilinx.com/support/documentation/user_guides/ug384.pdf.
6. “Spartan-6 Libraries Guide for HDL Designs, UG615 (v 14.1),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.
7. “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, UG687 (v 13.4),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/xst_v6s6.pdf.
8. “ISE In-Depth Tutorial, UG695 (v 12.1),” Xilinx, 2009. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.
9. Sima, M. (2014). ELEC669 Reconfigurable Computing. [Lecture Notes].
10. Devi, R., Singh, J. and Singh, M. (2011). VHDL Implementation of GCD Processor with Built in Self Test Feature. International Journal of Computer Applications, 25(2), pp.50-54.
11. C.P, N. and M. Ravi Kumar, K. (2014). Efficient Comparator based Sum of Absolute Differences Architecture for Digital Image Processing Applications. International Journal of Computer Applications, 96(4), pp.17-24.
12. TechOnlineIndia, (2014). An introduction to FPGA timing analysis. [online] Available at: http://www.techonlineindia.com/techonline/news_and_analysis/170126/introduction-fpga-timing-analysis.