GCD FPGA-Based Design
Ibrahim Hazmi - V00835716 
Design and Implementation of the
Euclidean Algorithm for Computing
the Greatest Common Divisor using
ELEC569A Project Final Report (Fall, 2014)
Supervised by Dr. Mihai Sima

ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014
Contents
GCD FPGA-Based Design
Contents
List of Figures
List of Tables
Visual Executive Summary
Introduction
  Background of GCD and Euclidean Algorithm
  Overview of Spartan-6 FPGA
  Project Description and Milestones
Detailed Description of the Design
  Behavioural Level: WHILE/FOR LOOP
  Behavioural Level: From ASM to FSM (ASM2FSM)
  Structural level: GCD Data-Path and Control Units (GCD2SUB)
  Structural level: GCD with Sum of Absolute Difference (GCDSAD)
Overview of the Results
  Summary of the results for the different architectures
  The results in a Chart
  The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)
Summary and Conclusion
Final Thoughts and Suggestions
Bibliography
IBRAHIM HAZMI - IHAZ@UVIC.CA V00835716 ii

List of Figures
Fig.1 Prime Factorization method for finding the GCD of two integers
Fig.2 Euclidean Algorithm
Fig.3 Simplified Euclidean GCD Algorithm
Fig.4 XC6SLX25 Floor-plan View in PlanAhead
Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead
Fig.6 The Design Strategy Window from Xilinx Project Navigator
Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm
Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP
Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model
Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
Fig.11 From ASM GCD to Finite State Diagram
Fig.12 The Reduced Finite State Diagram with VHDL Code
Fig.13 The Behavioural Simulation of ASM2FSM Model
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
Fig.15 Block Diagram of the “Original” GCD Data-Path
Fig.16 Block Diagram of the Modified GCD Data-Path
Fig.17 The Control Unit (FSM) with VHDL Code
Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)
Fig.18.b Primitives: CARRY4 Fast Carry-Chain
Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
Fig.20 GCD with Sum of Absolute Difference (GCDSAD)
Fig.21 Carry-out Generation Functions for SAD
Fig.22 Results in a Chart (FOR_LOOP dominated)
Fig.23 The Area-Delay Product

List of Tables
Table 1: Xilinx Spartan-6 FPGA Feature Summary [4], [5]
Table 2: Slice features of Spartan-6 Family (including XC6SLX25) [5]
Table 3: Project Milestones
Table 4: Mapping Report of the FOR_LOOP GCD
Table 5: Synthesis and Timing Report of the FOR_LOOP GCD
Table 6: Mapping Report: ASM2FSM GCD vs. FOR_LOOP GCD
Table 7: Synthesis and Timing Report - ASM2FSM GCD vs. FOR_LOOP GCD
Table 9: Mapping Report: Optimized vs. Simple GCD2SUB
Table 10: Synthesis and Timing Report - Optimized vs. Simple GCD2SUB
Table 11: Mapping Report: GCDSAD
Table 12: Synthesis and Timing Report - GCDSAD
Table 13: Overview of The results (O; Optimized with Primitives)
Table 14: Comparison Summary between Architectures

Visual Executive Summary 
The main idea of this project is to design a digital circuit that calculates the GCD of two
16-bit unsigned integers using the Euclidean Algorithm, and to implement it on a Xilinx
Spartan-6 FPGA using different techniques/architectures. The first attempt was to see how far
the compiler goes with the behavioural loop that represents the Euclidean Algorithm. Because
the tools replicated the hardware inside the loop for every iteration, a massive area of the
FPGA was occupied and the number of iterations was limited. Thus, an RTL behavioural
architecture was implemented, in which only one iteration runs per clock cycle. The compiler
still has freedom in placement and routing, with the aid of the “Design Goals and Strategies”
settings. Then, the design was built structurally, by port-mapping all functions of the
previous design as components, to see how the compiler utilizes the FPGA differently from the
behavioural version. The structural model consists of two parts: a GCD data-path unit and a
GCD control unit (FSM). Another version of the structural design was created as an attempt to
adapt the idea of the “Sum of Absolute Difference (SAD)” in order to use only one subtraction
instead of two. Finally, Spartan-6 Primitives and Macros were utilized to reduce the
Area-Delay product of the design, and the optimized GCD with two subtractors proved to give
the minimum Area-Delay product among all the design architectures.

Introduction
Implementing mathematical calculations on hardware platforms such as FPGAs is considerably
more challenging than performing them in a software environment, where the hardware itself
is already equipped with the data-path and control unit for an almost unlimited number of
algorithms and arithmetic operations. The pain of the hardware implementation is rewarded by
a gain in performance, as there is a great opportunity to utilize a smaller area, obtain
higher speed, consume less power, or get a reasonable combination of all of these.
Calculating the greatest common divisor (GCD) is one of the problems that needs a number of
steps in order to be solved correctly. These steps can be transformed into an iterative
algorithm such as the Euclidean algorithm, which makes the computation understandable and
traceable. This section is divided into three parts: a brief mathematical background on GCD
computation, an overview of the Xilinx Spartan-6 FPGA, and an outline of the project
description highlighting its objective and milestones.
Background of GCD and Euclidean Algorithm
The greatest common divisor (GCD) of two positive integers is the largest integer that
divides both numbers without a remainder [2]. It is also known as the Greatest Common Factor
(GCF), Greatest Common Measure (GCM), Highest Common Divisor (HCD), or Highest
Common Factor (HCF) [1]. The GCD can be computed by determining the prime factors of both
numbers, then multiplying the common prime factors. In practice, this method is not feasible
for large numbers. (Fig. 1) shows an example of how the prime factorization method works.
Fig.1 Prime Factorization method for finding the GCD of two integers
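The prime factorization method can be sketched in software as follows. This is only an illustrative Python model; the function names `prime_factors` and `gcd_by_factoring` are hypothetical and the trial-division factorizer is an assumption chosen for brevity, which is also why the method does not scale to large numbers.

```python
from collections import Counter

def prime_factors(n):
    """Return the prime factorization of n as a Counter {prime: exponent}."""
    factors = Counter()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] += 1
            n //= d
        d += 1
    if n > 1:
        factors[n] += 1  # whatever is left is itself prime
    return factors

def gcd_by_factoring(a, b):
    """Multiply the prime factors common to a and b."""
    # Counter intersection keeps the minimum exponent of each shared prime.
    common = prime_factors(a) & prime_factors(b)
    result = 1
    for prime, exponent in common.items():
        result *= prime ** exponent
    return result
```

For example, 48 = 2^4 * 3 and 180 = 2^2 * 3^2 * 5 share 2^2 * 3, so `gcd_by_factoring(48, 180)` gives 12.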

An efficient method for solving GCD problems is the Euclidean algorithm, which is based
on the fact that the GCD of two numbers also divides the remainder of their division:
\[ \gcd(a, b) = \gcd(b, r), \quad \text{where } a = qb + r \]
It is an iterative process (Fig. 2) that takes a number of cycles to compute the GCD:
\[ \gcd(a, b) = \gcd(b, r_1), \quad \gcd(b, r_1) = \gcd(r_1, r_2), \quad \ldots \]
Divisions are repeated until a remainder \( r_n = 0 \) is obtained; the GCD is then \( r_{n-1} \).
Fig.2 Euclidean Algorithm
As division is simply repeated subtraction, and the GCD of two numbers also divides their
difference [1], the remainder can be replaced by a difference, which makes the design and
implementation of the circuit easier:
\[ \gcd(a, b) = \begin{cases} \gcd(b,\, a - b), & a > b \\ \gcd(a,\, b - a), & b > a \end{cases} \]
The flowchart in (Fig. 3) illustrates this simple computation process of the GCD. It follows
that the circuit should include subtraction and comparison units, in addition to registers
and multiplexers for updating the data in each iteration cycle.
Fig.3 Simplified Euclidean GCD Algorithm
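The two formulations of the Euclidean algorithm can be compared with a short software model. The following Python sketch is illustrative only and is not the hardware description used in the project; the function names are hypothetical.

```python
def gcd_division(a, b):
    """Euclidean algorithm using remainders: gcd(a, b) = gcd(b, a mod b)."""
    while b != 0:
        a, b = b, a % b
    return a

def gcd_subtraction(a, b):
    """Simplified Euclidean algorithm using only comparison and subtraction."""
    if a <= 0 or b <= 0:
        raise ValueError("both inputs must be positive integers")
    while a != b:
        if a > b:
            a -= b  # replace the larger operand by the difference
        else:
            b -= a
    return b
```

Both functions return 21 for the pair (1071, 462); the subtraction form needs more iterations but only the operations named in the text (compare, subtract, update).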

Overview of Spartan-6 FPGA
Following from the previous part, the chosen FPGA should have some properties that
accommodate the design units efficiently. For example, subtraction/addition may take
advantage of dedicated components in the FPGA slices such as the ripple carry-chain or the
DSP blocks. The Spartan-6 FPGA family from Xilinx provides designers with such components,
which helps a great deal in designing the GCD circuit at different levels. “The
thirteen-member family delivers expanded densities ranging from 3,840 to 147,443 logic cells,
with half the power consumption of previous Spartan families, and faster, more comprehensive
connectivity,” [4]. (TABLE 1) shows a feature summary of some devices from this family: the
smallest (XC6SLX4), the largest (XC6SLX150T), and the choice for this project (XC6SLX25),
which was the smallest member of the family that could accommodate the first design, i.e.,
the reference one. More about this device and the reference design is presented later in
this report.
TABLE 1: XILINX SPARTAN-6 FPGA FEATURE SUMMARY [4], [5]
| Device | Logic Cells | CLB Slices | CLB FFs | CLB RAM (kb) | CLB LUT6 | DSP Slices | RAM Blocks | User I/O |
|---|---|---|---|---|---|---|---|---|
| XC6SLX4 | 3,840 | 600 | 4,800 | 75 | 2,400 | 8 | 12 | 132 |
| XC6SLX25 | 24,051 | 3,758 | 30,064 | 229 | 15,032 | 38 | 52 | 226 |
| XC6SLX150T | 147,443 | 23,038 | 184,304 | 1,355 | 92,152 | 180 | 268 | 540 |
On the next two pages, the Floor-plan views in PlanAhead for the XC6SLX25 device help
to illustrate the internal construction of the chosen device. (Fig. 4) is a full-scale
floor-plan view that shows the device layout, indicating some important elements such as IOB
cells, CLBs, DSP, and block RAM columns.

Fig.4 XC6SLX25 Floor-plan View in PlanAhead
(Fig. 4 labels: Memory Controller Block, Block RAM Column, Clock Management Tile Column, DSP Column, CLB Cell, IOB Cells)
In (Fig. 5), a closer view of the layout reveals the three different slice types inside the
Configurable Logic Blocks (CLBs) surrounding a DSP block. It is clear from the figure that each
CLB contains two slices: one is a SLICEX and the other is either a SLICEL or a SLICEM.
(TABLE 2) presents important slice features of the XC6SLX25 device.

Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead
(Fig. 5 labels: DSP, Flip-Flops, Carry-Chain, Storage LUTs, LUT6)
TABLE 2: SLICE FEATURES OF SPARTAN-6 FAMILY (INCLUDING XC6SLX25) [5]
| Feature | SLICEX | SLICEL | SLICEM |
|---|---|---|---|
| 6-Input LUTs | √ | √ | √ |
| 8 Flip-flops | √ | √ | √ |
| Wide Multiplexers | | √ | √ |
| Carry Logic | | √ | √ |
| Distributed RAM | | | √ |
| Shift Registers | | | √ |

Project Description and Milestones
The objective of this project is to design a digital circuit that calculates the GCD of two
16-bit unsigned integers using the Euclidean Algorithm, and to implement it on a Xilinx
Spartan-6 FPGA. The Euclidean GCD circuit was implemented using different architectures in
order to examine the tradeoff between area and speed, i.e., the Area-Delay product, and to
decide which design makes the best use of the dedicated configurable components inside the
FPGA. The first step was to implement a simple behavioural loop, i.e., a direct
interpretation of the Euclidean GCD Algorithm, using FOR_LOOP to see how the compiler would
represent a large number of iterations. Taking this design as the reference, the next step
was to implement the Euclidean GCD circuit at the following levels:
A. RTL Behavioural level, where the design is simply a Finite State Machine (FSM) that
performs the GCD calculation sequentially, as a lower-level interpretation of the Algorithm.
In this case, the compiler was free to translate the operations into different units/
components, place all these components, and route all the connections with the aid of the
“Design Goals and Strategies” optimization.
B. Structural level, where the data flow of the GCD Algorithm is transformed into an
arithmetic circuit, i.e., a data-path unit (DP), and the iteration process is handled by a
simple control unit (CU), basically an FSM. The design was built by transforming all
functions, such as comparison, subtraction, and data transfer, into components and
port-mapping them in a top-level entity, to see how the compiler would utilize the FPGA
differently from the behavioural implementation. At this level, a “Sum of Absolute
Difference” (SAD) circuit was introduced in order to replace the two subtraction units
with one computation unit. Finally, some functions were designed utilizing Primitives,
e.g., Look-Up Tables (LUTs), Carry-Chains, Flip-Flops, Exclusive-ORs, and/or DSPs, and
Macros, e.g., the ADDSUB macro, with which the calculation unit was optimized in terms
of occupied area inside the FPGA.

(TABLE 3) is an outline of the project milestones, and approximately how much of each was
achieved during the ELEC569A course.
TABLE 3: PROJECT MILESTONES
| Description | Done (%) |
|---|---|
| Design the Simple Behavioural Loop and examine its aspects and limitations | 100 |
| Design the Behavioural FSM Model and test its features and margins | 100 |
| Design the direct Structural Model (DP+CU) and compare it with the behavioural | 100 |
| Design the Optimized Structural Model (SAD) and compare it with the direct structural | 100 |
| Get into Primitive Level and utilize the dedicated elements for faster computation | 100 |
| Report the Area/Delay Comparison between all the architectures and propose suggestions | 100 |
In the next section all of the above stages are presented and discussed in sequence. It
is helpful, at this point, to mention that the target is to obtain the minimum Area-Delay
product, which can be achieved by reducing the area and/or the time delay of the circuit. By
setting the optimization goal to area (Fig. 6), a smaller area of the FPGA is utilized, which
reduces Area*Delay. At the same time, it might also lead to higher speed, on the assumption
that a smaller area requires fewer hops through the interconnect.
Fig.6 The Design Strategy Window from Xilinx Project Navigator

Detailed Description of the Design
As mentioned in the previous section, a simple behavioural loop was implemented first,
in order to see how the compiler understands loops and how it deals with a large number of
iterations. The results of this reference design then motivated trying different
architectures in order to obtain a smaller area and fewer hops through the interconnect.
Behavioural Level: WHILE/FOR LOOP
Starting with the direct WHILE_LOOP that represents the Euclidean GCD Algorithm
(Fig. 7), the result tells a lot about how the tools treat loops. It was clear that the
compiler simply replicates the corresponding circuit once per iteration until the loop ends.
It was therefore not too surprising that the compiler did not synthesize the WHILE loop: its
iteration count is not known in advance, which means the number of circuit copies would be
unbounded. In hardware, the number of copies must be finite.
Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm
While (A /= B) Loop
  If (A > B) Then
    A := A - B;
  Else
    B := B - A;
  End If;
End Loop;
GCD <= B;

Thus, the transition to the finite FOR_LOOP (Fig. 8) was the obvious next step, where the
maximum number of iterations must be defined from the beginning. However, fixing the number
of iterations before starting to compute the GCD limits the design: every input pair that
needs more iterations than the limit returns zero.
Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP
Before going through the design and implementation results of this model and
proceeding to the other levels, it is essential to point out that the WHILE_LOOP
Model is a direct transformation of the Euclidean GCD Algorithm. Therefore, all the
following design effort can be considered as an attempt to produce a synthesizable version
of the WHILE_LOOP Model.
The number of iterations in the FOR_LOOP Model was set to 100, which means
that for any two numbers that require more than a hundred iterations to compute their GCD
(e.g., 511 and 2), the result will be zero. The behavioural and Post-Route simulations of
this design are shown in (Fig. 9), including the delay for some input examples.
For i in 1 to 100 Loop
  If (A /= B) Then
    If (A > B) Then
      A := A - B;
    Else
      B := B - A;
    End If;
  Else
    GCD <= B;
  End If;
End Loop;
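The effect of the fixed iteration limit can be reproduced with a small software model of the FOR_LOOP code. The Python sketch below is illustrative only; the names `gcd_for_loop` and `subtraction_steps` are hypothetical, and the assumption is that GCD stays at zero unless the loop reaches A = B.

```python
def gcd_for_loop(a, b, max_iter=100):
    """Software model of the bounded FOR_LOOP: GCD is only assigned once A = B."""
    gcd = 0  # assumed default output when the loop never reaches A = B
    for _ in range(max_iter):
        if a != b:
            if a > b:
                a -= b
            else:
                b -= a
        else:
            gcd = b
    return gcd

def subtraction_steps(a, b):
    """Count the subtractions the simplified Euclidean algorithm needs."""
    steps = 0
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
        steps += 1
    return steps
```

With these definitions, the pair (511, 2) needs 256 subtractions, so `gcd_for_loop(511, 2)` returns 0 with the limit of 100, and only a limit of at least 257 iterations (256 subtractions plus one iteration to assign the result) yields the correct value 1. The worst case for 16-bit inputs, (65535, 1), needs 65534 subtractions.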

Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model
Furthermore, the complexity of the generated circuit was very high, as the tools converted
the loop into a massive number of components. The comparators, subtractors, and multiplexers
were simply copied a hundred times, with no registers at all. (Fig. 10) shows the RTL,
Technology Schematic, and Floor-plan (from PlanAhead) of the FOR_LOOP Model.
Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
(Fig. 9 panels: Behavioural Simulation; Post-Route Simulation)

(TABLE 4) highlights the huge number of units that are mapped to satisfy the FOR_LOOP
design requirements, whereas (TABLE 5) summarizes the Synthesis report including Timing.
TABLE 4: MAPPING REPORT OF THE FOR_LOOP GCD
| Hardware Statistics | XC6SLX25 Total | Total Used | % |
|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% |
| # LUTs | 15,032 | 8,776 | 58.38% |
| # MUXCYs | 7,516 | 4,760 | 63.33% |
| # Registers | 30,064 | 0 | 0.00% |
| # DSP | 38 | 0 | 0.00% |
| # IOBs | 226 | 48 | 21.24% |
TABLE 5: SYNTHESIS AND TIMING REPORT OF THE FOR_LOOP GCD
| Macros Statistics | Count |
|---|---|
| # 16-bit Add/Sub/Acc | 198 |
| # Registers | 0 |
| # 16-bit Comparators (=,>) | 199 |
| # 2-1 Multiplexers (16-bit) | 298 |
| # XOR | 0 |
| # DSP | 0 |
| # FSM | 0 |

| Time Element | ns |
|---|---|
| Register to Register Paths | 0 |
| Input to Register Paths | 0 |
| Register to Out-pad Paths | 0 |
| In-pad to Out-pad Paths | 478 |
| Total Time Delay | 478 |
In this model, nothing further can be done except changing the maximum number of
iterations, which affects the performance (i.e., a higher iteration limit gives a longer
delay). In principle the design should be fast (see the behavioural simulation), as it is a
purely combinational, fully unrolled design. Yet, the huge circuitry increases the number of
hops through the interconnect. Finally, this model works faster in larger devices such as
the XC6SLX150.

Behavioural Level: From ASM to FSM (ASM2FSM)
Recalling the “Simplified Euclidean GCD Algorithm” in (Fig. 3), it can be
considered as an Algorithmic State Machine (ASM) chart that describes the behaviour of the
GCD circuit. The three-state FSM is then an RTL implementation of the ASM (Fig. 11).
Fig.11 From ASM GCD to Finite State Diagram
Applying the basic “state reduction” rule (S1 ⇒ S2) reduces the three-state FSM to two
states. The reduced FSM and a sample of its code are shown in (Fig. 12).
Fig.12 The Reduced Finite State Diagram with VHDL Code
WHEN S0 =>
  IF (Start = '1') THEN
    NextState <= S1;
  ELSE
    NextState <= S0;
  END IF;
WHEN S1 =>
  AM <= AS; BM <= BS;
  IF (AR = BR) THEN
    GCD <= BR;
    NextState <= S0;
  ELSIF (AR > BR) THEN
    EnA <= '1';
    EnB <= '0';
    NextState <= S1;
  ELSE
    EnA <= '0';
    EnB <= '1';
    NextState <= S1;
  END IF;
-- Subtractions feeding the register inputs
AS <= AR - BR;
BS <= BR - AR;
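The behaviour of the reduced FSM, one subtraction per clock cycle, can be approximated by a cycle-level software model. The Python sketch below is illustrative only: the function name `gcd_fsm` and the `max_cycles` guard are hypothetical, and it assumes that AR and BR are already loaded with the inputs and that Start is asserted in the first cycle.

```python
def gcd_fsm(a, b, max_cycles=1 << 17):
    """Cycle-level model of the two-state FSM: S0 waits for Start, S1 iterates.

    Returns (gcd, cycles), where cycles counts clock cycles including the
    idle cycle in S0 and the final cycle that detects AR = BR.
    """
    if a <= 0 or b <= 0:
        raise ValueError("inputs must be positive integers")
    state = "S0"
    ar, br = a, b  # AR/BR registers, assumed to be loaded with the inputs
    start = 1      # Start is assumed to be asserted from the first cycle
    for cycle in range(1, max_cycles + 1):
        if state == "S0":
            if start:
                state = "S1"
        elif ar == br:
            return br, cycle  # GCD <= BR; the FSM would return to S0
        elif ar > br:
            ar -= br          # EnA = '1': register A takes AR - BR
        else:
            br -= ar          # EnB = '1': register B takes BR - AR
    raise RuntimeError("no result within max_cycles")
```

Under these assumptions, `gcd_fsm(48, 18)` finishes in 6 cycles, and the pair (511, 2), which defeats the 100-iteration FOR_LOOP, finishes in 258 cycles with the correct result 1.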

(Fig. 13) highlights the Behavioural Simulation results of the ASM2FSM Model, while
(Fig. 14) shows RTL, Technology Schematic, and Floor-plan (from PlanAhead).
Fig.13 The Behavioural Simulation of ASM2FSM Model
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model

Again, (TABLE 6 & 7) highlight the Mapping and Synthesis reports including Timing,
and indicate a dramatic improvement in terms of both area and speed.
TABLE 6: MAPPING REPORT: ASM2FSM GCD VS. FOR_LOOP GCD
| HW Statistics | Total | FOR_LOOP | ASM2FSM |
|---|---|---|---|
| # Slices | 3,758 | 2,697 (71.77%) | 21 (0.56%) |
| # LUTs | 15,032 | 8,776 (58.38%) | 58 (0.39%) |
| # MUXCYs | 7,516 | 4,760 (63.33%) | 48 (0.64%) |
| # Registers | 30,064 | 0 (0.00%) | 33 (0.11%) |
| # DSP | 38 | 0 (0.00%) | 0 (0.00%) |
| # IOBs | 226 | 48 (21.24%) | 52 (23.01%) |
TABLE 7: SYNTHESIS AND TIMING REPORT - ASM2FSM GCD VS. FOR_LOOP GCD
| Macros Statistics | FOR | ASM |
|---|---|---|
| # 16-bit Add/Sub/Acc | 198 | 2 |
| # Registers | 0 | 33 |
| # 16-bit Comparators (=,>) | 199 | 2 |
| # 2-1 Multiplexers (1, 16-bit) | 298 | 8 |
| # XOR | 0 | 0 |
| # DSP | 0 | 0 |
| # FSM | 0 | 1 |

| Time Element (ns) | FOR | ASM |
|---|---|---|
| Register to Register Paths | 0 | 5.07 |
| Input to Register Paths | 0 | 2.96 |
| Register to Out-pad Paths | 0 | 6.59 |
| In-pad to Out-pad Paths | 478 | 0 |
| Total Time Delay | 478 | 14.62 |
The results obtained with the ASM2FSM model are far better than those of the FOR_LOOP
model, considering the huge area saving and the ability to compute the GCD with a very
large number of iterations.

Structural level: GCD Data-Path and Control Units (GCD2SUB)
The next step was to build a Data-Path for the computation unit, which could be as
shown in the block diagram in (Fig. 15).
Fig.15 Block Diagram of the “Original” GCD Data-Path
Because a comparison can be implemented as a subtraction, it was natural to use the
CARRY_OUT signals of the subtractors in (Fig. 15) as the AGB and ALB signals (Fig. 16).
Fig.16 Block Diagram of the Modified GCD Data-Path
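The idea of deriving the comparison flags from the subtractors' carry-outs can be illustrated with a small software model. The Python sketch below is not the report's VHDL; it assumes two's-complement subtraction (A + NOT B + 1) on 16 bits, in which the carry-out is 1 exactly when no borrow occurs, and the names `sub16` and `compare_flags` are hypothetical.

```python
MASK = 0xFFFF  # 16-bit operands

def sub16(x, y):
    """Return (x - y mod 2^16, carry_out) computed as x + NOT y + 1."""
    total = x + ((~y) & MASK) + 1
    return total & MASK, (total >> 16) & 1

def compare_flags(a, b):
    """Derive AGB, ALB and AEB from the carry-outs of A - B and B - A."""
    _, carry_ab = sub16(a, b)  # 1 when a >= b (no borrow)
    _, carry_ba = sub16(b, a)  # 1 when b >= a (no borrow)
    agb = 1 - carry_ba         # B - A borrowed, so A > B
    alb = 1 - carry_ab         # A - B borrowed, so A < B
    aeb = 1 - (agb | alb)      # AEB <= AGB NOR ALB
    return agb, alb, aeb
```

For example, `compare_flags(5, 3)` gives (1, 0, 0) and `compare_flags(7, 7)` gives (0, 0, 1), so no separate comparator is needed. The polarity of the carry signals in the actual design may differ from this assumption.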

The “FSM block” in (Fig. 16) refers to the Control Unit (Fig. 17) that drives the Control
signals of the GCD data-path (i.e., Registers’ Enable signals). It is important to note that the
MUXs’ select signals are driven by the signals AGB and AEB directly, whereas for the REGs’
enable signals, smaller MUXs (i.e., 1-bit) were built by the control unit.
Fig.17 The Control Unit (FSM) with VHDL Code
AEB <= AGB NOR ALB;
Finish <= AEB;
WHEN S0 =>
  IF (Start = '1') THEN
    NextState <= S1;
  ELSE
    NextState <= S0;
  END IF;
WHEN S1 =>
  EnA <= AGB;
  EnB <= ALB;
  IF (AEB = '1') THEN
    NextState <= S0;   -- A = B: finished, return to idle
  ELSE
    NextState <= S1;   -- keep subtracting
  END IF;
In this model, the subtractors are the bottleneck of the design, as they perform the
subtraction and the comparison at the same time. They must be fast enough for their results
to be ready before the next clock edge. Therefore, the fast CARRY4 primitive (Fig. 18.b),
which utilizes the dedicated Carry-Chain in SLICEL and SLICEM inside the Spartan-6 FPGA, was
adopted in the design to perform faster subtraction. Furthermore, LUT2 and LUT3 Macros were
used to accommodate some logic functions such as AND, XOR, and the multiplexer (Fig. 18.a).
Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)
MUX2x1_inst : LUT3      -- MUX 2x1
  GENERIC MAP (INIT => X"AC")
  PORT MAP (O => O, I2 => S, I1 => A, I0 => B);
XOR_inst : LUT2         -- A XOR B
  GENERIC MAP (INIT => X"6")
  PORT MAP (O => P, I1 => A, I0 => B);
AND_inst : LUT2         -- A AND B
  GENERIC MAP (INIT => X"8")
  PORT MAP (O => P, I1 => A, I0 => B);
FF_inst : FDCE          -- Flip-Flop
  GENERIC MAP (INIT => '0')
  PORT MAP (Q => Q, C => C, CE => CE, CLR => CLR, D => D);
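The INIT values used for the LUT2 and LUT3 instances can be checked with a short truth-table model. The Python sketch below is illustrative; the helper name `lut` is hypothetical, and it assumes the usual Xilinx convention that the LUT output is the INIT bit indexed by the inputs, with I0 as the least significant bit.

```python
def lut(init, *inputs):
    """Evaluate a LUT: inputs are given most significant first (e.g. I2, I1, I0)."""
    index = 0
    for bit in inputs:
        index = (index << 1) | (bit & 1)
    return (init >> index) & 1

# LUT3 with INIT = X"AC" and (I2, I1, I0) = (S, A, B) acts as a 2-to-1 multiplexer.
for s in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            assert lut(0xAC, s, a, b) == (b if s else a)

# LUT2 with INIT = X"6" is XOR, and with INIT = X"8" is AND.
for a in (0, 1):
    for b in (0, 1):
        assert lut(0x6, a, b) == a ^ b
        assert lut(0x8, a, b) == a & b
```

Under this convention, X"AC" selects A when S = 0 and B when S = 1.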

Fig.18.b Primitives: CARRY4 Fast Carry-Chain
CARRY4_inst : CARRY4 PORT MAP (
  CO     => CO,   -- 4-bit carry out
  O      => Sub,  -- 4-bit carry chain XOR data out
  CI     => '1',  -- 1-bit carry cascade input
  CYINIT => '1',  -- 1-bit carry initialization
  DI     => A,    -- 4-bit carry-MUX data in
  S      => P);   -- 4-bit carry-MUX select input
(Fig. 19) shows RTL, Technology Schematic, and Floor-plan of GCD2SUB.
Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
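How a chain of CARRY4 primitives performs a subtraction can be sketched at the bit level. The Python model below is illustrative only and rests on assumptions: each CARRY4 stage is modelled as a carry multiplexer (carry-out = carry-in when S = 1, else DI) plus an XOR for the output, the select input is P = A XOR NOT B, DI = A, the initial carry is 1, and four blocks are cascaded for 16 bits. The function names are hypothetical.

```python
def carry4(ci, di, s):
    """Model one CARRY4 block; di and s are lists of 4 bits, LSB first."""
    o, co = [], []
    carry = ci
    for d, sel in zip(di, s):
        o.append(sel ^ carry)        # XOR stage: sum/difference bit
        carry = carry if sel else d  # carry multiplexer
        co.append(carry)
    return o, co

def subtract16(a, b):
    """Compute A - B on 16 bits with four cascaded CARRY4 blocks."""
    def bits(x):
        return [(x >> i) & 1 for i in range(16)]
    a_bits = bits(a)
    p_bits = bits(a ^ (~b & 0xFFFF))  # propagate: A XOR NOT B
    carry, diff = 1, 0                # carry-in of 1 completes the two's complement
    for k in range(4):
        o, co = carry4(carry, a_bits[4 * k:4 * k + 4], p_bits[4 * k:4 * k + 4])
        for i, bit in enumerate(o):
            diff |= bit << (4 * k + i)
        carry = co[3]                 # cascade CO(3) into the next block
    return diff, carry
```

In this model the final carry is 1 when A >= B, which is the signal the data-path can reuse for the comparison; for instance `subtract16(100, 58)` returns (42, 1).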

The Mapping and Synthesis reports (TABLE 9 & 10) confirm the expected results: the
optimized design is clearly both faster and smaller.
TABLE 9: MAPPING REPORT: OPTIMIZED VS. SIMPLE GCD2SUB
| HW Statistics | Total | Simple GCD2SUB | Optimized GCD2SUB |
|---|---|---|---|
| # Slices | 3,758 | 16 (0.43%) | 15 (0.40%) |
| # LUTs | 15,032 | 53 (0.35%) | 51 (0.34%) |
| # MUXCYs | 7,516 | 52 (0.69%) | 32 (0.43%) |
| # Registers | 30,064 | 33 (0.11%) | 51 (0.17%) |
| # DSP | 38 | 0 (0.00%) | 0 (0.00%) |
| # IOBs | 226 | 52 (23.01%) | 52 (23.01%) |
TABLE 10: SYNTHESIS AND TIMING REPORT - OPTIMIZED VS. SIMPLE GCD2SUB
| Macros Statistics | S2S | O2S |
|---|---|---|
| # 16-bit Add/Sub/Acc | 2 | 0 |
| # Registers | 33 | 33 |
| # 16-bit Comparators (=,>) | 2 | 0 |
| # 2-1 Multiplexers (1, 16-bit) | 3 | 3 |
| # XOR | 0 | 0 |
| # DSP | 0 | 0 |
| # FSM | 1 | 1 |

| Time Element (ns) | S2S | O2S |
|---|---|---|
| Register to Register Paths | 5.14 | 3.31 |
| Input to Register Paths | 3.07 | 2.96 |
| Register to Out-pad Paths | 6.72 | 3.67 |
| In-pad to Out-pad Paths | 0 | 0 |
| Total Time Delay | 14.93 | 9.94 |
The comparison is between two versions of the GCD2SUB: the Optimized version (O2S),
using primitives and macros, and a Simple version (S2S) with high-level components (i.e.,
“-” for subtraction, “Select” for the multiplexer, etc.; even the comparator was defined
explicitly in this version). It was clear that, although the tool optimizes inferred macros
well, the designer can achieve a more efficient result by instantiating the dedicated
Primitives and Macros directly.

Structural level: GCD with Sum of Absolute Difference (GCDSAD)
The Sum of Absolute Difference (SAD) unit replaces the two subtractors using a Carry-Out
Generation Function (Fig. 20 & 21). It was expected to give a better result than GCD2SUB, as
it uses fewer components and produces fewer outputs.
Fig.20 GCD with Sum of Absolute Difference (GCDSAD)
Fig.21 Carry-out Generation Functions for SAD
GPBLOCK: FOR i IN 1 TO (N/4) GENERATE
  PB4(i) <= P(4*i-1) AND P(4*i-2) AND P(4*i-3) AND P(4*i-4);
  GB4(i) <= G(4*i-1) OR (G(4*i-2) AND P(4*i-1))
            OR (G(4*i-3) AND P(4*i-1) AND P(4*i-2))
            OR (G(4*i-4) AND P(4*i-1) AND P(4*i-2) AND P(4*i-3));
END GENERATE GPBLOCK;
PB8 <= PB4(1) AND PB4(2);
GB8 <= GB4(2) OR (GB4(1) AND PB4(2));
GN  <= G(3) OR (G(2) AND P(3))
       OR (G(1) AND P(3) AND P(2))
       OR (G(0) AND P(3) AND P(2) AND P(1));
CO  <= GB4(4) OR (PB4(4) AND C12);
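The carry-out generation functions can be checked against ordinary addition with a short software model. The Python sketch below is an illustration under assumptions, not the report's circuit: bit-level propagate and generate are taken as P = X XOR Y and G = X AND Y, the 4-bit group terms follow the GPBLOCK equations, and the group carries are combined serially rather than with the exact PB8/GB8 tree. The function names are hypothetical.

```python
def carry_out_lookahead(x, y, cin, n=16):
    """Carry-out of x + y + cin, built from 4-bit group generate/propagate terms."""
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # bit propagate
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]  # bit generate
    carry = cin
    for blk in range(n // 4):
        p0, p1, p2, p3 = p[4 * blk:4 * blk + 4]
        g0, g1, g2, g3 = g[4 * blk:4 * blk + 4]
        pb4 = p3 & p2 & p1 & p0
        gb4 = g3 | (g2 & p3) | (g1 & p3 & p2) | (g0 & p3 & p2 & p1)
        carry = gb4 | (pb4 & carry)  # carry into the next 4-bit group
    return carry

def abs_diff(a, b, n=16):
    """|a - b| using the carry-out of a + NOT b + 1 to pick the subtraction order."""
    mask = (1 << n) - 1
    a_ge_b = carry_out_lookahead(a, ~b & mask, 1, n)
    return (a - b) if a_ge_b else (b - a)
```

The carry-out alone tells which operand is larger, so a single computation unit can produce |A - B| for the GCD update instead of two separate subtractors.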

Before implementing the primitive-based version of the GCDSAD circuit (Optimized GCDSAD),
a version using the built-in ABS function, which does the same job as the SAD unit, was
tried in order to see how the compiler maps such a function to hardware. A simple GCDSAD
was also designed using high-level component definitions. ABSGCD gave a notably good result
in terms of speed, while the simple GCDSAD was slightly better in terms of area.
(TABLE 11 & 12) compare ABSGCD, the Simple GCDSAD (SGCDSAD), and the Optimized GCDSAD
(OGCDSAD).
TABLE 11: MAPPING REPORT: GCDSAD
| HW Stat. | Total | ABSGCD | SGCDSAD | OGCDSAD |
|---|---|---|---|---|
| # Slices | 3,758 | 22 (0.59%) | 20 (0.53%) | 18 (0.48%) |
| # LUTs | 15,032 | 73 (0.49%) | 62 (0.41%) | 59 (0.39%) |
| # MUXCYs | 7,516 | 52 (0.69%) | 16 (0.21%) | 16 (0.21%) |
| # Registers | 30,064 | 34 (0.11%) | 33 (0.11%) | 41 (0.14%) |
| # DSP | 38 | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) |
| # IOBs | 226 | 52 (23.01%) | 52 (23.01%) | 52 (23.01%) |
TABLE 12: SYNTHESIS AND TIMING REPORT - GCDSAD
| Macros Statistics | ABS | SSAD | OSAD |
|---|---|---|---|
| # 16-bit Add/Sub/Acc | 2 | 1 | 0 |
| # Registers | 33 | 33 | 33 |
| # 2-1 MUX (1, 16-bit) | 5 | 5 | 3 |
| # XOR | 15 | 33 | 0 |
| # DSP | 0 | 0 | 0 |
| # FSM | 1 | 1 | 0 |

| Time Element (ns) | ABS | SSAD | OSAD |
|---|---|---|---|
| R to R Paths | 5.40 | 10.92 | 8.40 |
| In to R Paths | 3.19 | 3.19 | 2.96 |
| R to Out Paths | 5.80 | 13.11 | 3.63 |
| In to Out Paths | 0 | 0 | 0 |
| Total Time Delay | 14.39 | 27.22 | 14.99 |

Overview of the Results
Recalling all the design architectures and their area/time figures, this section presents
the conclusions in numbers and charts (TABLE 13, Fig. 22, and Fig. 23).
Summary of the results for the different architectures
TABLE 13: OVERVIEW OF THE RESULTS (O; OPTIMIZED WITH PRIMITIVES)
| HW Stat. | Total | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|---|
| # Slices | 3,758 | 2,697 (71.77%) | 21 (0.56%) | 15 (0.40%) | 18 (0.48%) |
| # LUTs | 15,032 | 8,776 (58.38%) | 58 (0.39%) | 51 (0.34%) | 59 (0.39%) |
| # MUXCYs | 7,516 | 4,760 (63.33%) | 48 (0.64%) | 32 (0.43%) | 16 (0.21%) |
| # Registers | 30,064 | 0 (0.00%) | 33 (0.11%) | 51 (0.17%) | 41 (0.14%) |
| # IOBs | 226 | 48 (21.24%) | 52 (23.01%) | 52 (23.01%) | 52 (23.01%) |

| Macros Statistics | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|
| # 16-bit Add/Sub/Acc | 198 | 2 | 0 | 0 |
| # Registers | 0 | 33 | 33 | 33 |
| # 16-bit Comparators (=,>) | 199 | 2 | 0 | 0 |
| # 2-1 MUX (1, 16-bit) | 298 | 8 | 3 | 3 |

| Time Element (ns) | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|
| Register to Register Paths | 0.00 | 5.07 | 3.31 | 8.40 |
| Input to Register Paths | 0.00 | 2.96 | 2.96 | 2.96 |
| Register to Out-pad Paths | 0.00 | 6.59 | 3.67 | 3.63 |
| In-pad to Out-pad Paths | 478.00 | 0.00 | 0.00 | 0.00 |
| Total Time Delay | 478.00 | 14.62 | 9.94 | 14.99 |

The results in a Chart:
Fig.22 Results in a Chart (FOR_LOOP dominated)
The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)
Fig.23 The Area-Delay Product
(Chart series in Fig. 22 and Fig. 23: #Slices, Time Delay, Area*Delay)

Summary and Conclusion
(TABLE 14) summarizes the work that has been done and compares all the versions of the
Euclidean GCD design and their implementation on the Xilinx Spartan-6 FPGA.
TABLE 14: COMPARISON SUMMARY BETWEEN ARCHITECTURES
| Level | Architecture | Rank | Area*Delay | Macros & Primitives | Design features & Synthesizability |
|---|---|---|---|---|---|
| Behavioural | For Loop | 7 | 1289166 | 198 16bit Sub, 199 16bit Comp, 298 16bit MUX-2x1 | Depends on max. #iterations, works faster in larger devices |
| Behavioural | ASM2FSM | 4 | 307.02 | 2 16bit Loadable Accumulators, 33 Registers, 2 16bit Comp, 2 16bit and 6 1-bit MUX-2x1 | FSM is the Top Entity, Depends on compiler Macros, Still no control over placement |
| Simple Structural | GCD2Sub | 2 | 238.88 | 2 16bit Loadable Accumulators, 3 1-bit MUX-2x1 | Datapath & FSM Control Units, Depends on components def. |
| Simple Structural | GCDSAD | 6 | 544.4 | 16bit Add (w Cin), 33 Registers, 1 16bit and 32 1-bit XOR | DP & CU, Dep. on components, Utilizes SAD circuits (1 ADD) |
| Simple Structural | GCDABS | 5 | 316.58 | 16bit Sub, 16bit Add, 15 XOR | DP & CU, Dep. on components and the compiler Macros |
| Optimized Structural | GCD2Sub | 1 | 149.1 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU, Utilizes Primitives (LUT2,3 & CARRY4), Fast, Full Control over placement |
| Optimized Structural | GCDSAD | 3 | 269.82 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU, Utilizes Primitives (LUT2,3 & CARRY4), Smart, Full Control over placement |
The overall results show that the optimized GCD2SUB design has the lowest Area-Delay
product among the models in this project, which means it provides fast computation of
the Euclidean GCD Algorithm while saving a great deal of area. Apart from the slow, limited,
and area-consuming FOR_LOOP GCD model, the other architectures were not far from the
optimized GCD2SUB model, especially the Simple GCD2SUB and the Optimized GCDSAD. However,
the Optimized GCDSAD could become better than the Simple GCD2SUB because of the full control
over placement, which might improve its Area-Delay product significantly. Furthermore, some
components, such as the FSM Flip-Flops and MUXs, could be implemented using primitives and
placed efficiently in order to reduce the Area-Delay product further.
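The Area*Delay figures in (TABLE 14) are consistent with multiplying the number of occupied slices by the total time delay reported for each architecture. The short Python sketch below reproduces the products and the ranking from the slice counts and delays given in the mapping and timing tables; the dictionary layout and names are illustrative assumptions.

```python
# (occupied slices, total time delay in ns) taken from the mapping/timing tables
designs = {
    "FOR_LOOP": (2697, 478.0),
    "ASM2FSM": (21, 14.62),
    "Simple GCD2SUB": (16, 14.93),
    "Simple GCDSAD": (20, 27.22),
    "GCDABS": (22, 14.39),
    "Optimized GCD2SUB": (15, 9.94),
    "Optimized GCDSAD": (18, 14.99),
}

def area_delay(slices, delay_ns):
    """Area-Delay product, using occupied slices as the area measure."""
    return slices * delay_ns

products = {name: area_delay(*values) for name, values in designs.items()}
ranking = sorted(products, key=products.get)  # best (smallest product) first

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {products[name]:.2f}")
```

The ranking starts with the Optimized GCD2SUB (15 slices x 9.94 ns = 149.1) and ends with the FOR_LOOP model (2,697 slices x 478 ns = 1,289,166), matching the rank column of the table.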

Final Thoughts and Suggestions
The Euclidean GCD Algorithm design journey has brought great experience: from the
infinite loop to loop limitations, then through the RTL behavioural architecture to the
structural architecture, and finally to the optimized design, where Primitives and Macros
were utilized to reduce the Area-Delay product. The pros and cons of the main architectures
can be summarized as follows:
Behavioural
- Apart from “Design Goals and Strategies,” there is no control at any level over the
implemented circuit or the placement and routing of the design.
✤ It is a high-level coding approach, which is easier to write and manage.
Structural
- The design could be much more complex than the behavioural one, especially with Primitives.
✤ By utilizing Primitives & Macros efficiently, full control over the placement is gained.
✤ Having the data-path and control units separated allows for better optimization.
It is important to note that utilizing primitives and macros efficiently helps to reduce the
hops through the interconnect and to maintain a logical and consistent data flow in the
design. For instance, a 16-bit Carry-Look-Ahead Subtractor (CLASub) is assumed to be faster
than the ripple carry. However, utilizing the CARRY4 primitive to benefit from the dedicated
Carry-chain, with the help of the propagation function (i.e., Half-Adder SUM - XOR), gives an
Area-Delay product about 10 times better than using a full CLASub with primitives.
Finally, it is fair to mention that ASM2FSM, Simple GCD2SUB, and Simple GCDSAD were also
implemented with the DSP enforced as a primitive. The time delay in all cases was not
promising, and the occupied area inside the FPGA was greater than when using the CLB
slices. However, it seems possible to utilize the DSP itself in order to benefit from its
features and perform the whole computation of the Euclidean GCD Algorithm. This might be a
reasonable suggestion for future work related to GCD design on FPGAs, in addition to
learning more about the tools and their helpful features.

Bibliography
1. Wikipedia, (2014). Greatest common divisor. [online] Available at: http://en.wikipedia.org/wiki/Greatest_common_divisor [Accessed 15 Dec. 2014].

2. EE254L - GCD (University of Southern California): Subject Lab Manual: http://www-classes.usc.edu/engr/ee-s/254/ee254l_lab_manual/.

3. Lesson 93 - Example 63: GCD Algorithm - VHDL while Statement [A tutorial on datapaths and state machines for computing the GCD / While Loops accompanies the book Digital Design Using Digilent FPGA Boards]. (Nov 2012). LBEbooks. Retrieved from https://www.youtube.com/watch?v=DMSaYhD1GkM.

4. “Spartan-6 Family Overview (v2.0),” Xilinx, 2011. http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf.

5. “Spartan-6 FPGA Configurable Logic Block, UG384 (v1.1),” Xilinx, 2010. http://www.xilinx.com/support/documentation/user_guides/ug384.pdf.

6. “Spartan-6 Libraries Guide for HDL Designs, UG615 (v 14.1),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.

7. “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, UG687 (v 13.4),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/xst_v6s6.pdf.

8. “ISE In-Depth Tutorial, UG695 (v 12.1),” Xilinx, 2009. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.

9. Sima, M. (2014). ELEC669 Reconfigurable Computing. [Lecture Notes]

10. Devi, R., Singh, J. and Singh, M. (2011). VHDL Implementation of GCD Processor with Built in Self Test Feature. International Journal of Computer Applications, 25(2), pp.50-54.

11. C.P, N. and M. Ravi Kumar, K. (2014). Efficient Comparator based Sum of Absolute Differences Architecture for Digital Image Processing Applications. International Journal of Computer Applications, 96(4), pp.17-24.

12. TechOnlineIndia, (2014). An introduction to FPGA timing analysis [online] Available at: http://www.techonlineindia.com/techonline/news_and_analysis/170126/introduction-fpga-timing-analysis.

