Compared the performance of several branch predictor types with different return address stack (RAS) and branch target buffer (BTB) configurations for three benchmarks, namely GCC, GO and ANAGRAM, using the SimpleScalar simulator. Cycles per instruction (CPI), address rate and direction rate were the metrics used to compare the configurations and draw conclusions.
Cache Design for an Alpha Microprocessor (Bharat Biyani)
Fine-tuned the cache hierarchy of an Alpha microprocessor for three benchmarks, namely GCC, ANAGRAM and GO, by modifying cache design parameters such as the number of cache levels, cache types (in the case of more than one level of cache), sizes, associativity, block sizes and block replacement policy. Compared the performance of the individual benchmarks for different configurations based on CPI and a cost function.
- Defined the specifications and designed an architecture of the MSDAP chip that performs convolution of two signals in least possible area & power.
- Implemented an RTL model of the MSDAP chip, which consists of a controller, ALU, memories and a serial communication unit.
- Synthesized the design in Synopsys Design Vision and verified its functionality using ModelSim.
- Final physical design was generated using the IC Compiler.
Designed a 21b X 21b multiplier using Booth-2 algorithm by constructing schematic of decoder, partial product generation & compression and Adder (Carry Look Ahead). Performed Hspice simulation to verify the correct functionality, library characterization of assembled Netlist using Siliconsmart ACE, RTL synthesis of generated library. Timing and power consumed is analyzed through static timing analysis using Synopsys Primetime.
This project, implemented in VHDL, is the design of a multi-level cache memory for a uni-processor system. The document also includes some of the simulation and synthesis results.
Designed a fully customized 128x10b SRAM by constructing schematic & virtuoso layout of memory cell array (6T cell), row & column decoder, pre-charge circuit, write circuit and sense amplifier using Cadence. Manually placed and routed all components, performed DRC & LVS debugging of constructed schematic and layout and ran PEX to generate the final Netlist, Hspice Spectre simulation of final design for verification of the correct functionality and analysis of best read, best write cycles & the worst case timing for read and write. Timing and power consumed is analyzed through STA-Primetime (Static timing Analysis)
Design of a 64-bit ultra low latency memory using 6T SRAM cells and PDK 45nm technology on CADENCE to simulate the results of our chosen implementation.
Design and Simulation Low power SRAM Circuits (ijsrd.com)
SRAMs), focusing on optimizing delay and power. As the scaling trends in the speed and power of SRAMs with size and technology and find that the SRAM delay scales as the logarithm of its size as long as the interconnect delay is negligible. Non-scaling of threshold mismatches with process scaling, causes the signal swings in the bitlines and data lines also not to scale, leading to an increase in the relative delay of an SRAM, across technology generations. Appropriate methods for reduction of power consumption were studied such as capacitance reduction, very low operating voltages, DC and AC current reduction and suppression of leakage currents to name a few.. Many of reviewed techniques are applicable to other applications such as ASICs, DSPs, etc. Battery and solar-cell operation requires an operating voltage environment in low voltage area. These conditions demand new design approaches and more sophisticated concepts to retain high device reliability. The proposed techniques (USRS and LPRS) are topology based and hence easier to implement.
Static-Noise-Margin Analysis of Modified 6T SRAM Cell during Read Operation (idescitation)
As modern technology is spreading fast, it is very
important to design low power, high performance, fast
responding SRAM(Static Random Access Memory) since they
are critical component in high performance processors. In
this paper we discuss about the noise effect of different SRAM
circuits during read operation which hinders the stability of
the SRAM cell. This paper also represents a modified 6T
SRAM cell which increases the cell stability without
increasing transistor count.
In the project#1, IBM 130nm process is used to design and manual layout a 128 word SRAM, with word size 10bits. Cadence's Virtuoso is applied for layout editing, DRC and LVS running and circuit simulation.
250nm Technology Based Low Power SRAM Memory (iosrjce)
High integration density, low power and fast performance are all critical parameters in the design of
memory blocks. For Static Random Access Memories (SRAMs), focusing on optimizing dynamic power, the concept of
virtual source transistors is used for removing the direct connection between VDD and GND.
Also, the stacking effect can be reduced by switching off the stack transistors when the memory is idle, and the
leakage current by using SVL techniques. This paper discusses the evolution of 9T SRAM circuits in terms of low
power consumption. The whole circuit verification is done with the Tanner tool: the schematic of the
SRAM cell is designed in S-Edit, the netlist simulation is done using T-Spice, and the waveforms are analyzed
in W-Edit.
Implementation of High Reliable 6T SRAM Cell Design (iosrjce)
Memory can be formed with the integration of large number of basic storing element called cells.
SRAM cell is one of the basic storing unit of volatile semiconductor memory that stores binary logic '1' or '0' bit.
Modified read and write circuits were proposed in this paper to address incorrect read and write operations in
conventional 6T SRAM cell design available in open literature. Design of a new highly reliable 6T SRAM cell
design is proposed with reliable read, write operations and negative bit line voltage (NBLV). Simulations are
carried out using MENTOR GRAPHICS
ASIC DESIGN OF MINI-STEREO DIGITAL AUDIO PROCESSOR UNDER SMIC 180NM TECHNOLOGY (Ilango Jeyasubramanian)
- Designed and analyzed a complete MSDAP with optimized convolution computation by only shifts and adds using power-of-2 coefficients. Synthesized the chip through high level architecture design (C Program), Logic synthesis (Synopsys Design Compiler) and Physical Synthesis (Synopsys IC compiler).
- Achieved a low power consumption of 3.1438mW at 29.186Mhz clock frequency, with core utilization of 70% and chip area of 1.29mm2.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
A Novel Architecture Design & Characterization of CAM Controller IP Core with... (idescitation)
Content Addressable Memory (CAM) is a special
purpose Random Access Memory device that can be accessed
by searching for data content. This paper describes a novel
architecture design and characterization of a reusable soft IP
core for CAM controller with sequential replacement policy,
so as to improve the match ratio of the CAM memory. The
proposed design was modeled using Verilog HDL and also
prototyped in Xilinx® SPARTAN family FPGA.The power
analysis was done using XPower® analyzer and the hardware
test result was obtained by ChipScope® Pro logic analyzer.
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA (VLSICS Design)
Video compression is essential to meet technological demands such as low power, less memory and fast transfer rate for a range of devices and multimedia applications. Video compression is primarily achieved by the Motion Estimation (ME) process in a video encoder, which contributes significant compression gain. Sum of Absolute Difference (SAD) is used as the distortion metric in the ME process. In this paper, an efficient Absolute Difference (AD) circuit is proposed which uses a Brent Kung Adder (BKA) and a comparator based on the modified 1’s complement principle and the conditional sum adder scheme. Results show that the proposed architecture reduces delay by 15% and the number of slice LUTs by 42% compared to the conventional architecture. Simulation and synthesis are done in Xilinx ISE 14.2 using a Virtex 7 FPGA.
Compression is playing a vital role in data transfer. Hence, Digital camera uses JPEG standard to compress the captured image. Hence, it reduces data storage requirements. Here, we proposed FPGA based JPEG encoder. The processing system is coupled with DCT and then it is quantized and then it is prepared for entropy coding to form a JPEG encoder
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM (VLSICS Design)
This paper is devoted to the design of dual core crypto processor for executing both Prime field and binary field instructions. The proposed design is specifically optimized for Field programmable gate array (FPGA) platform. Combination of two different field (prime field GF(p) and Binary field GF(2m)) instructions execution is analysed.The design is implemented in Spartan 3E and virtex5. Both the performance results are compared. The implementation result shows the execution of parallelism using dual field instructions
Queue Size Trade Off with Modulation in 802.15.4 for Wireless Sensor Networks (CSCJournals)
In this paper we analyze the performance of 802.15.4 Wireless Sensor Network (WSN) and derive the queue size trade off for different modulation schemes like: Minimum Shift Keying (MSK), Quadrature Amplitude Modulation of 64 bits (QAM_64) and Binary Phase Shift Keying (BPSK) at the radio transmitter of different types of devices in IEEE 802.15.4 for WSN. It is concluded that if queue size at the PAN coordinator of 802.15.4 wireless sensor network is to be taken into consideration then QAM_64 is recommended. Also it has been concluded that if the queue size at the GTS or Non GTS end device is to be considered then BPSK should be preferred. Our results can be used for planning and deploying IEEE 802.15.4 based wireless sensor networks with specific performance demands. Overall it has been revealed that there is trade off for using various modulation schemes in WSN devices.
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr Algorithm (IJERA Editor)
Error correcting codes are used to correct the data from the corrupted signal due to noise and interference. There
are many error correcting codes. Among them turbo codes is considered to be the best because it is very close to
the Shannon theoretical limit. The MAP algorithm is commonly used in the turbo decoder. Among the different
versions of the MAP algorithm Constant log BCJR algorithm have less complexity and good error performance.
The Constant log BCJR algorithm can be easily designed using look up table which reduces the memory
consumption. The proposed Constant log BCJR decoder is designed to decode two blocks of data at a time, this
increases the throughput. The complexity of the decoder is further reduced by the use of the add compare select
(ACS) units and registers. The proposed decoder is simulated using Xilinx ISE and synthesized using Sparten3
FPGA and found out that Constant log BCJR decoder utilized less amount of memory and power than the LUT
log BCJR decoder.
Customizable Microprocessor design on Nexys 3 Spartan FPGA Board (Bharat Biyani)
- Designed a 4 stage pipelined, 16-bit customizable microprocessor in VHDL which can execute instructions (direct & memory mapped addressing modes), handle interrupts (IVT based), communicate with IO devices including keyboard and VGA monitor and facilitates with a single port BLOCK RAM for Stack, Instruction , Data & IVT memory. Keyboard and VGA controller provides input-output interface gives user flexibility of keying in the instructions through Keyboard that is interfaced with Nexys 3 through USB 2.0; VGA interface to display the output. Keyboard and VGA controllers are also coded in VHDL.
- Implemented the VHDL code on Nexys 3 Spartan FPGA board which involved simulation, synthesis and bit file generation using Xilinx ISE,programming the FPGA with Digilent Adept.
- Employed the debug mode to make the design more user friendly
- An outhouse project completed at Progressive Powercon Pvt. Ltd., Pune, India. Aim is to design and implement a low cost solar electricity generation system for household use.
- Designed DC-DC Converter, Inverter, Micro controller circuitry and some additional accessories to improve the overall performance of the system.
- PIC 16f876A is used as a microcontroller fro PWM Control. All the simulation are performed in PSIM 6.0. PCB layout is carried out in ALTIUM DESIGNER Summer 09 Software.
- Designed a standard cells with gates including Inverter, two input NAND, two Input NOR, two Input XOR, 2:1 Multiplexer, AOI22, OAI3222 and D Flip Flop with minimum area & diffusion breaks by using IBM130 nm process technology.
- Involved library characterization using NCX, RTL synthesis of VHDL code of 32 bit ALU Chip design using Synopsys Design Vision.
Designed a differential input and single ended output high gain (>= 85 dB) operational amplifier using CMOS 0.35um technology using a single independent current source. The amplifier was also designed to achieve a CMRR (>= 80dB), Average Slew Rate (>= 15 V/us), UGF (>= 15 MHz) & Output Voltage Swing ( >= 1.4V). The maximum power dissipation through the complete circuit including the current source branch was limited to 0.3 mW.
Automated Traffic Density Detection and Speed Monitoring (Bharat Biyani)
Designed and proposed an RF system to detect speed and traffic density with a RADAR unit in remote areas and to provide real-time monitoring of the traffic density data with a satellite link. Based on calculated parameters, required RF components from real vendors were identified. The system model is then simulated with the obtained parameters in AWR Virtual System Simulator and analyzed nominal and worst case cascaded gain, noise figure, P1dB and OIP3. The general deviation expected in these parameters was determined by performing yield analysis.
32 bit ALU Chip Design using IBM 130nm process technology (Bharat Biyani)
- Implemented a 32 bit Arithmetic/Logic unit in VHDL using behavioral Modeling which involves all basic ALU operations including special functionality like binary-to-grey code conversion, parity check, sum of first N numbers. Simulation is performed in ModelSim IDE.
- Involved design using Cadence (Virtuoso Layout/Schematic) and Hspice simulation of standard library cell.
- Involved library characterization using NCX, RTL synthesis of VHDL code using Synopsys Design Vision, auto placement & routing using Encounter, static timing analysis using Synopsys Primetime.
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines (Christina Lin)
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Low power architecture of logic gates using adiabatic techniques (nooriasukmaningtyas)
The growing significance of portable systems to limit power consumption in ultra-large-scale-integration chips of very high density, has recently led to rapid and inventive progresses in low-power design. The most effective technique is adiabatic logic circuit design in energy-efficient hardware. This paper presents two adiabatic approaches for the design of low power circuits, modified positive feedback adiabatic logic (modified PFAL) and the other is direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design. By improving the performance of basic gates, one can improvise the whole system performance. In this paper proposed circuit design of the low power architecture of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches and their results are analyzed for powerdissipation, delay, power-delay-product and rise time and compared with the other adiabatic techniques along with the conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs with DC-DB PFAL technique outperform with the percentage improvement of 65% for NOR gate and 7% for NAND gate and 34% for XNOR gate over the modified PFAL techniques at 10 MHz respectively.
Adaptive synchronous sliding control for a robot manipulator based on neural ... (IJECEIAES)
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
The University of Texas at Dallas
Department of Electrical Engineering
EECE/CS 6304: COMPUTER ARCHITECTURE
PROJECT #2
“ANALYSIS OF DIFFERENT TYPES OF BRANCH PREDICTORS”
Submitted by,
Bharat Biyani (2021152193)
Shree Viswa Shamanthan L D (2021180127)
INTRODUCTION
In computer architecture, a branch predictor is a digital circuit that tries to guess
which way a branch will go before this is known for sure (i.e., before its execution). The purpose
of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play
a critical role in achieving high effective performance in many modern pipelined microprocessor
architectures such as x86.
In this project, we analyze the behavior of different branch predictor configurations on
three well-recognized benchmarks, namely GCC, ANAGRAM and GO. We used the SimpleScalar
sim-outorder simulator, which models the execution aspects of the Alpha 21264. The simulations
provide the CPI values, which we use to compare the configurations across the benchmarks.
We used three types of hardware-based branch prediction strategies:
1) Bimodal Predictor: It is a simple predictor, which uses 2-bit saturating counters to predict if a
given branch is likely to be taken or not.
2) Two Level Predictor: A two-level adaptive predictor with an n-bit history can predict
any repetitive sequence with any period if all n-bit sub-sequences are different. The
advantage of the two-level adaptive predictor is that it can quickly learn to predict an
arbitrary repetitive pattern.
3) Combined Predictor: A hybrid predictor also called combined predictor implements more
than one prediction mechanism. The final prediction is based either on a meta-predictor that
remembers which of the predictors has made the best predictions in the past or a majority
vote function based on an odd number of different predictors.
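As an illustration of the bimodal scheme described in 1), the following is a minimal C sketch of a table of 2-bit saturating counters. It is not SimpleScalar's implementation: the names bimod_t, bimod_init, bimod_index, bimod_predict and bimod_update are hypothetical, the table size of 256 only mirrors the -bpred:bimod 256 option used in the simulations, and the index function assumes 4-byte instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIMOD_ENTRIES 256u  /* assumed table size; must be a power of two */

/* Each entry is a 2-bit saturating counter:
   0 and 1 predict not taken, 2 and 3 predict taken. */
typedef struct {
    uint8_t counter[BIMOD_ENTRIES];
} bimod_t;

/* Start every counter in the weakly-not-taken state (an assumption). */
void bimod_init(bimod_t *p)
{
    assert(p != NULL);
    for (unsigned i = 0; i < BIMOD_ENTRIES; i++)
        p->counter[i] = 1;
}

/* Hypothetical index function: drop the byte offset of a 4-byte
   instruction and keep the low-order bits of the branch address. */
static unsigned bimod_index(uint32_t pc)
{
    return (pc >> 2) & (BIMOD_ENTRIES - 1u);
}

/* Returns 1 if the branch at pc is predicted taken, 0 otherwise. */
int bimod_predict(const bimod_t *p, uint32_t pc)
{
    return p->counter[bimod_index(pc)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating at 0 and 3. */
void bimod_update(bimod_t *p, uint32_t pc, int taken)
{
    uint8_t *c = &p->counter[bimod_index(pc)];
    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}
```

Because the counter has to move through two states before the prediction flips, a single mispredicted iteration (for example a loop exit) does not immediately change the prediction for a branch that is otherwise strongly biased.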
Part 1: Performance analysis of different types of branch predictors
The simulation is done for different configurations of the Return Address Stack (RAS) and
different types of branch predictors.
Baseline default RAS: Bimodal predictor with the default value for RAS.
-bpred bimod -bpred:bimod 256 -bpred:ras 8 -bpred:btb 64 2
2 Level Predictor: Two-level adaptive predictor with one first-level entry, a 4-bit history and 256 second-level entries.
-bpred 2lev -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2
Comb: Combines a two-level and a bimodal predictor, selected through a 256-entry meta table.
-bpred comb -bpred:comb 256 -bpred:bimod 256 -bpred:2lev 1 256 4 0 -bpred:ras 8 -bpred:btb 64 2
RAS 4: Change the return address stack (RAS) size to 4.
-bpred bimod -bpred:bimod 256 -bpred:ras 4 -bpred:btb 64 2
RAS 16: Change the return address stack (RAS) size to 16.
-bpred bimod -bpred:bimod 256 -bpred:ras 16 -bpred:btb 64 2
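To make the role of the RAS size concrete, the following is a minimal C sketch of a fixed-size return address stack that wraps around when it overflows. It is an assumption-laden illustration rather than the simulator's code: the names ras_t, ras_init, ras_push and ras_pop and the limit RAS_MAX are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAS_MAX 16u  /* assumed upper bound on the stack size */

typedef struct {
    uint32_t slot[RAS_MAX];  /* predicted return addresses */
    unsigned size;           /* number of slots in use, e.g. 4, 8 or 16 */
    unsigned tos;            /* index of the most recently pushed slot */
} ras_t;

void ras_init(ras_t *r, unsigned size)
{
    assert(r != NULL && size >= 1 && size <= RAS_MAX);
    r->size = size;
    r->tos = size - 1;
    for (unsigned i = 0; i < RAS_MAX; i++)
        r->slot[i] = 0;
}

/* On a call: record the return address, overwriting the oldest
   entry once more than `size` calls are outstanding. */
void ras_push(ras_t *r, uint32_t return_addr)
{
    r->tos = (r->tos + 1) % r->size;
    r->slot[r->tos] = return_addr;
}

/* On a return: use the top entry as the predicted target. */
uint32_t ras_pop(ras_t *r)
{
    uint32_t target = r->slot[r->tos];
    r->tos = (r->tos + r->size - 1) % r->size;
    return target;
}
```

With this sketch, a call chain deeper than the stack size overwrites the oldest return address, so only the outermost returns are mispredicted. A benchmark whose call depth rarely exceeds the smallest configured size would therefore see little difference between RAS sizes.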
Performance Analysis based on CPI
Sr. No.  Configuration       Benchmarks
                             GCC      ANAGRAM  GO
1        Baseline            0.95     0.4674   0.7571
2        2 Level Predictor   0.9822   0.4605   0.7893
3        Comb                0.8678   0.4546   0.7516
4        Bimod: RAS 4        0.9538   0.4678   0.7574
5        Bimod: RAS 16       0.9498   0.4674   0.7571
Graphical Representation with above CPI
[Figure: bar chart of CPI (axis 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, RAS 4 and RAS 16 configurations, with one series each for ANAGRAM, GO and GCC.]
The above graph displays the performance of the different branch predictor configurations.
Analysis: Benchmark – GCC vs BP Configurations
GCC has a higher CPI than the other benchmarks in every configuration. Its CPI is lowest
for the combination of two-level and bimodal predictors (Comb, 0.8678) and highest for the
2 level predictor (0.9822).
Analysis: Benchmark – ANAGRAM vs BP Configurations
From the above graph, we can infer that ANAGRAM has a lower CPI than the
other two benchmarks. The performance of ANAGRAM is fairly constant across all the
configurations of the branch predictor. CPI is lowest for the combination of two-level and
bimodal predictors (Comb, 0.4546).
Analysis: Benchmark – GO vs BP Configurations
The above graph shows that GO performs better than GCC. The performance of GO is
almost constant across all the configurations of the branch predictor, and CPI is lowest for
the combination of two-level and bimodal predictors (Comb, 0.7516). For the bimodal
predictor, increasing the return address stack size from 4 to 8 lowers CPI only from 0.7574 to
0.7571, and increasing it further to 16 gives no change, so RAS size does not have
much impact on CPI.
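The CPI comparisons above can be expressed as a relative change against the baseline. The helper below is a small C sketch of that arithmetic; the function name cpi_improvement_pct is hypothetical and is not part of the simulator.

```c
#include <assert.h>

/* Percentage by which new_cpi improves on base_cpi.
   Positive values mean fewer cycles per instruction than the
   baseline; negative values mean a slowdown. */
double cpi_improvement_pct(double base_cpi, double new_cpi)
{
    assert(base_cpi > 0.0);
    return (base_cpi - new_cpi) / base_cpi * 100.0;
}
```

For GCC, the CPI table gives 0.95 for the baseline and 0.8678 for Comb, an improvement of about 8.65 percent, while the 2 level predictor at 0.9822 yields a negative value, i.e. a slowdown relative to the baseline.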
Performance Analysis based on Address Hit Rates
Sr. No.  Configuration       Benchmarks
                             GCC      ANAGRAM  GO
1        Baseline            0.6734   0.956    0.7071
2        2 Level Predictor   0.6253   0.9575   0.6484
3        Comb                0.8339   0.9694   0.709
4        Bimod: RAS 4        0.6697   0.9555   0.7067
5        Bimod: RAS 16       0.6736   0.9605   0.7071
Graphical Representation with above Address Hit Rates
The above graph shows the performance of the different branch predictor configurations
for the different benchmarks.
For ANAGRAM, the address hit rate stays above 0.95 in every configuration; it is lowest
for bimod with Return Address Stack (RAS) size 4 (0.9555) and highest for Comb (0.9694).
For GO, the address hit rate is about 0.71 in every configuration except the 2 level
predictor, where it drops to 0.6484.
For GCC, the address hit rate is lowest for the 2 level predictor (0.6253) and highest
for Comb (0.8339).
[Figure: bar chart of address hit rates (axis 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, Bimod: RAS 4 and Bimod: RAS 16 configurations, with one series each for GCC, GO and ANAGRAM.]
Performance Analysis based on Direction Hit Rates
Sr. No.  Configuration       Benchmarks
                             GCC      ANAGRAM  GO
1        Baseline            0.6734   0.9605   0.7929
2        2 Level Predictor   0.7919   0.9614   0.7372
3        Comb                0.8617   0.9738   0.7978
4        Bimod: RAS 4        0.8431   0.9605   0.7929
5        Bimod: RAS 16       0.8431   0.9605   0.7929
The graph of the Direction Hit Rates for every benchmark provides
more information on the effect of branch predictor configurations on different benchmarks.
Graphical Representation with above Direction Hit Rates
The Direction Hit Rates stay fairly constant across configurations for ANAGRAM and GO.
ANAGRAM has higher direction hit rates than the other two benchmarks. The 2 level
predictor gives the lowest direction hit rate for GO (0.7372), while Comb gives the highest for
all three benchmarks. Changing the Return Address Stack size from 8 to 4 or 16 leaves the
direction hit rate unchanged for ANAGRAM and GO; for GCC, the table reports 0.8431 for
both RAS 4 and RAS 16 against 0.6734 for the baseline.
[Figure: bar chart of direction hit rates (axis 0 to 1.2) for the Baseline, 2 Level Predictor, Comb, Bimod: RAS 4 and Bimod: RAS 16 configurations, with one series each for GCC, GO and ANAGRAM.]
Part 2: Modification of the code to accommodate address misses
We carried out modifications in the following two files in simplescalar.
1) bpred.h
2) bpred.c
1) Changes in file bpred.h:
----------------
/* branch predictor def */
struct bpred_t {
  ------
  } dirpred;
  struct {
    --------
  } retstack;
  /* stats */
  counter_t addr_hits;    /* num correct addr-predictions */
  counter_t dir_hits;     /* num correct dir-predictions (incl addr) */
  counter_t addr_misses;  /* num addr_misses (added) */
  counter_t used_ras;     /* num RAS predictions used */
  counter_t used_bimod;   /* num bimodal predictions used (BPredComb) */
  -----------
};
2) Changes in file bpred.c:
-----------
  sprintf(buf, "%s.dir_hits", name);
  stat_reg_counter(sdb, buf,
                   "total number of direction-predicted hits "
                   "(includes addr-hits)",
                   &pred->dir_hits, 0, NULL);
  /* added: register the new addr_misses statistic */
  sprintf(buf, "%s.addr_misses", name);
  stat_reg_counter(sdb, buf, "total number of addr-misses",
                   &pred->addr_misses, 0, NULL);
-----------
  if (bpred == NULL)
    return;
  bpred->dir_hits = 0;
  bpred->addr_misses = 0;  /* added: clear the new counter */
-----------
  /* Have a branch here */
  if (correct)
    pred->addr_hits++;
  if (!!pred_taken == !!taken)
    pred->dir_hits++;
  else
    pred->misses++;
  /* added: each branch reaching this point counts as either a dir_hit
     or a miss, so misses + dir_hits is the number of branches seen;
     subtracting the correct address predictions leaves the address misses */
  pred->addr_misses = (pred->misses + pred->dir_hits - pred->addr_hits);
-----------
-----------
}
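The counter arithmetic in the bpred.c change can be checked in isolation. The following is a standalone C sketch, not a drop-in for SimpleScalar: the struct bp_stats, the functions bp_stats_update and addr_hit_rate, and the uint64_t stand-in for counter_t are assumptions made for illustration. It repeats the update logic shown above and derives an address hit rate as addr_hits / (addr_hits + addr_misses).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t counter_t;  /* stand-in for the simulator's counter type */

struct bp_stats {
    counter_t addr_hits;    /* correct address predictions */
    counter_t dir_hits;     /* correct direction predictions */
    counter_t misses;       /* incorrect direction predictions */
    counter_t addr_misses;  /* derived: branches without a correct address */
};

/* correct: the predicted target matched the actual target.
   pred_taken / taken: predicted and actual branch direction. */
void bp_stats_update(struct bp_stats *s, int correct, int pred_taken, int taken)
{
    assert(s != NULL);
    if (correct)
        s->addr_hits++;
    if (!!pred_taken == !!taken)
        s->dir_hits++;
    else
        s->misses++;
    /* Every branch is either a dir_hit or a miss, so their sum is the
       total number of branches seen by this function. */
    s->addr_misses = s->misses + s->dir_hits - s->addr_hits;
}

/* Fraction of branches whose target address was predicted correctly. */
double addr_hit_rate(counter_t addr_hits, counter_t addr_misses)
{
    counter_t total = addr_hits + addr_misses;
    return total ? (double)addr_hits / (double)total : 0.0;
}
```

As a consistency check against the reported results, the GCC baseline counts in Part 3 (2235498 address hits and 1084176 address misses) give a rate of about 0.6734, which agrees with the baseline GCC address hit rate in Part 1.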
Part 3: Comparison of BTB Performance
The simulation is done for the following configurations of the Branch Target Buffer:
Baseline BTB configuration: 64 sets, 2-way associativity
-bpred bimod -bpred:bimod 256 -bpred:btb 64 2
Showing the effect of the number of sets in the BTB with the following options
-bpred bimod -bpred:bimod 256 -bpred:btb 32 2
-bpred bimod -bpred:bimod 256 -bpred:btb 128 2
Showing the effect of associativity when the total size of the BTB is fixed with the following options
-bpred bimod -bpred:bimod 256 -bpred:btb 32 4
-bpred bimod -bpred:bimod 256 -bpred:btb 128 1
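The relationship between these configurations is easier to see as arithmetic: total entries equal sets times ways. The C sketch below computes that and a set index for a branch address. The names btb_cfg_t, btb_entries and btb_set_index are hypothetical, and the index function (dropping two offset bits, then masking) is an assumption for illustration rather than the simulator's exact hash.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned sets;  /* number of sets, assumed to be a power of two */
    unsigned ways;  /* associativity: entries per set */
} btb_cfg_t;

/* Total number of branch targets the BTB can hold. */
unsigned btb_entries(btb_cfg_t cfg)
{
    return cfg.sets * cfg.ways;
}

/* Hypothetical set selection: discard the byte offset of a 4-byte
   instruction and keep the low-order bits that select a set. */
unsigned btb_set_index(btb_cfg_t cfg, uint32_t pc)
{
    assert(cfg.sets != 0 && (cfg.sets & (cfg.sets - 1)) == 0);
    return (pc >> 2) & (cfg.sets - 1);
}
```

Under these assumptions, 64 sets/2 way, 32 sets/4 way and 128 sets/1 way all hold 128 entries, while 32 sets/2 way holds 64 and 128 sets/2 way holds 256. Two branches whose addresses differ by 128 instructions map to the same set in the 128 sets/1 way case and would evict each other, whereas a 4-way set could hold both.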
Performance Analysis based on addr_hits
Sr. No.  Configuration     Benchmarks
                           GCC       ANAGRAM   GO
1        64 sets/2 way     2235498   2771048   1934760
2        32 sets/2 way     2095859   2746365   1832302
3        128 sets/2 way    2389785   2777415   2008597
4        32 sets/4 way     2260256   2775372   1936745
5        128 sets/1 way    2197498   2759944   1893595
Graphical Representation with above addr_hits
[Figure: bar chart of addr_hits (axis 0 to 3000000) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way configurations, with one series each for GO, GCC and ANAGRAM.]
The above graph shows the behavior of the various Branch Target Buffer (BTB)
configurations for the different benchmarks. Of the three benchmarks, ANAGRAM has the
highest address hits, GCC has moderate address hits, and GO has the fewest. For all three
benchmarks, address hits are lowest for the BTB with 32 sets and 2-way set associativity
and highest for the BTB with 128 sets and 2-way set associativity.
Comparison of BTB Performance based on addr_misses
Sr. No.  Configuration     Benchmarks
                           GCC       ANAGRAM   GO
1        64 sets/2 way     1084176   127541    801464
2        32 sets/2 way     1223815   152224    903922
3        128 sets/2 way    929889    121174    727627
4        32 sets/4 way     1059418   123217    799479
5        128 sets/1 way    1122176   138645    842629
Graphical Representation with above addr_misses
From the above graph, as expected, address misses are lowest for the ANAGRAM
benchmark, and GCC has the most address misses of the three benchmarks. Decreasing the
number of sets from 64 to 32 increases the address misses, and increasing it from 64 to 128
decreases them, because a larger number of sets reduces capacity misses. In the
32 sets/4 way configuration, even though the number of sets is decreased from 64 to 32,
address misses decrease because the higher associativity reduces conflict misses. In the
128 sets/1 way configuration, address misses increase despite the larger number of sets,
because direct mapping causes more conflict misses.
[Figure: bar chart of addr_misses (axis 0 to 1400000) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way configurations, with one series each for ANAGRAM, GO and GCC.]
Comparison of BTB Performance based on CPI
Sr. No.  Configuration     Benchmarks
                           GCC      ANAGRAM  GO
1        64 sets/2 way     0.9500   0.4674   0.7571
2        32 sets/2 way     0.9664   0.4711   0.7645
3        128 sets/2 way    0.9304   0.4664   0.7496
4        32 sets/4 way     0.9491   0.4670   0.7575
5        128 sets/1 way    0.9528   0.4686   0.7583
Graphical Representation with above CPI
From the above graph, CPI remains fairly constant for every benchmark. Among the
benchmarks, ANAGRAM has the lowest CPI and GCC has the highest CPI across the various
BTB configurations. CPI is higher for the 32 sets/2 way configuration than for
64 sets/2 way, which has twice as many sets. For the 32 sets/4 way and 128 sets/1 way
configurations, which have the same total number of entries as the baseline, CPI is almost
equal to the 64 sets/2 way CPI. The 128 sets/2 way configuration gives the lowest CPI of
all the configurations.
[Figure: bar chart of CPI (axis 0 to 1.2) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way configurations, with one series each for GCC, ANAGRAM and GO.]
Comparison of BTB Performance based on Branch Predictor Hit Rates
Sr. No.  Configuration     Benchmarks
                           GCC      ANAGRAM  GO
1        64 sets/2 way     0.6779   0.9546   0.6926
2        32 sets/2 way     0.636    0.9476   0.6527
3        128 sets/2 way    0.7221   0.9557   0.7225
4        32 sets/4 way     0.6852   0.9573   0.6931
5        128 sets/1 way    0.665    0.9518   0.6775
Graphical Representation with above Branch Predictor Hit Rates
The above graph shows that the branch predictor hit rate drops for all the
benchmarks when the number of sets in the BTB decreases at fixed associativity. Comparing
the configurations, the hit rate is lowest for every benchmark with the 32 sets, 2-way set
associative BTB.
CONCLUSION
For an optimal branch predictor, it is recommended to have a larger number of BTB sets, but
the tradeoff between cost and performance should be taken into consideration.
For high address hit rates and direction hit rates, the simulation results suggest that
the combination of two-level and bimodal predictors (Comb) is the better configuration.
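[Figure: bar chart of branch predictor hit rates (axis 0 to 1.2) for the 64 sets/2 way, 32 sets/2 way, 128 sets/2 way, 32 sets/4 way and 128 sets/1 way configurations, with one series each for GCC, ANAGRAM and GO.]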