This document summarizes several course projects completed by Setiawan Soekamtoputra for their Master's degree. The projects include:
1) Design of a 32-bit pipelined CPU in Verilog including implementation of an ASIC flow, multiplier with accumulator case study, and pipeline optimization case study.
2) Development of a monitor program for the MC68000 processor in assembly language including common memory and register commands and exception handlers.
3) Implementation of a high-performance pipelined MIPS processor in VHDL including hazard detection and data forwarding units to handle data and branch hazards.
4) Network on chip prototype designs including a 3-node partially connected mesh design in SystemC and
Summary Of Course Projects
1. SUMMARY OF COURSE PROJECTS
SETIAWAN SOEKAMTOPUTRA
MASTER OF ELECTRICAL AND COMPUTER
ENGINEERING
ILLINOIS INSTITUTE OF TECHNOLOGY
DECEMBER 2010 GRADUATE
2. CONTENTS
• 32-bit Pipelined CPU
• MC68K-Based Monitor Program
• Pipelined MIPS Processor with hazard handler and data forwarding
• Simple Mesh-Like and Ring-Like Network on Chip Design
• Small office network design
• 4-bit 10T adder circuit with dual-Vt logic design
• Single-ended 6T vs. standard 6T SRAM bitcell design
• QR Matrix Factorization
• Electro Active Polymer Energy Harvesting Design
• Advanced Encryption Standard Hardware Design
3. SPRING 2009
• Introduction to VLSI Design
• 32-bit Pipelined CPU
• Multiplier with accumulator and pipeline optimization
• Microcomputer
• MC68K-Based Monitor Program
• Advanced Computer Architecture
• Pipelined MIPS Processor with hazard handler and data forwarding
4. 32-BIT PIPELINED CPU
• Hardware Description Language
• Verilog
• Tools
• Compiler: Cadence Verilog-XL
• Logic Synthesis: Synopsys Design Compiler
• Simulation tools: Cadence's SimVision, Mentor Graphics ModelSim
• Place and Route: Cadence SoC Encounter
• Objectives
• Execute the ASIC flow for this implementation in Verilog
• RTL, post-synthesis, and post-place-and-route simulation for verification
• Determine maximum frequency, area, delay, and power
5. 32-BIT PIPELINED CPU
• 32-bit Memory File
• Eight ALU functions: multiplication, add, subtraction, OR, AND, XOR, XNOR
• M: multiplicand, N: multiplier
• Multiplier:
• A radix-2^r multiplier produces N/r partial products
• A radix-4 Booth-encoded multiplier reduces the number of partial products (N/2 vs. N)
• A Wallace tree reduces the number of logic levels required to perform the summation
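The partial-product reduction from radix-4 Booth encoding can be illustrated with a small software model. This is a minimal sketch in Python, not the Verilog from the project; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the Wallace tree is replaced by a plain sum.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list:
    """Recode a signed width-bit multiplier into width/2 digits in {-2, -1, 0, 1, 2}."""
    if width % 2 != 0:
        raise ValueError("width must be even")
    bits = multiplier & ((1 << width) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        # Booth digit for the overlapping bit triplet (i+1, i, i-1)
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int = 32) -> int:
    """Multiply by summing one partial product per Booth digit (weight 4**k)."""
    digits = booth_radix4_digits(multiplier, width)
    partial_products = [d * multiplicand * 4 ** k for k, d in enumerate(digits)]
    return sum(partial_products)


print(len(booth_radix4_digits(5, 32)))  # number of partial products for N = 32
print(booth_multiply(-10, 5))
```

For a 32-bit multiplier the recoding yields 16 digits, so 16 partial products instead of 32. In hardware each digit selects 0, ±M, or ±2M, which only needs a shift and an optional negation.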
10. 32-BIT PIPELINED CPU
• Case studies:
• Case 1: Modify the ALU multiplier into a multiplier with accumulator (MAC), which is useful for implementing DSP
• Case 2: Pipeline optimization
• MAC benefit: reduces the number of instructions needed to compute the final result of sum-of-products functions.
• Pipeline optimization is applied by inserting registers on the critical path (in this case, the MAC unit)
14. 32-BIT PIPELINED CPU
• Provided:
• Multiplier accumulator block diagram
• Simple CPU design written in verilog
• All required tools
• Implementation
• Construct the aforementioned unit in Verilog and modify the design to fit the new unit
• Insert a number of registers for pipelining
• Design functionality test
• Verify in simulation that the function F = (-10)*5 + (-60)*2 + (-60)*8 outputs the correct result
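The test function can be checked by hand: F = -50 - 120 - 480 = -650. The sketch below models a multiply-accumulate step in Python to show the same arithmetic. The 32-bit wraparound accumulator and the names `to_signed32`, `mac`, and `sum_of_products` are assumptions for illustration, not taken from the project's Verilog.

```python
def to_signed32(value: int) -> int:
    """Wrap an integer into the signed 32-bit two's-complement range."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step: acc + a * b (assumed 32-bit accumulator)."""
    return to_signed32(acc + a * b)


def sum_of_products(pairs) -> int:
    """Accumulate a list of (a, b) products, one MAC operation per pair."""
    acc = 0
    for a, b in pairs:
        acc = mac(acc, a, b)
    return acc


F = sum_of_products([(-10, 5), (-60, 2), (-60, 8)])
print(F)  # -650
```

With a MAC unit each product-and-add is a single operation, which is the instruction-count benefit noted in the case study.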
16. 32-BIT PIPELINED CPU
• Additional Analysis Result
• Finding the maximum frequency
• Expected maximum frequency of the design: 58 MHz
• Frequency vs. area vs. power consumption
17. MC68K-BASED MONITOR PROGRAM
• Instructor: Dr. Jafar Saniie
• Requirements/Specifications
• Construct a simple monitor program for the MC68000 processor that lets the user execute common memory and register access commands and basic exception handlers.
• Language
• 68000 assembly language
• Tools
• Easy68k Editor/Assembler/Simulator
20. MC68K-BASED MONITOR PROGRAM
• Includes a command interpreter that checks and validates user input.
• Monitor debugger commands:
• MEMD Memory display
• MEMS Memory Set
• SORT Memory Sort
• FILL Memory Fill
• MOVE Memory move
• MEMM Memory Modify
• FIND Block Memory Search
• REGM Register Modify
• REGD Register Display
• RUNS Execute program at specified location
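A table-driven command interpreter is one common way to validate input against a fixed command set like the one above. The sketch below is in Python purely for illustration (the project itself is 68000 assembly); the argument syntax, the case-insensitive matching, and the name `parse_command` are assumptions.

```python
# Command names and descriptions as listed on the slide.
COMMANDS = {
    "MEMD": "Memory display",
    "MEMS": "Memory set",
    "SORT": "Memory sort",
    "FILL": "Memory fill",
    "MOVE": "Memory move",
    "MEMM": "Memory modify",
    "FIND": "Block memory search",
    "REGM": "Register modify",
    "REGD": "Register display",
    "RUNS": "Execute program at specified location",
}


def parse_command(line: str):
    """Split an input line into (command, arguments) and reject unknown commands."""
    tokens = line.strip().split()
    if not tokens:
        raise ValueError("empty command line")
    name = tokens[0].upper()  # assumed: commands are matched case-insensitively
    if name not in COMMANDS:
        raise ValueError("unknown command: " + tokens[0])
    return name, tokens[1:]


name, args = parse_command("memd 1000 10F")
print(name, COMMANDS[name], args)
```

Keeping the command names in one table means adding a command is a single new entry, and the validation path stays the same for every command.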
21. MC68K-BASED MONITOR PROGRAM
• Monitor debugger exception-handling commands:
• TBUS Bus Error Exception
• TADD Address Error
• TILL Illegal Instruction Exception
• TPRI Privilege Violation
• TDIV Division by Zero
22. MC68K-BASED MONITOR PROGRAM
• Results (a subset of the 17 commands implemented)
[Screenshots: register display, memory display, command interpreter]
23. HIGH-PERFORMANCE PIPELINED MIPS PROCESSOR
• MIPS (Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA)
• Instructor: Prof. Jia Wang
• Requirements/Specifications
• Design a MIPS processor with pipeline, data forwarding, and hazard handling capabilities.
• Run RTL simulation to verify functionality
• Language
• VHDL
• Tools
• Modelsim PE 6.5
• MARS 3.6 MIPS Simulator
• Provided:
• Data memory unit design
• Testbench code
24. HIGH-PERFORMANCE PIPELINED MIPS PROCESSOR
• Data width: 32-bit
• 5-stage pipeline
• Instruction Fetch
• Instruction Decode
• Execute
• Memory Access
• Write-Back
• Main Modules
• Program counter (PC)
• Control Unit
• ALU Control Unit
• Register File
• ALU
• Instruction Memory
• Data Memory
• Hazard Detection Unit
• Forwarding Unit
• Branch Hazard
• Branch calculation occurs in the Instruction Decode stage
• A branch miss costs only one stall cycle
• Data Hazard
• Stall if data being written is going to be used by the next instruction
• Data Forwarding
• Result data is used immediately rather than being written back to the register file first
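The decisions made by a forwarding unit and a hazard detection unit can be sketched as two small functions. This follows the standard textbook scheme for a 5-stage MIPS pipeline and is an assumption about the logic, not the project's VHDL; the signal names (`ex_mem_rd`, `id_ex_rt`, and so on) are hypothetical.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose where an ALU operand comes from for source register src_reg."""
    # The most recent result (EX/MEM) takes priority over the older one (MEM/WB).
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Stall one cycle when the next instruction needs a value still being loaded."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


# add $3, $1, $2 followed by sub $5, $3, $4: operand $3 is forwarded from EX/MEM.
print(forward_select(3, True, 3, False, 0))
# lw $3, 0($1) followed by sub $5, $3, $4: the pipeline must stall once.
print(load_use_stall(True, 3, 3, 4))
```

Register 0 is excluded from forwarding because it is hardwired to zero in MIPS. A loaded value is not available until the Memory Access stage, which is why the load-use case stalls instead of forwarding directly.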
28. FALL 2009
• Hardware/Software Co-Design
• Simple Mesh-Like Network on Chip Design
• Simple Ring-Like Network on Chip Design
• Introduction to Computer Network
• Design of 2-story small office computer network
29. HARDWARE/SOFTWARE CO-DESIGN
• Projects:
• Network on chip prototype design with three nodes
• Simple Mesh-Like Network on Chip Design
30. NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES
• Instructor: Prof. Jia Wang
• Specifications
• Three-node NoC architecture in a partially connected mesh topology
• Three processing elements and three routers.
• Queue system: FIFO
• Language
• SystemC running on Visual C++
• Tools
• Microsoft Visual C++
31. NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES
• Three-node NoC System Diagram
• Third node function (called PE_dumpbox)
• It receives all packets that cannot be processed by the destination processing unit due to overloading in the network
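The overflow behavior described for PE_dumpbox can be sketched as a bounded FIFO that diverts packets once it is full. This is a simplified Python model, not the SystemC design; the class name `Router`, the buffer capacity of 2, and the packet strings are hypothetical.

```python
from collections import deque


class Router:
    """Router with a bounded FIFO; overflow packets are diverted to a dumpbox."""

    def __init__(self, capacity, dumpbox):
        self.buffer = deque()
        self.capacity = capacity  # assumed buffer depth, not given on the slide
        self.dumpbox = dumpbox    # list standing in for PE_dumpbox

    def receive(self, packet):
        """Queue the packet, or send it to the dumpbox if the buffer is full."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(packet)
            return True
        self.dumpbox.append(packet)
        return False

    def deliver(self):
        """Hand the oldest queued packet to the processing element (FIFO order)."""
        return self.buffer.popleft() if self.buffer else None


dumpbox = []
router = Router(capacity=2, dumpbox=dumpbox)
for packet in ["p0", "p1", "p2"]:
    router.receive(packet)
print(router.deliver(), dumpbox)
```

Diverting instead of dropping keeps every packet observable in simulation, which makes an overload easy to spot in the results.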
32. NETWORK ON CHIP PROTOTYPE DESIGN WITH THREE NODES
• Results
• Overload in the Router 1 network buffer at cycle 3
• The third processing unit, PE_dumpbox, receives the packet
33. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Specifications
• A simple mesh-like NoC architecture.
• Each router has one processing element (PE).
• Queue system: FIFO
• Size: 4-by-4 grid
• Language
• SystemC
• Tools
• Microsoft Visual C++
35. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Results
• Generated packets
• Result shows packets are delivered
36. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Results
• Delays occur because only one packet at a time is delivered to a processing element (PE)
37. MESH-LIKE NETWORK ON CHIP PROTOTYPE DESIGN
• Benefit and drawback:
• Benefit: packets reach the destination address in fewer hops, which reduces contention and increases the average bit rate.
• Drawback: the design is more complex and more wires are needed.
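The hop-count benefit can be made concrete by comparing average shortest-path hops. The slide does not name the baseline, so the sketch below assumes a comparison between a 4-by-4 mesh and a 16-node bidirectional ring with minimal-path routing; both function names are hypothetical.

```python
from itertools import product


def mesh_average_hops(rows, cols):
    """Average Manhattan distance between distinct routers in a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1) in nodes for (r2, c2) in nodes)
    pairs = len(nodes) * (len(nodes) - 1)  # ordered pairs of distinct nodes
    return total / pairs


def ring_average_hops(n):
    """Average shortest-path distance between distinct nodes on a bidirectional ring."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


print(mesh_average_hops(4, 4))
print(ring_average_hops(16))
```

Under these assumptions the mesh averages 8/3 hops against 64/15 for the ring, at the cost of more links per router.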
38. INTRODUCTION TO COMPUTER NETWORK
• Project:
• Design a prototype of a 2-story small office computer network capable of serving 20 users with three department LANs, four servers, and wireless Internet
• Language
• N/A
• Tools
• Microsoft Visio
39. SMALL OFFICE NETWORK DESIGN
• Proposed configurations
• IP address allocation
41. SMALL OFFICE NETWORK DESIGN
• Office Layout
[Figure: 1st-floor and 2nd-floor office layouts; colored arrows show how cables are managed]
42. SPRING 2010
• Advanced VLSI
• 4-bit 10T adder circuit with dual-Vt logic design
• High Performance VLSI IC System
• Single-ended 6T vs. standard 6T SRAM bitcell design comparison
• QR Factorization
• Implementing QR factorization algorithm in C
43. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Project:
• 4-bit 10T adder circuit with dual-Vt logic design
• Specifications
• Adder circuit is based on:
J. Lin, M. Sheu, and C. Ho. A Novel High-Speed and Energy Efficient 10-Transistor Full Adder Design. IEEE Trans. on Circuits and Systems, May 2007.
• Adder: cascaded ripple-carry adders
• Technology node: 45nm (FreePDK)
• Voltage: 1.1 V @ 25 MHz
• Performance measurements (delay and power consumption) for the 10T adder circuit using high-Vt (high threshold voltage), low-Vt, and dual-Vt transistors
• Tools
• Cadence Virtuoso Schematic Design
• Synopsys HSPICE Simulator
• Nanosim Simulator
44. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• High Vt vs. low Vt
• Full Adder Design (1-bit)
• Complementary and level restoring carry logic (CLRCL)
45. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Full Adder Design (1-bit) Critical Path
• Dual-Vt: low-Vt transistors are used on the critical path for speed, and high-Vt transistors are used elsewhere for low leakage
• The NMOS in the multiplexer and the PMOS in the inverter are low-Vt transistors
46. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Logic Equation
Sum = (A XNOR B)·Cin + (A XOR B)·Cin_bar
Cout = (A XOR B)·Cin + (A XNOR B)·A
• Design Components
• Inverter (left) and multiplexer (right)
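The two logic equations can be checked exhaustively against ordinary binary addition. The sketch below does that in Python and then cascades four 1-bit cells as a ripple-carry adder; it models only the Boolean behavior, not the transistor-level circuit, and the function names are hypothetical.

```python
from itertools import product


def full_adder_10t(a, b, cin):
    """Evaluate the slide's Sum and Cout equations for one bit position."""
    xor = a ^ b
    xnor = 1 - xor
    s = (xnor & cin) | (xor & (1 - cin))  # Sum = XNOR*Cin + XOR*Cin_bar
    cout = (xor & cin) | (xnor & a)       # Cout = XOR*Cin + XNOR*A
    return s, cout


def ripple_add4(x, y, cin=0):
    """Add two 4-bit values by cascading four 1-bit full adders."""
    result, carry = 0, cin
    for i in range(4):
        s, carry = full_adder_10t((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


# Exhaustive check of all eight input combinations.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder_10t(a, b, cin)
    assert 2 * cout + s == a + b + cin

print(ripple_add4(9, 8))
```

When A and B differ the carry-out follows Cin, and when they are equal it follows A, so both outputs reduce to 2-to-1 multiplexers selected by A XOR B.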
47. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• 1-bit Full Adder (consisting of multiplexers and inverters) and its symbol
• 4-bit Full Adder
48. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Methodology
• Combinations of input vectors are used to measure delay and power consumption
• Delay: switching delay between the least significant bit (bit 0) and the most significant bit (bit 3)
• Power: average and maximum power during simulation
• Results
• Delay (in seconds)
[Bar chart: delay for High-VT, Low-VT, and Dual-VT on High-to-Low and Low-to-High transitions; axis 0.00E+00 to 4.00E-10]
49. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Results
• Power consumption (in Watt)
[Bar charts: Average Power (axis 0.00E+00 to 6.00E-05) and Maximum Power (axis 0.00E+00 to 5.00E-04) for High-VT, Low-VT, and Dual-VT]
51. 4-BIT 10T ADDER CIRCUIT WITH DUAL-VT LOGIC DESIGN
• Issue
• Voltage degradation, especially with high-Vt transistors or at high frequency (> 125 MHz), because pass transistors deliver a weak 1 (NMOS) or a weak 0 (PMOS).
52. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Specifications
• Design from:
J. Singh, et al. Single Ended 6T SRAM with Isolated Read-Port for Low-Power Embedded Systems. IEEE. 2009
• Technology node: 45nm
• Use: high VT MOSFET
• Tools
• Cadence Virtuoso Schematic Design
• Synopsys HSPICE Simulator
53. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Background
• SRAM consumes the majority of die area
• Dynamic power: read and write activity
• Static power: retaining the stored logic value
• Benefits/Drawbacks of Single-Ended SRAM
• Faster read of logic '1'
• One bit line (no complementary bit-bar line), which reduces wiring
• Longer delay when writing '1' due to the weak-1 behavior of the NMOS pass transistor (but around 85% of writes are zero writes)
• Role of the isolated read port: prevents the bitcell content from being exposed during reads
• Considerably lower power dissipation and better read SNM
60. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Standard SRAM Design (using Cadence Virtuoso)
61. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Single-Ended SRAM Design
62. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Comparison Results
• Write Delay (0 to 0.5Vdd or 1 to 0.5Vdd)
"...around 85% of the instruction write bits are "0," and over 90% of the data write bits are "0."..." (quoted from [3])
[3] Y. Chang, F. Lai, C. Yang. Zero-Aware Asymmetric SRAM Cell for Reducing Cache Power in Writing Zero. IEEE Trans. on VLSI Systems, Vol. 12, No. 8, August 2004.
63. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Comparison Results
• Power Consumption Comparison
64. SINGLE-ENDED 6T VS. STANDARD 6T SRAM BITCELL DESIGN
• Noise Margin
65. QR MATRIX FACTORIZATION
• Purposes:
• Implementing QR factorization algorithm in C
• Specifications
• Written in C under RedHat OS
• QR Factorization
• A matrix decomposition method for solving linear problems or equations without inverting the left-hand-side matrix.
• Applicable to: m-by-n matrix A
• Decomposition: A = QR, where Q is an orthogonal matrix of size m-by-m and R is an upper triangular matrix
• The QR decomposition provides an alternative way of solving the system of equations Ax = b without inverting the matrix A. Because Q is orthogonal, Q^T Q = I, so Ax = b is equivalent to Rx = Q^T b, which is easier to solve since R is triangular.
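The solve path Rx = Q^T b can be sketched end to end. The slides do not say which factorization algorithm the C program uses, so this Python sketch assumes modified Gram-Schmidt on a square, full-rank matrix; the function names and the 2-by-2 example are illustrative only.

```python
import math


def qr_gram_schmidt(a):
    """Return (Q, R) for a square, full-rank matrix given as a list of rows."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]  # columns of A
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along each earlier orthonormal column.
            r[k][j] = sum(q_cols[k][i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * q_cols[k][i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        if r[j][j] == 0.0:
            raise ValueError("matrix is rank deficient")
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def solve_qr(a, b):
    """Solve Ax = b as Rx = Q^T b using back substitution."""
    q, r = qr_gram_schmidt(a)
    n = len(a)
    y = [sum(q[i][j] * b[i] for i in range(n)) for j in range(n)]  # Q^T b
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(r[i][j] * x[j] for j in range(i + 1, n))) / r[i][i]
    return x


print(solve_qr([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

For 2x + y = 3 and x + 3y = 5 the solution is x = 0.8, y = 1.4. Householder reflections are the usual choice when numerical robustness matters, or when the full m-by-m Q is needed for a non-square A.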
68. FALL 2010
• Electro Active Polymer Energy Harvesting
• Advanced Encryption Standard
69. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Given a bias voltage, the EAP circuitry converts mechanical energy to electrical energy when the material is stretched.
• EAP material: VHB 4905 tape and carbon grease
70. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Previous prototype:
• Charge management IC: TI's bq2000
• Li-ion battery 3V, 45mAh
• Application: TI's eZ430-F2013
• Boost converter to supply biasing voltage (5 V to 1.5 kV): EMCO Q15N-5
• Drawbacks
• High energy consumption
• EAP output power is too small to even turn on the battery charging circuit (which needs 20.6 mA)
• Solutions
• EAP material efficiency
• Higher capacitance
• Battery and circuit that can store small energy without requiring much energy to operate
• Apply a low biasing voltage to eliminate the use of a boost converter
71. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Simulation model using Simulink
• Circuit model parameters:
• EAP model parameters, input voltage (battery), and output capacitor Co
72. ELECTRO ACTIVE POLYMER ENERGY HARVESTING DESIGN
• Simulation model using Simulink
• EAP Model Parameters:
• Cidle, Cforced, force frequency f (how often the EAP is stretched)
• Absolute function to create an always-positive sine waveform from the original sine wave
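One way to read these parameters is as a capacitance that swings between Cidle and Cforced, following a rectified sine. The sketch below is an assumption about how the Simulink model combines them, not a transcription of it; the formula, the function name `eap_capacitance`, and the numeric values are hypothetical.

```python
import math


def eap_capacitance(t, c_idle, c_forced, f):
    """Capacitance at time t, assuming one stretch per 1/f seconds."""
    # abs() turns the sine into an always-positive waveform; using pi * f * t
    # makes the rectified wave peak f times per second.
    stretch = abs(math.sin(math.pi * f * t))
    return c_idle + (c_forced - c_idle) * stretch


# Hypothetical values: 1 nF idle, 3 nF fully stretched, 2 stretches per second.
c_idle, c_forced, f = 1e-9, 3e-9, 2.0
for t in (0.0, 0.125, 0.25, 0.5):
    print(t, eap_capacitance(t, c_idle, c_forced, f))
```

With these values the capacitance equals Cidle at t = 0, reaches Cforced at t = 0.25 s, and returns to Cidle at t = 0.5 s.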
78. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
• Variant AES with 512-bit and 1024-bit keys
• Area and power consumption comparison with 128-bit and 256-bit AES keys
• CMOS technology: 45nm
• Operating voltage: 1.1 V @ 100 MHz
• Verilog language
• Tools:
• Synthesis: Synopsys Design Compiler
• Simulation: Modelsim
• Find the relationship between key size and the implemented hardware's area and power consumption.
79. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
[Flowchart: AES encryption]
• Inputs: Cipher Key (into Key Expansion) and Plaintext
• Initial Round: AddRoundKey with RoundKey[0]
• Normal Round: SubBytes, ShiftRows, MixColumns, AddRoundKey with RoundKey[i]; i = i + 1; repeat while i < number of rounds
• Final Round: SubBytes, ShiftRows, AddRoundKey
• Output: Ciphered Text
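The control flow of the flowchart can be written out as a step sequence. This Python sketch only enumerates which transformation runs in which round; it does not implement the transformations themselves, and the function name is hypothetical. The round counts of 10 for AES-128 and 14 for AES-256 come from the AES standard; the slides do not give the round counts used for the 512-bit and 1024-bit variants.

```python
def aes_round_sequence(num_rounds):
    """List (step, round_key_index) pairs for one AES encryption."""
    steps = [("AddRoundKey", 0)]  # initial round uses RoundKey[0]
    i = 1
    while i < num_rounds:  # normal rounds
        steps += [("SubBytes", i), ("ShiftRows", i),
                  ("MixColumns", i), ("AddRoundKey", i)]
        i += 1
    # The final round skips MixColumns.
    steps += [("SubBytes", num_rounds), ("ShiftRows", num_rounds),
              ("AddRoundKey", num_rounds)]
    return steps


sequence = aes_round_sequence(10)  # AES-128
print(len(sequence))
print(sum(1 for name, _ in sequence if name == "AddRoundKey"))
```

AddRoundKey appears once more than the round count, which is why key expansion has to produce num_rounds + 1 round keys.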
81. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
• Block Diagram
[Block diagram labels: Plain_text, Cipher_key, Initial value (zero), Key Expansion Module, Mux (three), AddRoundKey, SubBytes and ShiftRows, MixColumns, AddRoundKey, Ciphered_text]
82. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
• Results
[Chart: area per design, axis 50000 to 100000]
Design    Area
AES128    58824.876
AES256    64188.036
AES512    76881.193
AES1024   96312.560
[Chart: dynamic, static, and total power in mW, axis 0 to 7, with a linear trendline on total power: y = 0.852x + 2.739, R² = 0.985]
Design    Power (dynamic) in mW   Power (static) in mW   Total Power in mW
AES128    3.3574                  0.2971603              3.6545603
AES256    3.9442                  0.3341722              4.2783722
AES512    5.0289                  0.409219               5.438119
AES1024   5.6042                  0.5053051              6.1095051
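The trendline on total power can be reproduced with an ordinary least-squares fit. The sketch below assumes the x values are the category positions 1 to 4 (AES128 to AES1024), which is how a spreadsheet trendline over categories is usually computed; the function name `linear_fit` is hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Power figures in mW, copied from the results table.
dynamic = [3.3574, 3.9442, 5.0289, 5.6042]
static = [0.2971603, 0.3341722, 0.409219, 0.5053051]
total = [d + s for d, s in zip(dynamic, static)]

slope, intercept, r_squared = linear_fit([1, 2, 3, 4], total)
print(round(slope, 3), round(intercept, 3))
```

Under that assumption the fit gives a slope of about 0.852 and an intercept of about 2.739, matching the trendline, with an R² close to the reported value. Each step in the table doubles the key size, so a line over category positions means total power grows roughly with the logarithm of key size over this range.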
83. ADVANCED ENCRYPTION STANDARD HARDWARE DESIGN
• Results: Area
[Bar chart: area per design, axis 50000 to 100000]
Design    Area
AES128    58824.87654
AES256    64188.0369
AES512    76881.19388
AES1024   96312.56036