Research presentation

Nirav A. Desai desai.nirav.12.09@gmail.com

MM-Wave Active Sensor: BPSK Spectrum can be seen in the Spectrum Analyzer
Design work done jointly by:
Nirav Desai, Munkyo Seo, Colin Sheldon, Mark Rodwell

I assisted in these mm-wave MIMO
experiments at UCSB

EE 5323: VLSI DESIGN 1 PROJECT
Course Instructor: Prof. Chris Kim
16-bit BRENT KUNG ADDER DESIGN in 45 nm CMOS
Nirav Desai
ID: 4280229
Department of Electrical and Computer Engineering
University of Minnesota

Brent Kung Adder Gate Level Diagram
1. Input Block with Pre Computation
[Figure: gate-level schematic of Input Adder Chains 1 to 4, ending in output buffers that drive capacitive loads; outputs Pi*Pi-1 and Gi + Pi*Gi-1]
Gate size labels in the figure: 1X, 1X, 1X, 1X, 1.224X, 1.562X, 1.23X, 1.274X, 1.097X, 1.553X, 1.108X, 1.034X, 3.883X, 3.043X, 2.943X, 10.1683X, 10.8506X, 36X, 40X

2. Intermediate Dot Product Blocks
[Figure: gate-level schematic of Intermediate Adder Chains 1 and 2 driving capacitive loads; outputs Pi*Pi-1 and Gi + Pi*Gi-1]
Gate size labels in the figure: 1X, 1X, 1X, 1X, 1.72X, 6X, 4X, 16X, 16X

3. Output Block for Post Computation
[Figure: gate-level schematic with inputs Ci-1 and Pi and output Si driving capacitive loads]
Gate size labels in the figure: 1.182X, 1.117X
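
The three blocks above (pre-computation of Pi and Gi, the dot-product tree that forms Gi + Pi*Gi-1 and Pi*Pi-1, and the post-computation Si = Pi XOR Ci-1) can be sketched as a small behavioral model. This is only an illustrative sketch, not the transistor-level design from the slides; the function name `brent_kung_add` and the way the carry-in is folded into bit 0 are assumptions made for this example.

```python
def brent_kung_add(a, b, width=16, carry_in=0):
    """Behavioral model of a Brent-Kung parallel-prefix adder (illustrative)."""
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # 1. Pre-computation: bitwise propagate (XOR) and generate (AND).
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # Working copies for the prefix tree. Folding carry_in into bit 0
    # (an assumption of this sketch) makes G[i] the carry out of bit i.
    G = g[:]
    P = p[:]
    G[0] = g[0] | (p[0] & carry_in)

    # 2a. Up-sweep: combine groups of 2, 4, 8, ... bits with the dot operator
    #     (G, P) o (G', P') = (G + P*G', P*P').
    d = 1
    while d < width:
        for i in range(2 * d - 1, width, 2 * d):
            G[i] = G[i] | (P[i] & G[i - d])
            P[i] = P[i] & P[i - d]
        d *= 2

    # 2b. Down-sweep: fill in the remaining prefixes.
    d = width // 4
    while d >= 1:
        for i in range(3 * d - 1, width, 2 * d):
            G[i] = G[i] | (P[i] & G[i - d])
            P[i] = P[i] & P[i - d]
        d //= 2

    # 3. Post-computation: Si = Pi XOR C(i-1).
    carries = [carry_in] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, G[width - 1]


total, carry_out = brent_kung_add(0xFFFF, 0x0001)
print(hex(total), carry_out)
```

The up-sweep and down-sweep together use about 2*log2(N) - 1 logic levels, which is the usual trade-off of the Brent-Kung structure: fewer cells and less wiring than denser prefix trees, at the cost of extra depth.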

Brent Kung Adder Transistor Level Design
XOR GATE

Inverter Design Optimization
• NMOS Width = 90 nm
• PMOS / NMOS Length = 50 nm
• Vdd = 1.1 V
• Current averaged over one period of 2 ns
• Optimal PMOS Width = 165 nm
• βinverter = 165/90 = 1.833
• Sizing for NAND, NOR and XOR changed appropriately

1. Input Block with Pre Computation

Input Adder Block Chain 1
Gate Number   1        2          3       4          5          Load
Gate Name     BUFFER   INVERTER   NOR     INVERTER   NAND       LOAD
g value       1.000    1.000      1.646   1.000      1.352      36.000
f value       3.540    3.540      2.151   3.540      2.618648
b value       2.893    2.400      1.000   1.000      1.000      1.000
S value       1.000    1.224      1.097   3.883      10.16831   36.000
Stage G = 2.225, Stage F = 36.000, Stage B = 6.943, Stage H = 556.248, Gate h = 3.540

Input Adder Block Chain 2
Gate Number   1        2          3       4          Load
Gate Name     BUFFER   INVERTER   XOR     NAND       LOAD
g value       1.000    1.000      1.893   1.295      13.748
f value       4.518    4.518      2.386   3.488
b value       2.893    2.400      1.780   1.000      1.000
S value       1.000    1.562      1.553   3.043      13.748
Stage G = 2.451, Stage F = 13.748, Stage B = 12.359, Stage H = 416.510, Gate h = 4.518

Input Adder Block Chain 3
Gate Number   1        2          3       Load
Gate Name     BUFFER   INVERTER   NOR     LOAD
g value       1.000    1.000      1.646   3.941
f value       3.558    3.558      2.162
b value       2.893    2.400      1.000
S value       1.000    1.230      1.108   3.941
Stage G = 1.646, Stage F = 3.941, Stage B = 6.943, Stage H = 45.038, Gate h = 3.558

Input Adder Block Chain 4
Gate Number   1        2          3       4          5          Load
Gate Name     BUFFER   INVERTER   XOR     NAND       INVERTER   LOAD
g value       1.000    1.000      1.893   1.295      1.000      40.000
f value       3.686    3.686      1.947   2.847      3.686447
b value       2.893    2.400      1.000   1.000      1.000      1.000
S value       1.000    1.274      1.034   2.943      10.85056   40.000
Stage G = 2.451, Stage F = 40.000, Stage B = 6.943, Stage H = 680.832, Gate h = 3.686

Logical Effort Design for Signal Chains labeled in previous slide #2
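
The sizing rows in these tables can be reproduced with a short logical-effort calculation. The sketch below assumes the conventions that the numbers appear to follow: Stage H is the product of the logical efforts, branching efforts and the load ratio, Gate h is its N-th root, and each S value is the gate's input capacitance divided by its logical effort. The function name `size_chain` is hypothetical.

```python
def size_chain(g, b, load):
    """Size a gate chain for equal stage effort.

    g    -- logical effort of each gate
    b    -- branching effort at each gate output
    load -- load capacitance relative to the first gate's input capacitance
    Returns (stage effort h, list of sizes S, capacitance seen at the load).
    """
    path_g = 1.0
    path_b = 1.0
    for gi, bi in zip(g, b):
        path_g *= gi
        path_b *= bi
    path_effort = path_g * path_b * load       # "Stage H" in the tables
    h = path_effort ** (1.0 / len(g))          # "Gate h": equal effort per stage

    sizes = []
    cin = 1.0                                  # first gate is unit sized
    for gi, bi in zip(g, b):
        sizes.append(cin / gi)                 # size relative to the gate's 1X template
        cin = cin * h / (gi * bi)              # input capacitance of the next stage
    return h, sizes, cin


# Input Adder Block Chain 1: BUFFER, INVERTER, NOR, INVERTER, NAND driving 36X.
h, sizes, c_load = size_chain(
    g=[1.000, 1.000, 1.646, 1.000, 1.352],
    b=[2.893, 2.400, 1.000, 1.000, 1.000],
    load=36.0,
)
print(round(h, 3), [round(s, 3) for s in sizes], round(c_load, 3))
```

With the Chain 1 inputs this gives a stage effort close to 3.540 and sizes close to 1.000, 1.224, 1.097, 3.883 and 10.168, matching the S row of the table to the printed precision.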

2. Intermediate Dot Product Blocks
Logical Effort Design for Signal Chains labeled in previous slide #3

Intermediate Adder Block Chain 1
Gate Number   1          2       Load
Gate Name     INVERTER   NAND    LOAD
g value       1.000      1.352   1.000
f value       2.848      2.107   2.848
b value       1.000      1.000   1.000
S value       1.000      2.107   6.000
Stage G = 1.352, Stage F = 6.000, Stage B = 1.000, Stage H = 8.112, Gate h = 2.848

Intermediate Adder Block Chain 2
Gate Number   1        2       Load
Gate Name     BUFFER   NAND    LOAD
g value       1.000    1.352   2.848
f value       2.775    2.053
b value       2.000    1.000
S value       1.000    1.026
Stage G = 1.352, Stage F = 2.848, Stage B = 2.000, Stage H = 7.701, Gate h = 2.775

Brent Kung Adder Simulated Performance

Simulations with maximally sized 1 stage buffers as determined by Logical Effort Design of individual chains:
Voltage (V)   Delay Max-C14 (ns)   Power Max (mW)   Power-Delay Product (xE-12 J)
1.1           0.359                6.73             2.41
0.9           0.503                2.95             1.483
0.7           0.937                0.924            0.865

Simulations with minimally sized 1 stage buffers:
Voltage (V)   Delay Max-C14 (ns)   Power Max (mW)   Power-Delay Product (xE-12 J)
1.1           0.403                5.186            2.089
0.9           0.569                2.277            1.295
0.7           1.069                0.692            0.739

Without parasitic extraction and interconnect parasitics, buffering doesn't improve performance significantly.
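
The power-delay product column follows directly from the other two columns, since nanoseconds times milliwatts gives picojoules. A minimal check using only the numbers listed above (the helper name `pdp_pj` and the dictionary layout are choices made for this sketch):

```python
def pdp_pj(delay_ns, power_mw):
    """Power-delay product in picojoules (ns * mW = 1e-9 s * 1e-3 W = 1e-12 J)."""
    return delay_ns * power_mw


# (Vdd in V, worst-case delay in ns, maximum power in mW), taken from the slides.
runs = {
    "maximally sized buffers": [(1.1, 0.359, 6.73), (0.9, 0.503, 2.95), (0.7, 0.937, 0.924)],
    "minimally sized buffers": [(1.1, 0.403, 5.186), (0.9, 0.569, 2.277), (0.7, 1.069, 0.692)],
}

for name, rows in runs.items():
    for vdd, delay_ns, power_mw in rows:
        print(f"{name}: Vdd={vdd} V, PDP={pdp_pj(delay_ns, power_mw):.3f} pJ")
```

At every supply voltage the minimally sized buffers are slower by roughly 12 to 14 percent but have the lower power-delay product, which is consistent with the remark that buffering does not help much before parasitics are extracted.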

Brent Kung Adder Worst Case Delay
Input Pattern: A: FFFF B: 0000 -> 0001
Dotted Lines show Carry Bits 15 and 14
Carry Bit 15 Carry Bit 14
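
Why this input pattern is the worst case can be seen from a simple bit-level carry model: with A = FFFF every bit propagates, so changing B from 0000 to 0001 flips every carry, including carry bits 14 and 15. The sketch below is a plain ripple model used only to list the carries; the function name `carry_bits` is hypothetical and it says nothing about the timing of the actual prefix tree.

```python
def carry_bits(a, b, width=16):
    """Return the carry out of every bit position for a + b (carry-in = 0)."""
    carries = []
    c = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        c = (ai & bi) | ((ai ^ bi) & c)   # generate, or propagate the incoming carry
        carries.append(c)
    return carries


before = carry_bits(0xFFFF, 0x0000)
after = carry_bits(0xFFFF, 0x0001)
toggled = [i for i in range(16) if before[i] != after[i]]
print(toggled)
```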

Brent Kung Adder Layout
Input Block with Pre Computation
Input Inverters for Bit 0 and Bit 1
Output Buffers
PEX waveforms show larger size may be needed
XOR
NAND
10X

XOR 1.553X

NAND 10.57X: layout with interdigitated fingers to reduce parasitics

Intermediate Dot Product Generator
Output Buffers: PEX waveforms show larger size may be necessary here

Output Stage with Buffers

Full Layout: 49.5um X 48.6um

Future Design Modifications
• The design uses large buffers at the output of every stage to drive large capacitances.
• The buffers are not needed at nodes with low fanouts and can be eliminated.
• The buffers at input nodes currently increase power consumption and add to the delay.
• Thus the overall performance can be improved with fewer buffers.

References:
Course slides from Prof. Kia Bazargan's course on VLSI
A Taxonomy of Parallel Prefix Networks (David Harris), reference paper on course website
Digital Integrated Circuits by Jan Rabaey

SRAM DESIGN PROJECT PHASE 2
Nirav Desai
4280229
VLSI DESIGN 2: Prof. Kia Bazargan
Dept. of ECE
College of Science and Engineering
University of Minnesota, Twin Cities

SRAM CELL READ AND WRITE MARGIN FROM BUTTERFLY CURVE
• NMOS inverter = 110 nm, PMOS inverter = 220 nm, NMOS access = 90 nm
• NMOSinv/NMOSaccess = 1.2, PMOSinv/NMOSaccess = 2.4
• Cbitline = 0.747 fF for 512 cell array (interconnect parasitics from ASU PTM website)

SRAM CELL READ AND WRITE MARGIN FROM BUTTERFLY CURVE
• NMOS inverter = 150 nm, PMOS inverter = 555 nm, NMOS access = 180 nm
• NMOSinv/NMOSaccess = 1.2, PMOSinv/NMOSaccess = 3, Cbitline = 0.747 fF
• Curve shows SRAM cell is close to write failure.
• Bitline precharge to less than 1.1 V could be explored to increase SNM.

Simulation Setup
• M0,M1,M3,M4 form the cross coupled inverter pair
• M5,M6 are access transistors
• C1, C2 is the bitline capacitance
• M7 is the precharge switch for bitline ( bit ) - V3 precharges the bitline to 0.8V
• V6 precharges bitbar and writes a 0 to the cell
[Figure: simulation schematic with probed signals V(write), V(ic), V(word), V(qbar), V(q), V(bitbar), V(bit)]

Timing Waveforms for Characterization
V(write) – Applied to source of M7 (precharge switch)
V(word) – Wordline Voltage
V(qbar)
V(q)
V(ic) – Enables the precharge switch M7
V(bitbar)
V(bit)
• V(write) precharges Cbit to 0.8V via M7.
• V(word) disables access transistors M5 and M6 during precharge.
• V(qbar) and V(q) are used to generate the butterfly curves.
• V(ic) enables M7 during precharge. It could be implemented as NOT(V(word)).
• V(bitbar) precharges to 0.8V, shows charge pumping when M7 turns off and follows V(qbar) when the wordline is enabled.
• V(bit) follows V(q) after the wordline is enabled.
• V(bit) precharged to Vdd by V6.

PASS TRANSISTOR BASED TREE DESIGN
1:8 Row Decoder Tree
Similar Tree Decoder for 16 LSB Bits
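
The selection behavior of a 1:8 tree decoder can be sketched at the logic level as follows. This is a behavioral illustration only: each tree level steers the input to one of two branches according to one address bit. Modeling unselected outputs as 0 is an assumption of this sketch (in a pass-transistor tree those nodes are not actively driven), and the name `tree_decode` is hypothetical.

```python
def tree_decode(address, levels=3, signal=1):
    """Route `signal` through a binary tree of switches to one of 2**levels outputs."""
    outputs = [signal]
    for level in reversed(range(levels)):        # most significant address bit first
        bit = (address >> level) & 1
        next_outputs = []
        for node in outputs:
            next_outputs.append(node if bit == 0 else 0)   # branch taken when bit is 0
            next_outputs.append(node if bit == 1 else 0)   # branch taken when bit is 1
        outputs = next_outputs
    return outputs


print(tree_decode(5))
```

Each selected path passes through `levels` series switches, which is why the series resistance, and therefore the transistor width, matters for the decoder delay.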

TREE DECODER DESIGN

PASS TRANSISTOR BASED TREE DESIGN
[Figure: pass gate with terminals IN and OUT and clock inputs CK; W/L = 880/50]
Identical sizing for NMOS and PMOS to minimize charge injection effects
• Delay drops by ~40ps/2 for every doubling of transistor widths
• Delay drop saturates around 1000 nm, at 89 ps
• Used W/L of 880/50 for final tree

TREE DECODER TIMING DIAGRAMS
The following waveforms were applied to the row and column selection inputs of the tree decoder

It takes one cycle to initialize the tree decoder, after which we get clean pulses for each row output
LSB pulse is wider than MSB pulse in bottom figure to allow the tree to clear present state before next

The top waveforms show the matrix point output where the row and column select inputs are high
The output node discharges when the input goes low

READ WRITE CIRCUIT
( Design by Bong Jin )
Sense Amplifier Write Driver
Precharge Circuit

READ WRITE CIRCUIT TEST SETUP
Bitline Capacitance estimate from ASU PTM Website
Cbit estimate for 512 rows
NMOS Switches to allow read without disabling write circuit
Single SRAM Cell for
simulations

READ / WRITE TIMING WAVEFORMS
Precharge Pulse ( Active Low )
Data Meant to be written to cell
Write Enable Pulse
Read Enable Pulse
Output of Write Buffer
Disable output buffer (tristate logic)
Bitline
Bitline Bar
Data Output
Data Out Bar

SRAM Cell Layout

2X2 SRAM Array Layout
VDD
GND
GND
WORD 1
WORD 0
B0 B0BAR B1 B1BAR
This unit can be replicated in all directions without any changes. LVS check remaining
Array Size = 3.7975umX2.4725um

References
Digital Integrated Circuits
Jan Rabaey, Anantha Chandrakasan, Borivoje Nikolic
( SRAM Cell Design, Decoders, Read Write Circuits )
CMOS VLSI Design by Weste and Harris
( Butterfly Curves )
CMOS Circuit Design, Layout and Simulation
Baker, Li, Boyce (Decoder Design)
Course slides of Prof. Kia Bazargan
( Precharge Techniques, Decoders, SRAM Cell Design )

System Diagram for developing LMS Algorithm for Channel Estimation ( H(z) )
Errors e1 and e2 (e2 being the quantized error) could have the same convergence if the channel model H(z) is adapted using an LMS model.
The next few slides show regular LMS and modified LMS error convergence.
Adaptive DSP Course by Prof. Keshab Parhi

Error Convergence for regular LMS takes more time than the modified LMS

Modified LMS adapts all tap weights using different errors, computed using as many filter output estimates as the filter order. The assumption is that the optimum gradient direction for each tap weight is different and is given by the corresponding error.
Lattice predictors would be a more efficient way to do this than LMS, since each stage of a predictor is optimum for that order, unlike modified LMS where each tap weight is adapted in a suboptimal manner.
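
For reference, the regular LMS update that the modified scheme is compared against can be sketched as below. This is the textbook single-error LMS applied to a noise-free FIR channel, not the modified LMS from the slides; the channel taps, step size and function names (`fir_filter`, `lms_identify`) are assumptions chosen for the example.

```python
import random


def fir_filter(h, x):
    """Output of an FIR channel with taps h driven by input x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def lms_identify(x, d, num_taps, mu):
    """Estimate channel taps from input x and desired (channel output) signal d."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        # Tap-delay line: most recent sample first, zeros before the start.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))       # filter output estimate
        e = d[n] - y                                   # single error for all taps
        w = [wk + mu * e * uk for wk, uk in zip(w, u)] # w <- w + mu * e * u
        errors.append(e)
    return w, errors


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
channel = [0.5, -0.3, 0.1]                             # hypothetical H(z) taps
d = fir_filter(channel, x)
w, errors = lms_identify(x, d, num_taps=3, mu=0.05)
print([round(wk, 4) for wk in w])
```

Here every tap is corrected with the same error e, which is the point of contrast with the modified scheme, where each tap weight is driven by its own error term.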

EEG Spectral Estimates for Pre-Ictal, Ictal and Post-Ictal Signal Sequences

Spectral Estimation for a low pass filtered impulse sequence using different techniques

Correlograms provide best Spectral Estimates for Low Pass Filtered Impulse Trains
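
A correlogram estimate can be sketched as the Fourier transform of the sample autocorrelation. The version below uses the biased autocorrelation up to a maximum lag with no lag window, which is one common textbook form; the test signal, lag count and function name `correlogram` are assumptions for illustration and are not taken from the experiments on the slides.

```python
import math


def correlogram(x, max_lag, num_freqs):
    """Return [(frequency in cycles/sample, PSD estimate)] from 0 to 0.5."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    # Biased sample autocorrelation r[k] for lags 0..max_lag.
    r = [sum(xc[i] * xc[i + k] for i in range(n - k)) / n
         for k in range(max_lag + 1)]
    psd = []
    for m in range(num_freqs):
        f = 0.5 * m / (num_freqs - 1)
        # r[-k] = r[k] for a real signal, so the transform reduces to a cosine sum.
        s = r[0] + 2.0 * sum(r[k] * math.cos(2.0 * math.pi * f * k)
                             for k in range(1, max_lag + 1))
        psd.append((f, s))
    return psd


signal = [math.cos(2.0 * math.pi * 0.125 * i) for i in range(256)]
estimate = correlogram(signal, max_lag=32, num_freqs=65)
peak_frequency = max(estimate, key=lambda point: point[1])[0]
print(peak_frequency)
```

Limiting the maximum lag trades frequency resolution for lower variance, which is the main tuning knob of this estimator.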

EE 5364 / CS 5204:
Advanced Computer Architecture
Final Course Project on
Design of a Branch Predictor
Prepared by:
Nirav Desai 4280229
Amanda Skinner 3749048
Course Instructor: Prof. Pen-Chung Yew
Department of ECE
University of Minnesota, Twin Cities

Nirav Desai 4280229 ECE
Amanda Skinner 3749048 CS
Why Branch Predictor?
• Branch Predictors improve the flow of
the instruction pipeline
• As Branch predictor accuracy increases,
cache misses decrease, or improve, for
both data and instruction caches

Why Branch Predictor?

Why Prefetching?
• As branch predictor accuracy increases, cache misses go down
• Prefetching and increasing cache size decrease cache misses
[Figure: miss rate for the Mesa benchmark; both the L1-Data and L2 cache associativities were changed] [4]

• LA-PC runs ahead of PC and keeps track of load and store instructions
• RPT keeps track of previous reference addresses and strides for load
and store instructions
• L2 Cache prefetching can be done by storing spill over data and
instructions from L1 Cache blocks.
• INTEL CORE 2 Duo uses RPT for L1 Cache Prefetching and Loop
Counter Local Branch Predictor
Reference Prediction Table[1]

Design of Branch Predictor
• Loop counter would give high accuracy on matrix multiplication
• Track all registers for the loop counter, since different interleaved threads may use different registers
• A loop counter error would imply dynamic update of registers based on non-local values
• Tag registers giving repeated conditional branch errors in the Branch Decision Table
• Use the O-GEHL predictor for all tagged branches
• Using the loop counter and duplicate ALU will allow indexing long histories with limited geometric length

Branch Decision Table
Columns (entered by):
Branch Address (LA-PC)
Predicted Direction (Loop Counter or O-GEHL)
Predicted Branch Target (Duplicate ALU)
Actual Direction (PC)
Actual Branch Target (PC)
Counters Used C(i)(j) (O-GEHL)
Tag (O-GEHL)
If prediction != actual decision:
Prediction computed by Loop Counter?
  Yes: incorrect duplicate register values.
  Re-initialize the duplicate register stack.
  Set LA-PC to PC.
  After 2 successive errors make an entry in O-GEHL, and tag the branch address in the Branch Decision Table to be used with O-GEHL.
Prediction computed by O-GEHL?
  Yes: run the update equation on the counters listed in the table.
  Set LA-PC to PC.

Loop Counter Branch Predictor
The loop counter looks at only the conditional branches: Op-Code = 4 (beq) OR Op-Code = 5 (bne). Can be extended to bgtz, blez.
Instruction fields: Op-Code: bits 31:26, rs: bits 25:21, rt: bits 20:16
Duplicate Register Flag == 0 ?
  Yes: first conditional branch. Copy Register Stack to Duplicate Register Stack (equivalent to initializing the duplicate register stack).
  No: Duplicate Register Stack initialized.
Set Register Flag for rs and rt = 1. These registers will be tracked by the Duplicate ALU.
Do addition and subtraction for all instructions having rs and rt with register flags set to 1.
Proceed to Branch Prediction Computation:
  Op code == 4 ? Check rs == rt. Op code == 5 ? Check rs != rt.
  Yes (branch taken):
    Copy offset from bits 15 to 0
    Sign extend offset to bit 31 (total 32 bits)
    Left shift by 2 (to get word address)
    Add to PC+4 to get branch target address
  No (branch not taken): increment LA-PC by 4
O-GEHL Branch Predictor[2]
[Figure: counter tables C(1)(1..2), C(2)(1..4), C(3)(1..9), ..., C(10)(1..266)]
History lengths go in geometric progression given by L(i) = α^(i-1) × L(1) + constant
Best series found from experiments: 2, 4, 9, 12, 18, 31, 54, 114, 145, 266
Dynamic history length fitting with variable α also possible.
Sum = C(1)(j) + C(2)(k) + … + C(10)(l)
• j, k, l, ... are incremented on every unconditional branch.
• j increments are modulo 2, k increments are modulo 4, ..., l increments are modulo 266.
• Each C(i)(j) is a 4 bit saturating counter that counts -8 to 7.
• Counter Update given by:
if(p!=out)
if(branch==taken) c(i)(j)++
if(branch!=taken) c(i)(j)--
• Dynamic Threshold (θ) Fitting possible
• Threshold(θ) by default is 0.
Sum > θ then p = taken
Sum < θ then p = not taken
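
The sum, threshold and update rules listed on this slide can be put together in a small model. It follows the simplified description given here (one counter selected per table by a pointer that advances modulo the table length), not the hashed indexing of Seznec's published O-GEHL predictor. Treating Sum == θ as "taken" is an assumption, since the slide leaves that case open, and the class name `SlideGehl` is hypothetical.

```python
HISTORY_LENGTHS = [2, 4, 9, 12, 18, 31, 54, 114, 145, 266]


class SlideGehl:
    """Simplified summing predictor with 4-bit saturating counters (-8..7)."""

    def __init__(self, lengths=HISTORY_LENGTHS, theta=0):
        self.tables = [[0] * n for n in lengths]   # table i holds L(i) counters
        self.idx = [0] * len(lengths)              # pointers j, k, ..., l
        self.theta = theta

    def predict(self):
        total = sum(t[j] for t, j in zip(self.tables, self.idx))
        return total >= self.theta                 # assumption: a tie predicts taken

    def update(self, taken):
        # Counters change only when the prediction was wrong: if (p != out).
        if self.predict() != taken:
            for t, j in zip(self.tables, self.idx):
                if taken:
                    t[j] = min(7, t[j] + 1)
                else:
                    t[j] = max(-8, t[j] - 1)

    def advance(self):
        # Each pointer increments modulo its own table length.
        self.idx = [(j + 1) % len(t) for t, j in zip(self.tables, self.idx)]


predictor = SlideGehl()
predictor.update(False)        # first guess (taken) was wrong, so counters decrement
print(predictor.predict())
```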

Duplicate ALU ( for MIPS )[3]
[Figure: LA-PC address and instruction feed a duplicate instruction queue and a decode unit that compares the Op-Code; field labels Reg 3, Reg 2, Reg 1, Op Code; bit ranges 31-26, 25-21, 20-16, 15-11]
Op-Code == 4 OR 5: (beq, bne) Use Loop Counter
Op-Code == 2 OR 3: (jump, jal) Always take
Op-Code == 0 & FUNCT==8 OR 9: (jr, jalr) Always take
Branch Target for Jump: 32bits: bits 31:28: 4 MSB bits of current PC+4
bits 27:2: Jump Target from instruction
bits 1:0 : 00 ( Word Addresses )
Branch Target for Branch: 32 bits: Current PC + 4 + bits 15:0 left shifted by 2 to give word addresses
Compare Register Flags for reg1, reg2, reg3
If register flags set, do the computation for
Op-Code: 0 bits(5:0) 32: add r1, r2, r3
Op-Code: 0 bits(5:0) 34: sub r1, r2, r3
Op-Code: 0 bits(5:0) 33: addu r1, r2, r3
Op-Code: 0 bits(5:0) 35: subu r1, r2, r3
Op-Code: 8: addi r1, constant
Op-Code: 9: addiu r1, constant
• Set LA-PC Busy bit on instruction read
• When LA-PC updated by branch predictors,
busy bit reset
• For arithmetic, reset busy bit after 2 cycles
• Instruction read when busy bit reset
• LA-PC different from that used in RPT
This branch predictor can be used on Multi Threaded CPUs
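
The opcode comparison and the jump target rule described on this slide can be sketched as follows. The opcode and funct values are the ones listed above; the category labels returned by `classify` and both function names are hypothetical, introduced only for this example.

```python
def classify(instr):
    """Decide how the look-ahead logic would treat a 32-bit MIPS instruction."""
    opcode = (instr >> 26) & 0x3F
    funct = instr & 0x3F
    if opcode in (4, 5):                               # beq, bne
        return "loop_counter"
    if opcode in (2, 3):                               # j, jal
        return "always_taken"
    if opcode == 0 and funct in (8, 9):                # jr, jalr
        return "always_taken"
    if opcode == 0 and funct in (32, 33, 34, 35):      # add, addu, sub, subu
        return "duplicate_alu"
    if opcode in (8, 9):                               # addi, addiu
        return "duplicate_alu"
    return "other"


def jump_target(pc, instr):
    """Bits 31:28 from PC+4, bits 27:2 from the instruction, bits 1:0 = 00."""
    return ((pc + 4) & 0xF0000000) | ((instr & 0x03FFFFFF) << 2)


jump = (2 << 26) | 0x0100000        # j to word index 0x100000
print(classify(jump), hex(jump_target(0x00400000, jump)))
```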

Test results on O-GEHL Branch
Predictor[5]

References
1. An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty
Jean-Loup Baer, Tien-Fu Chen
Department of Computer Science and Engineering,
University of Washington, Seattle, WA 98195
Supercomputing '91 Proceedings of the 1991 ACM/IEEE Conference on Supercomputing
2. The O-GEHL Branch Predictor
Andre Seznec
The 1st JILP Championship Branch Prediction Competition CBP1 (2004)
Available from www.jilp.org
3. Computer Organisation and Design
The Hardware-Software Interface
David Patterson and John Hennessy
4. http://en.wikipedia.org/wiki/CPU_cache
5. Analysis of the Optimized GEHL Predictor
Andre Seznec
Available from: http://www.irisa.fr/caps/people/seznec/ISCA05.pdf

Research Ideas I am working on right now

Strained Silicon on SiGe Solar Cell
• Requires Chemical Vapor Deposition or MBE techniques for fabrication
• Completed a short term course on Semiconductor Technology and Manufacturing at IIT Bombay
to learn about these techniques in November 2012.
• Tandem Solar Cell design gives a wide band of absorbable frequencies with different band gaps.
• Optimal thickness at quarter wavelength will give maximum absorption at designed frequency
• Back plate metal contacts and top plate fingered contacts
• Economically viable for charging battery packs in electric vehicles and for replacing LPG cooking
gas cylinders.
• Long term viability for power generation feasible due to low operating costs and low distribution
costs in a distributed model.
• Reference: N. Usami, T. Takahashi, K. Fujiwara, T. Ujihara, G. Sazaki, Y. Murakami, K. Nakajima (Inst. for Mater. Res., Tohoku Univ., Sendai, Japan), "Si/multicrystalline-SiGe heterostructure as a candidate for solar cells with high conversion efficiency," Conference Record of the Twenty-Ninth IEEE Photovoltaic Specialists Conference, 19-24 May 2002, pp. 247-249.

Rake Receiver with MDS Codes
• Rake receivers could be used to identify strongest multi path component from a received signal.
• This could be achieved by correlating the received signal with itself over different delays and
finding the strongest delay component.
• This does not involve maximal ratio combining.
• It could be combined with MDS codes for wireless communications where given any d bits
corrupted by channel noise or multi path effects, the signal could still be recovered uniquely.
• Reference: Lectures of Prof. Cutter on iTunesU under the course on Digital Communications 2
taught at MIT.
• Reference: W-CDMA Rake Receiver implementation in DSP: EE Times: Link:
http://www.eetimes.com/electronics-news/4139933/W-CDMA-RAKE-Receiver-Comes-to-Life-in-DSP
• Reference: A Rake Receiver for Maximal Ratio Combining without Channel Estimation for UWB
Communications: http://digitalcommons.unf.edu/cgi/viewcontent.cgi?article=1044&context=ojii_volumes

Class S RF Power Amplifiers on
GaN HEMTs
• Class S RF Power Amplifiers with fully differential H-Bridge topology could give a theoretical
100% efficiency.
• GaN HEMTs give the best high frequency switching characteristics.
• The 2 features could be combined to give a high efficiency RF power amplifier topology.
• Under-graduate project on Class S Audio Amplifier Design
• Reference: Ph.D. Dissertation of Stephan Maroldt, University of Freiburg
• Reference: Device Evaluation for Current Mode Class D RF Power Amplifiers with high output
power and efficiency. Thesis of Thomas Dellsperger
http://www.ece.ucsb.edu/rad/pubs/master/tdellsperger_2003.pdf
• Reference: High linearity and high efficiency Class B RF Power Amplifiers in GaN HEMTs
http://www.ece.ucsb.edu/faculty/rodwell/publications_and_presentations/publications/239.pdf

Microprocessor Design
• The attached slides describe the design of a 16 bit Brent Kung Adder and 1024x16 asynchronous SRAM in 45 nm CMOS along with the design of a branch predictor and cache prefetch unit for a MIPS microprocessor.
• These design ideas could be combined with other ideas for pipeline design, ALU design and interconnect circuit design to give a full physical layer design of a MIPS microprocessor in 45 nm CMOS.
• Various power reduction and clock gating techniques could be applied at a higher level of the
hierarchy.
• Clock gating could be done at a coarse level like not clocking a core which is not being used or
at a fine level where the modules not being used are not clocked. In a deeply pipelined design,
the divider need not be clocked if only multiply and accumulate operations are being carried out.
• References for clock gating: Clock Tree Power Optimization based on RTL clock gating:
http://dl.acm.org/citation.cfm?id=775989
• Attended tutorials at the VLSI Design Conference in 2013 to learn more about these techniques.
• Clock gating could be done using higher power FETs.

mm-wave MIMO OFDM
• mm-wave MIMO OFDM could be used for wireless backhaul networks due to its high capacity
• mm-wave MIMO systems could be extended to 2x2, 4x4, 8x8, etc topologies to exploit spatial
diversity and get higher data rate.
• Reference: C. Sheldon, Munkyo Seo, E. Torkildson, M. Rodwell, U. Madhow (Dept. of Electr. & Comput. Eng., Univ. of California, Santa Barbara, CA, USA), "4 channel spatial multiplexing over a mm-wave line of sight link," IEEE MTT-S International Microwave Symposium Digest (MTT '09), 7-12 June 2009, pp. 389-392.

Routing algorithm to reduce
congestion
• The routing algorithm to reduce congestion could be based on the idea of sparsity.
• High congestion nodes could be dropped from the network map till congestion on the node
drops.
• The underlying packet streams would be using a flow control based routing protocol.
• Each node would store a map of the network which would be updated periodically using ping
back messages.
• Could be applied to packet switched networks, traffic control and wireless sensor networks.
• Reference: Flow control based routers developed by Anagran.
• Reference: Ad hoc On-Demand Distance Vector (AODV) style algorithms treat packets as flows by leaving backward pointers for subsequent packets in the chain at each router node.
http://www.cs.ucsb.edu/~ebelding/txt/wmcsa99.pdf

Photonic Computers
• These could use multiplexer based logic gates.
• Photonic multiplexers have been widely researched and developed for optical communications.
• Phase detectors could be used to identify the phase and thus the value of the stored signal.
• These would use electronic charge storage and high speed electro-optic conversion.
• Reference: Prior research on this has been carried out at UCSB.
