Final Year IEEE Project 2013-2014 - VLSI Project Title and Abstract

43,263 views

Published on

Final Year IEEE Project 2013-2014 - VLSI Project Title and Abstract

Published in: Technology, Business
3 Comments
9 Likes
Statistics
Notes
No Downloads
Views
Total views
43,263
On SlideShare
0
From Embeds
0
Number of Embeds
12
Actions
Shares
0
Downloads
490
Comments
3
Likes
9
Embeds 0
No embeds

No notes for slide

Final Year IEEE Project 2013-2014 - VLSI Project Title and Abstract

  1. 1. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com 13 Years of Experience Automated Services 24/7 Help Desk Support Experience & Expertise Developers Advanced Technologies & Tools Legitimate Member of all Journals Having 1,50,000 Successive records in all Languages More than 12 Branches in Tamilnadu, Kerala & Karnataka. Ticketing & Appointment Systems. Individual Care for every Student. Around 250 Developers & 20 Researchers
  2. 2. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com 227-230 Church Road, Anna Nagar, Madurai – 625020. 0452-4390702, 4392702, + 91-9944793398. info@elysiumtechnologies.com, elysiumtechnologies@gmail.com S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam, Chennai - 600034. 044-42072702, +91-9600354638, chennai@elysiumtechnologies.com 15, III Floor, SI Towers, Melapudur main Road, Trichy – 620001. 0431-4002234, + 91-9790464324. trichy@elysiumtechnologies.com 577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641002 0422- 4377758, +91-9677751577. coimbatore@elysiumtechnologies.com
  3. 3. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Plot No: 4, C Colony, P&T Extension, Perumal puram, Tirunelveli- 627007. 0462-2532104, +919677733255, tirunelveli@elysiumtechnologies.com 1st Floor, A.R.IT Park, Rasi Color Scan Building, Ramanathapuram - 623501. 04567-223225, +919677704922.ramnad@elysiumtechnologies.com 74, 2nd floor, K.V.K Complex,Upstairs Krishna Sweets, Mettur Road, Opp. Bus stand, Erode-638 011. 0424-4030055, +91- 9677748477 erode@elysiumtechnologies.com No: 88, First Floor, S.V.Patel Salai, Pondicherry – 605 001. 0413– 4200640 +91-9677704822 pondy@elysiumtechnologies.com TNHB A-Block, D.no.10, Opp: Hotel Ganesh Near Busstand. Salem – 636007, 0427-4042220, +91-9894444716. salem@elysiumtechnologies.com
  4. 4. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com ETPL VLSI-001 Pragmatic Integration of an SRAM Row Cache in Heterogeneous 3-D DRAM Architecture Using TSV Abstract: As scaling DRAM cells becomes more challenging and energy-efficient DRAM chips are in high demand, the DRAM industry has started to undertake an alternative approach to address these looming issues-that is, to vertically stack DRAM dies with through-silicon-vias (TSVs) using 3-D-IC technology. Furthermore, this emerging integration technology also makes heterogeneous die stacking in one DRAM package possible. Such a heterogeneous DRAM chip provides a unique, promising opportunity for computer architects to contemplate a new memory hierarchy for future system design. In this paper, we study how to design such a heterogeneous DRAM chip for improving both performance and energy efficiency. In particular, we found that, if we want to design an SRAM row cache in a DRAM chip, simple stacking alone cannot address the majority of traditional SRAM row cache design issues. In this paper, to address these issues, we propose a novel floorplan and several architectural techniques that fully exploit the benefits of 3-D stacking technology. Our multi-core simulation results with memory- intensive applications suggest that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while saving dynamic energy by 31%. ETPL VLSI-002 A Low-Complexity Turbo Decoder Architecture for Energy-Efficient Wireless Sensor Networks Abstract: Turbo codes have recently been considered for energy-constrained wireless communication applications, since they facilitate a low transmission energy consumption. However, in order to reduce the overall energy consumption, lookup table-log-BCJR (LUT-Log-BCJR) architectures having a low processing energy consumption are required. In this paper, we decompose the LUT-Log-BCJR architecture into its most fundamental add compare select (ACS) operations and perform them using a novel low-complexity ACS unit. We demonstrate that our architecture employs an order of magnitude fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a 71% energy consumption reduction. Compared to state-of-the-art maximum logarithmic Bahl-Cocke-Jelinek-Raviv implementations, our approach facilitates a 10% reduction in the overall energy consumption at ranges above 58 m. ETPL VLSI-003 Pipelined Radix- 2k Feedforward FFT Architectures Abstract: The appearance of radix-22 was a milestone in the design of pipelined FFT hardware architectures. Later, radix-22 was extended to radix-2k . However, radix-2k was only proposed for single- path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-2k feedforward (MDC) FFT architectures. In feedforward architectures radix-2k can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-2k feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-2k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.
  5. 5. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com ETPL VLSI-004 Algorithm and Architecture Design of Bandwidth-Oriented Motion Estimation for Real-Time Mobile Video Applications Abstract: This paper proposes a data bandwidth-oriented motion estimation design for resource-limited mobile video applications using an integrated bandwidth rate distortion optimization framework. This framework predicts and allocates the appropriate data bandwidth for motion estimation under a limited bandwidth supply to fit a dynamically changing bandwidth supply. The simulation results show that our proposed algorithm can achieve 66% and 41% memory bandwidth savings while maintaining an equivalent rate-distortion performance and meeting real-time targets, when compared with conventional approaches for low-motion and high-motion D1 (704 ×  576)-size video, respectively. The final implementation costs 122 K gate counts with TSMC 0.13-μ m CMOS technology and consumes 74 mW of power for D1 resolution at 30 frames/s which is 40% of that achieved in previous designs. ETPL VLSI-005 STBC-OFDM Downlink Baseband Receiver for Mobile WMAN Abstract: This paper proposes a space time block code-orthogonal frequency division multiplexing downlink baseband receiver for mobile wireless metropolitan area network. The proposed baseband receiver applied in the system with two transmit antennas and one receive antenna aims to provide high performance in outdoor mobile environments. It provides a simple and robust synchronizer and an accurate but hardware affordable channel estimator to overcome the challenge of multipath fading channels. The coded bit error rate performance for 16 quadrature amplitude modulation can achieve less than 10-6 under the vehicle speed of 120 km/hr. The proposed baseband receiver designed in 90-nm CMOS technology can support up to 27.32 Mb/s uncoded data transmission under 10 MHz channel bandwidth. It requires a core area of 2.41 × 2.41 mm2 and dissipates 68.48 mW at 78.4 MHz with 1 V power supply. ETPL VLSI-006 Glitch-Free NAND-Based Digitally Controlled Delay-Lines Abstract: The recently proposed NAND-based digitally controlled delay-lines (DCDL) present a glitching problem which may limit their employ in many applications. This paper presents a glitch-free NAND- based DCDL which overcame this limitation by opening the employ of NAND-based DCDLs in a wide range of applications. The proposed NAND-based DCDL maintains the same resolution and minimum delay of previously proposed NAND-based DCDL. The theoretical demonstration of the glitch-free operation of proposed DCDL is also derived in the paper. Following this analysis, three driving circuits for the delay control-bits are also proposed. Proposed DCDLs have been designed in a 90-nm CMOS technology and compared, in this technology, to the state-of-the-art. Simulation results show that novel circuits result in the lowest resolution, with a little worsening of the minimum delay with respect to the previously proposed DCDL with the lowest delay. Simulations also confirm the correctness of developed glitching model and sizing strategy. As example application, proposed DCDL is used to realize an All- digital spread-spectrum clock generator (SSCG). The employ of proposed DCDL in this circuit allows to reduce the peak-to-peak absolute output jitter of more than the 40% with respect to a SSCG using three- state inverter based DCDLs.
  6. 6. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com ETPL VLSI-007 A High-Efficiency, Wide Workload Range, Digital Off-Time Modulation (DOTM) DC- DC Converter With Asynchronous Power Saving Technique Abstract: Conventionally for wide workload range applications, to keep good stability and high efficiency, a switching converter with multi-mode operation is necessary. With the advanced digital signal processing, this work presents an asynchronous digital controller with dynamic power saving technique to achieve high power efficiency. The regulation is based on the off-time modulation, in which an adaptive resolution adjustment is proposed for the extension toward light-loaded range. The DC-DC converter is fabricated in a 0.18- μm CMOS process. The input voltage is from 2.7 to 3.6 V and the regulated output is 1.8 V. The switching frequency is from 44 kHz to 1.65 MHz and the maximum output ripple is 20 mV with a 10-μF capacitor and a 2.2-μH inductor. The power efficiency is higher than 91% for the workload range from 3 to 400 mA. ETPL VLSI-008 Formal Verification of Architectural Power Intent Abstract: This paper presents a verification framework that attempts to bridge the disconnect between high-level properties capturing the architectural power management strategy and the implementation of the power management control logic using low-level per-domain control signals. The novelty of the proposed framework is in demonstrating that the architectural power intent properties developed using high-level artifacts can be automatically translated into properties over low-level control sequences gleaned from UPF specifications of power domains, and that the resulting properties can be used to formally verify the global on-chip power management logic. The proposed translation uses a considerable amount of domain knowledge and is also not purely syntactic, because it requires formal extraction of timing information for the low-level control sequences. We present a tool, called POWER-TRUCTOR which enables the proposed framework, and several test cases of significant complexity to demonstrate the feasibility of the proposed framework. ETPL VLSI-009 Statistical SRAM Read Access Yield Improvement Using Negative Capacitance Circuits Abstract: SRAM has become the dominant block in modern ICs and constitutes more than 50% of the die area. The increase of process variations with continued CMOS technology scaling is considered one of the major challenges for SRAM designers. This process variations increase causes the SRAM cells to functionally fail and reduces the chip functional yield considering the static noise margin stability failures (i.e., cell flips when accessed), write failures (i.e., cell is not written within the write window), and read access failures (i.e., incorrect read operation). In this paper, novel negative capacitance circuits are developed, for the first time, to statistically improve the SRAM read access yield under process variations by reducing the bitlines parasitic capacitance. Post layout simulation results, referring to an industrial hardware-calibrated TSMC 65-nm CMOS technology, show that the adoption of the negative capacitance circuit to a 512 SRAM cells column is capable of improving the read access yield from 61.9% to 100%. ETPL VLSI-010 An Energy-Efficient L2 Cache Architecture Using Way Tag Information Under Write- Through Policy Abstract: Many high-performance microprocessors employ cache write-through policy for performance improvement and at the same time achieving good tolerance to soft errors in on-chip caches. However,
  7. 7. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com write-through policy also incurs large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. In this paper, we propose a new cache architecture referred to as way-tagged cache to improve the energy efficiency of write-through caches. By maintaining the way tags of L2 cache in the L1 cache during read operations, the proposed technique enables L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation. Simulation results on the SPEC CPU2000 benchmarks demonstrate that the proposed technique achieves 65.4% energy savings in L2 caches on average with only 0.02% area overhead and no performance degradation. Similar results are also obtained under different L1 and L2 cache configurations. Furthermore, the idea of way tagging can be applied to existing low-power cache design techniques to further improve energy efficiency. ETPL VLSI-011 An Analytical Latency Model for Networks-on-Chip Abstract: We propose an analytical model based on queueing theory for delay analysis in a wormhole- switched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non- saturated networks for different system-on-chip platforms. ETPL VLSI-012 Built-In Generation of Functional Broadside Tests Using a Fixed Hardware Structure Abstract: Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring that a circuit traverses only reachable states during the functional clock cycles of a test. In addition, the power dissipation during the fast functional clock cycles of functional broadside tests does not exceed that possible during functional operation. On-chip test generation has the added advantage that it reduces test data volume and facilitates at-speed test application. This paper shows that on-chip generation of functional broadside tests can be done using a simple and fixed hardware structure, with a small number of parameters that need to be tailored to a given circuit, and can achieve high transition fault coverage for testable circuits. With the proposed on-chip test generation method, the circuit is used for generating reachable states during test application. This alleviates the need to compute reachable states offline. ETPL VLSI-013 Checkpointing for Virtual Platforms and SystemC-TLM Abstract: Integrating simulation models created using different simulation systems is a common problem when constructing virtual platforms. Different companies and different departments can create models, and virtual platforms for different purposes using different tools. There are also existing models that need to be integrated into new tools, or the other way around. The simulators can be quite different in details, even in the case of transaction-level models. We present work in integrating SystemC transaction-level models into two typical full-system simulation environments, QEMU and Simics. We present issues in
  8. 8. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com reconciling the semantics of the different platforms, and our proposed solutions. In the Simics integration, we additionally enable checkpointing in the models, based on the Simics checkpoint mechanism. ETPL VLSI-014 Design of a Practical Nanometer-Scale Redundant Via-Aware Standard Cell Library for Improved Redundant Via1 Insertion Rate Abstract: Despite the rapid advances in process technology, via failure is still problematic in nanometer- scale semiconductor manufacturing. Adding redundant vias is a typical approach for improving yield and reliability. Cell-based design methodologies are widely adopted in the industry for application-specific integrated circuits. Standard cells are effective for increasing the insertion rate of redundant via1s in cell- based designs. This study proposes an efficient library check and staggered pin arrangement approach that compares redundant via1 insertion rate in different configurations such as double-via and rectangle-via. To compare the variability in standard cell (SC) libraries, accurate characterization results are provided. Moreover, the proposed SC library is easily implemented in all currently available routers. The experimental results reveal that the proposed library improves total inserted redundant vias, total inserted redundant via1s, and total run time by 20.2%, 51.9%, and 42.3%, respectively. In double-via pattern, the proposed approach improves average via1 insertion rate by 14.6%. In rectangle-via pattern, the proposed approach achieves a 100% via1 insertion rate. ETPL VLSI-015 Scaling Energy Per Operation via an Asynchronous Pipeline Abstract: Statistical analysis of computations per unit energy in processors over the last 30 years is given that illustrates a sharp reduction in the rate of energy efficiency improvements over the last several years resulting in the formation of an asymptotic “wall” with our dataset; we use the measure of giga multiply accumulates per Joule. We have developed an energy model which takes into account the realities of scaling, specifically for asynchronous systems. Studies of an energy efficient asynchronous pipeline show fabricated results of 17 Giga Operations per Joule in 0.6 μm at subthreshold when fully pipelined, and simulations at a more modern 65 nm process show a further order of magnitude improvement on that. ETPL VLSI-016 A High Speed Low Power CAM With a Parity Bit and Power-Gated ML Sensing Abstract: Content addressable memory (CAM) offers high-speed search function in a single clock cycle. Due to its parallel match-line (ML) comparison, CAM is power-hungry. Thus, robust, high-speed and low-power ML sense amplifiers are highly sought-after in CAM designs. In this paper, we introduce a parity bit that leads to 39% sensing delay reduction at a cost of less than 1% area and power overhead. Furthermore, we propose an effective gated-power technique to reduce the peak and average power consumption and enhance the robustness of the design against process variations. A feedback loop is employed to auto-turn off the power supply to the comparison elements and hence reduce the average power consumption by 64%. The proposed design can work at a supply voltage down to 0.5 V. ETPL VLSI-017 Error Detection in Majority Logic Decoding of Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes Abstract: In a recent paper, a method was proposed to accelerate the majority logic decoding of difference set low density parity check codes. This is useful as majority logic decoding can be implemented serially
  9. 9. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com with simple hardware but requires a large decoding time. For memory applications, this increases the memory access time. The method detects whether a word has errors in the first iterations of majority logic decoding, and when there are no errors the decoding ends without completing the rest of the iterations. Since most words in a memory will be error-free, the average decoding time is greatly reduced. In this brief, we study the application of a similar technique to a class of Euclidean geometry low density parity check (EG-LDPC) codes that are one step majority logic decodable. The results obtained show that the method is also effective for EG-LDPC codes. Extensive simulation results are given to accurately estimate the probability of error detection for different code sizes and numbers of errors. ETPL VLSI-018 Techniques for Compensating Memory Errors in JPEG2000 Abstract: This paper presents novel techniques to mitigate the effects of SRAM memory failures caused by low voltage operation in JPEG2000 implementations. We investigate error control coding schemes, specifically single error correction double error detection code based schemes, and propose an unequal error protection scheme tailored for JPEG2000 that reduces memory overhead with minimal effect in performance. Furthermore, we propose algorithm-specific techniques that exploit the characteristics of the discrete wavelet transform coefficients to identify and remove SRAM errors. These techniques do not require any additional memory, have low circuit overhead, and more importantly, reduce the memory power consumption significantly with only a small reduction in image quality. ETPL VLSI-019 Spatial Distribution Measurement of Dynamic Voltage Drop Caused by Pulse and Periodic Injection of Spot Noise Abstract: This paper presents measured results of dynamic voltage drop caused by pulse and periodic injection of spot noise. The test structure being fabricated by a 45 nm low-power process has 1024 delay probes to measure spatial distributions in response to the spot-noise generation. The test structure is the advanced version of our predecessor being fabricated by a 65-nm node, and can trace changes in the spatial distributions with time after the noise injection. The measured results are compared with SPICE simulations, in which package/socket LCR as well as power-line RC within the die is modeled. It is found that the simple model agrees well with the measured results. ETPL VLSI-020 Low-Complexity Multiplier for GF(2^{m}) Based on All-One Polynomials Abstract: This paper presents an area-time-efficient systolic structure for multiplication over GF(2m) based on irreducible all-one polynomial (AOP). We have used a novel cut-set retiming to reduce the duration of the critical-path to one XOR gate delay. It is further shown that the systolic structure can be decomposed into two or more parallel systolic branches, where the pair of parallel systolic branches has the same input operand, and they can share the same input operand registers. From the application- specific integrated circuit and field-programmable gate array synthesis results we find that the proposed design provides significantly less area-delay and power-delay complexities over the best of the existing designs. ETPL VLSI-021 Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip
  10. 10. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Abstract: This paper presents the silicon-proven design of a novel on-chip network to support guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data and its compact overhead enables the benefit of stacking multiple networks. A 0.13-μ m CMOS test-chip validates the feasibility and efficiency of the proposed design. Experimental results show that the proposed on-chip network achieves 1.9× to 8.2× reduction of silicon overhead compared to other design approaches. ETPL VLSI-022 An On-Chip Network Fabric Supporting Coarse-Grained Processor Array Abstract: Coarse grained arrays (CGAs) with run-time reconfigurability play an important role in accelerating reconfigurable computing applications. It is challenging to design on-chip communication networks (OCNs) for such CGAs with dynamic run-time reconfigurability whilst satisfying the tight budgets of power and area for an embedded system. This paper presents a silicon-proven design of a 64- PE circuit-switched OCN fabric with a dynamic path-setup scheme capable of supporting an embedded coarse-grained processor array. A proof-of-concept test chip fabricated in a 0.13 μm CMOS process occupies a silicon area of 23 mm2 and consumes a peak power of 200 mW @ 128 MHz and 1.2 Vcc, at room temperature. The OCN overhead consumes 9.4% of the area and 18% of the power of the total chip. Experimental results and analysis show that the proposed OCN fabric with its dynamic path-setup is suitable for use in an embedded CGA supporting fast run-time reconfigurability. ETPL VLSI-023 A Very Linear Low-Pass Filter with Automatic Frequency Tuning Abstract: A Gm-C third-order Chebyshev low-pass filter with a novel switched capacitor frequency tuning technique for a zero-IF Bluetooth receiver has been designed. The frequency tuning scheme is simpler and has more relaxed specifications than conventional ones. Furthermore, a highly linear pseudo- differential transconductor with a compact feedback loop able to operate with low supply voltage has been used. This control loop holds the input transistors in triode region and provides high output resistance, keeping high linearity in a wide range of transconductance. The filter bandwidth is 0.5 MHz and the overall scheme consumes 1.1 mA from a 1.8-V supply. The measured third-order intermodulation (IM3) distortion of the filter for a 1 Vpp two-tone signal centered at 300 kHz is -65 dB. ETPL VLSI-024 A High-Speed Low-Complexity Modified {rm Radix}-2^{5} FFT Processor for High Rate WPAN Applications Abstract: This paper presents a high-speed low-complexity modified radix-25 512-point fast Fourier transform (FFT) processor using an eight data-path pipelined approach for high rate wireless personal area network applications. A novel modified radix-25 FFT algorithm that reduces the hardware complexity is proposed. This method can reduce the number of complex multiplications and the size of the twiddle factor memory. It also uses a complex constant multiplier instead of a complex Booth multiplier. The proposed FFT processor achieves a signal-to-quantization noise ratio of 35 dB at 12 bit internal word length. The proposed processor has been designed and implemented using 90-nm CMOS technology with a supply voltage of 1.2 V. The results demonstrate that the total gate count of the
  11. 11. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com proposed FFT processor is 290 K. Furthermore, the highest throughput rate is up to 2.5 GS/s at 310 MHz while requiring much less hardware complexity. ETPL VLSI-025 Application Space Exploration of a Heterogeneous Run-Time Configurable Digital Signal Processor Abstract: This paper describes the application space exploration of a heterogeneous digital signal processor with dynamic reconfiguration capabilities. The device is built around three reconfigurable engines featuring different flavours and computation granularities that make it suitable for a wide range of signal processing application domains such as video coding, image processing, telecommunications, and cryptography. Performance of signal processing applications is evaluated from measurements performed on a CMOS 90 nm prototype. In order to characterize the application space of the processor, performance is compared with state-of-the-art devices, taking programmability, computational capabilities, and energy efficiency as the main metrics. The device exploits performance and energy efficiency significantly more than general purpose processors, while still maintaining a user-friendly programming approach that mainly relies on software-oriented languages. The device is able to achieve 1.2 to 15 GOPS with an energy efficiency from 2 to 50 GOPS/W when running the selected applications ETPL VLSI-026 A Unified Graphics and Vision Processor With a 0.89 mu W/fps Pose Estimation Engine for Augmented Reality Abstract: A unified vision and graphics processor with three layers is shown to provide a fast pipeline for augmented reality. In the image-level layer, a 153.6 GOPS massively parallel processing unit with eight SIMD processors, each containing 128 processing elements, performs highly data-parallel operations. In the sub-image layer, a rasterizer and a pixel arranger respectively generate and reduce data-level parallelism. In the descriptor-level layer, a pose estimation engine executes sequential programs. Our processor can provide images for augmented reality at 100 fps, for a power consumption of 413 mW. This is 39% faster than a comparable smartphone implementation. Our chip is fabricated in a 0.18 μm CMOS process and contains 0.95 M gates. ETPL VLSI-027 CORDIC Designs for Fixed Angle of Rotation Abstract: Rotation of vectors through fixed and known angles has wide applications in robotics, digital signal processing, graphics, games, and animation. But, we do not find any optimized coordinate rotation digital computer (CORDIC) design for vector-rotation through specific angles. Therefore, in this paper, we present optimization schemes and CORDIC circuits for fixed and known rotations with different levels of accuracy. For reducing the area- and time-complexities, we have proposed a hardwired pre- shifting scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for the fixed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the other they are implemented in two separate stages. Pipelined schemes are suggested further for cascading dedicated single-rotation units and bi-rotation CORDIC units for high-throughput and reduced latency implementations. We have obtained the optimized set of micro-rotations for fixed and known angles. The optimized scale-factors are also derived and dedicated shift-add circuits are designed to implement the scaling. The fixed-point mean-squared-error of the proposed CORDIC circuit is analyzed statistically, and strategies for reducing the error are given. We have synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-nm library, and shown that the proposed designs offer higher
  12. 12. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com throughput, less latency and less area-delay product than the reference CORDIC design for fixed and known angles of rotation. We find similar results of synthesis for different Xilinx field-programmable gate-array platforms. ETPL VLSI-028 Application-Driven End-to-End Traffic Predictions for Low Power NoC Design Abstract: As chip multiprocessors keep increasing the number of cores on the chip, the network-on-chip (NoC) technology is becoming essential for interconnecting the cores. While NoCs result in noticeable performance boost over conventional bus systems, they consume a non-negligible fraction of the system power. One promising solution is to dynamically adjust the working frequencies/voltages of the switches as well as the links between switches in the NoC to match the traffic flows. The question is when to adjust and by how much. Most previous works take a passive approach by reacting to fluctuations in local traffic flows. Unfortunately, this approach may be too slow and too conservative in adjusting the working frequencies/voltages. Since applications often exhibit periodic behaviors, we propose a hardware mechanism to proactively adjust the frequencies/voltages of switches and/or links in NoC by predicting the application runtime traffic. The evaluations show that our design achieves 86% dynamic power savings of the links in the on-chip network, and the resulting overheads from mispredictions are tolerable. ETPL VLSI-029 Thermal-Constrained Task Allocation for Interconnect Energy Reduction in 3-D Homogeneous MPSoCs Abstract: 3-D technology that stacks silicon dies with through silicon vias (TSVs) is a promising solution to overcome the interconnect scaling problem in giga-scale integrated circuits (ICs). Thermal dissipation is a major challenge for 3-D integration and prior thermal-balanced task scheduling methods for 3-D multiprocessor system-on-chips (MPSoCs) typically balance power gradient across vertical stacks based on the assumption of strong thermal correlation among processing cores within a stack. On the other hand, 3-D MPSoCs typically employ network-on-chip (NoC) as the communication infrastructure which consumes a large portion of the energy budget. As TSVs consume much less energy than horizontal links in 3-D MPSoCs when transmitting the same amount data due to the reduced interconnect distance between vertical adjacent cores, it motivates to allocate heavily communicating tasks within the same vertical stack as much as possible, and thus traffic is restricted in the third dimension to reduce interconnect energy. However, aggregating active tasks within the same stack probably exacerbates the power density and result in hot spots. In this paper, we explore the tradeoff between thermal and interconnect energy when allocating tasks in 3-D Homogeneous MPSoCs, and propose an efficient heuristic. Experimental results show that the proposed technique can reduce interconnect energy by more than 25% on average with almost the same peak temperature when compared with prior thermal-balanced solutions. ETPL VLSI-030 A Wide-Range PLL Using Self-Healing Prescaler/VCO in 65-nm CMOS Abstract: The variability and leakage current in nanoscale CMOS technology may degrade the circuit performances significantly. To accommodate the above issues in a wide-range phase-locked loop (PLL), a self-healing prescaler, a self-healing voltage-controlled oscillator (VCO), and a calibrated charge pump (CP) are presented. This PLL is fabricated in a 65-nm CMOS technology and its active area is 0.0182 mm2 . For the self-healing VCO, its measured frequency range is from 60 to 1489 MHz. When this PLL
  13. 13. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com operates at 855 MHz, the measured rms and peak-to-peak jitters are 8.03 and 55.6 ps, respectively. The measured reference spur is -52.89 dBc. This PLL consumes 4.3 mW from 1.2 V supply without buffers. ETPL VLSI-031 A Clock Control Strategy for Peak Power and RMS Current Reduction Using Path Clustering Abstract: Peak power reduction has been a critical challenge in the design of integrated circuits impacting the chip's performance and reliability. The reduction of peak power also reduces the power density of integrated circuits. Due to large IR-voltage drops in circuits, transistor switching slows down giving rise to timing violations and logic failures. In this paper, we present a new clock control strategy for peak- power reduction in VLSI circuits. In the proposed method, the simultaneous switching of combinational paths is minimized by taking advantage of the delay slacks among the paths and clustering the paths with similar slack values. Once the paths are identified based on the path delays and their slack values, the clustering algorithm determines the ideal number of clusters for the given circuit and for each cluster the maximum possible phase shift that can be applied to the clock. The paths are assigned to clusters in a load balanced manner based on the slack values and each cluster will have a phase shift possible on its clock depending on the slack. Thus, the proposed register-transfer level (RTL) method takes advantage of the logic-path timing slack to re-schedule circuit activities at optimal intervals within the unaltered clock period. When switching activities are redistributed more evenly across the clock period, the IC supply- current consumption is also spread across a wider range of time within the clock period. This has the beneficial effect of reducing peak-current draw in addition to reducing RMS power draw without having to change the operating frequency and without utilizing additional power supply voltages as in dual or multi VT approaches. The proposed method is implemented and tested through simulations using an experimental setup with Synopsys Tools Suite and Cadence Tools on the ISCAS'85 benchmark circuits, OpenCore circuits and LEON processor multiplier circuit. Experimental results indicate that peak power can be reduced significantly to at- least 72% depending on the number of clusters and the phase-shifted clock identified as suitable for the given circuit by the proposed algorithms. Although the proposed method incurs some power overhead compared to the traditional clocking method, the overhead can be made negligible compared to the peak-power reduction as seen in the experimental results presented. ETPL VLSI-032 A Fast-Locking All-Digital Deskew Buffer With Duty-Cycle Correction Abstract: In this paper, a fast-locking all-digital deskew buffer with duty cycle correction is proposed and implemented. A cyclic time-to-digital converter is introduced to decrease the locking time in conventional register-controlled delay-locked loop to only two input clock cycles in coarse tuning. With the aid of the three half delay lines technique, the mismatch between half delay lines causing the duty cycle distortion can be alleviated by interpolation. A balanced edge combiner to achieve a precise 50% output clock is also presented. A test chip is fabricated in 0.18-μm technology to demonstrate the feasibility of the proposed architecture. The circuit can accept the input clock rates from 250 to 625 MHz with the duty cycle variation within 30% and 70% to generate 50% output clocks. It preserves the capability of closed- loop control with a small area and power consumption. ETPL VLSI-033 A Built-In Repair Analyzer With Optimal Repair Rate for Word-Oriented Memories
  14. 14. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Abstract: This paper presents a built-in self repair analyzer with the optimal repair rate for memory arrays with redundancy. The proposed method requires only a single test, even in the worst case. By performing the must-repair analysis on the fly during the test, it selectively stores fault addresses, and the final analysis to find a solution is performed on the stored fault addresses. To enumerate all possible solutions, existing techniques use depth first search using a stack and a finite-state machine. Instead, we propose a new algorithm and its combinational circuit implementation. Since our formulation for the circuit allows us to use the parallel prefix algorithm, it can be configured in various ways to meet area and test time requirements. The total area of our infrastructure is dominated by the number of content addressable memory entries to store the fault addresses, and it only grows quadratically with respect to the number of repair elements. The infrastructure is also extended to support various types of word-oriented memories. ETPL VLSI-034 System-Level Modeling and Analysis of Thermal Effects in Optical Networks-on-Chip Abstract: The performance of multiprocessor systems, such as chip multiprocessors (CMPs), is determined not only by individual processor performance, but also by how efficiently the processors collaborate with one another. It is the communication architecture that determines the collaboration efficiency on the hardware side. Optical networks-on-chip (ONoCs) are emerging communication architectures that can potentially offer ultra-high communication bandwidth and low latency to multiprocessor systems. Thermal sensitivity is an intrinsic characteristic of photonic devices used by ONoCs as well as a potential issue. This paper systematically modeled and quantitatively analyzed the thermal effects in ONoCs. We used an 8 × 8 mesh-based ONoC as a case study and evaluated the impacts of thermal effects in the average power efficiency for real MPSoC applications. We revealed three important factors regarding ONoC power efficiency under temperature variations, and proposed several techniques to reduce the temperature sensitivity of ONoCs. These techniques include the optimal initial setting of microresonator resonant wavelength, increasing the 3-dB bandwidth of optical switching elements by parallel coupling multiple microresonators, and the use of passive-routing optical router Crux to minimize the number of switching stages in mesh-based ONoCs. We gave a mathematical analysis of periodically parallel coupling of multiple microresonators and show that the 3-dB bandwidth of optical switching elements can be widened nearly linearly with the ring number. Evaluation results for different real MPSoC applications show that, on the basis of thermal tuning, the optimal device setting improves the average power efficiency by 54% to 1.2 pJ/bit when chip temperature reaches 85 °C. The findings in this paper can help support the further development of this emerging technology. ETPL VLSI-035 A Study of Tapered 3-D TSVs for Power and Thermal Integrity Abstract: 3-D integration presents a path to higher performance, greater density, increased functionality and heterogeneous technology implementation. However, 3-D integration introduces many challenges for power and thermal integrity due to large switching currents, longer power delivery paths, and increased parasitics compared to 2-D integration. In this work, we provide an in-depth study of power and thermal issues while incorporating the physical design characteristics unique to 3-D integration. We provide a qualitative perspective of the power and thermal dissipation issues in 3-D and study the impact of Through Silicon Vias (TSVs) size for their mitigation. We investigate and discuss the design implications of power and thermal issues in the presence of decoupling capacitors, TSV/on-die/package parasitics, various resonance effects and power gating. Our study is based on a ten-tier system utilizing existing 3-D
  15. 15. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com technology specifications. Based on detailed power distribution and heat dissipation models, we present a comprehensive analysis of TSV tapering for alleviating power and thermal integrity issues in 3-D ICs. ETPL VLSI-036 Improved Trace Buffer Observation via Selective Data Capture Using 2-D Compaction for Post-Silicon Debug Abstract: This paper presents a novel technique for extending the capacity of trace buffers when capturing debug data during post-silicon debug. It exploits the fact that is it not necessary to capture error-free data in the trace buffer since that information can be obtained from simulation. A selective data capture method is proposed in this paper that only captures debug data during clock cycles in which errors are present. The proposed debug method requires only three debug sessions. The first session estimates a rough error rate, the second session identifies a set of suspect clock cycles where errors may be present, and the third session captures the suspect clock cycles in the trace buffer. The suspect clock cycles are determined through a 2-D compaction technique using multiple-input signature register signatures and cycling register signatures. Intersecting both signatures generates a small number of suspect clock cycles for which the trace buffer needs to capture. The effective observation window of the trace buffer can be expanded significantly, by up to orders of magnitude. Experimental results indicate very significant increases in the effective observation window for a trace buffer can be obtained. ETPL VLSI-037 AC-Plus Scan Methodology for Small Delay Testing and Characterization Abstract: Small delay defects escaping traditional delay testing could cause a device to malfunction in the field and thus detecting these defects is often necessary. To address this issue, we propose three test modes in a new methodology called AC-plus scan, in which versatile test clocks can be generated on the chip by embedding an all-digital phase-locked loop (ADPLL) into the circuit under test (CUT). AC-plus scan can be executed on an in-house wireless test platform called HOY system. The first test mode of our AC-plus scan provides a more efficient way to measure the longest path delay associated with each test pattern. Experimental result shows that our method could greatly reduce the test time by 81.8%. The second test mode is designed for volume production test. It could effectively detect small delay defects and provide fast characterization on those defective chips for further processing. This mode could be used to help predict which chips are more likely to fall victim to operational failure in the field. The third test mode is to extract the waveform of each flip-flop's output in a real chip. This is made possible by taking advantage of the almost unlimited test memory our HOY test platform provides, so that we could easily store a great volume of data and reconstruct the waveform for post-silicon debugging. We have successfully fabricated a Viterbi decoder chip with such an AC-plus scan methodology inside to demonstrate its capability. ETPL VLSI-038 A Variation Tolerant Current-Mode Signaling Scheme for On-Chip Interconnects Abstract: Current-mode signaling (CMS) with dynamic overdriving is one of the most promising scheme for high-speed low-power communication over long on-chip interconnects. However, they are sensitive to parameter variations due to reduced voltage swings on the line. In this paper, we propose a variation tolerant dynamic overdriving CMS scheme. The proposed CMS scheme and a competing CMS scheme (CMS-Fb) are fabricated in 180-nm CMOS technology. Measurement results show that the proposed scheme offers 34% reduction in energy/bit and 42% reduction in energy-delay-product over CMS-Fb
  16. 16. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com scheme for a 10 mm line operating at 0.64 Gbps of data rate. Simulations indicate that the proposed CMS scheme consumes 0.297 pJ/bit for data transfer over the 10 mm line at 2.63 Gb/s. Measurements indicate that the delay of CMS-Fb becomes 2.5 times its nominal value in the presence of intra-die variations whereas the delay of the proposed scheme changes by only 5% for the same amount of intra-die variations. Measurement and simulation results show that both the schemes are robust against inter-die variations. Experiments and simulations also indicate that the proposed CMS scheme is more robust against practical variations in supply and temperature as compared to CMS-Fb scheme. ETPL VLSI-039 Modeling and Analysis of Power Distribution Networks in 3-D ICs Abstract: This paper addresses the modeling and analysis problems for power distribution networks (PDNs) in 3-D ICs. An on-chip distributed model is proposed for 3-D power grids, in which the details of metal layers are considered. The distributed model is demonstrated to be essential to identifying the unique noise behavior of 3-D PDNs. A lumped model is proposed based on the distributed model. The lumped model features the connection impedance between tiers and is proven to be useful for designers to understand the global effects of 3-D PDNs. Based on the models, an analysis flow is designed for 3-D PDNs in both frequency domain and time domain. With the analysis flow, the electrical characteristics of 3-D PDNs are studied systematically for the first time. The frequency-domain analysis identifies the global and local resonance phenomena in 3-D PDNs that are distinct from those in 2-D PDNs. The physical mechanisms behind the resonance phenomena are investigated. The time-domain analysis predicts the worst-case supply noise based on distributed current constraints. The “Rogue Wave” concept is introduced to explain the spatial and temporal relations of the worst-case on-chip noise responses in 3- D PDNs. ETPL VLSI-040 A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits Abstract: Due to current technology scaling trends such as shrinking feature sizes and decreasing supply voltages, circuit reliability is becoming more susceptible to radiation-induced transient faults (soft errors). Soft errors, which have been a great concern in memories, are now a main factor in reliability degradation of logic circuits as well. In this paper, we present a systematic and integrated methodology for circuit robustness to soft errors. The proposed soft error rate (SER) reduction framework, based on redundancy addition and removal (RAR), aims at eliminating those gates with large contribution to the overall SER. Several metrics and constraints are introduced to guide the RAR-based approach toward SER reduction. Furthermore, we integrate a resizing strategy into our framework, as post-RAR additive SER optimization. The strategy can identify most critical gates to be upsized and thereby, minimize area and power overheads while maintaining a high level of soft error robustness. Experimental results show that the proposed RAR-based framework can achieve up to 70% reduction in output failure probability. On average, about 23% SER reduction is obtained with less than 4% area overhead. ETPL VLSI-041 Low Complexity Out-of-Order Issue Logic Using Static Circuits Abstract: In this paper a single-cycle issue queue circuit architecture that simplifies the wakeup and selection logic is proposed. The micro-architecture and fully static CMOS circuits are presented for a 32- entry queue that issues four instructions per cycle. The instruction-ready signals are divided into groups and processed in parallel to issue the four oldest ready instructions. The complete issue queue and prioritization logic requires 20 inversions, allowing simulated circuit operation at over 4 GHz in a foundry
  17. 17. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com 45 nm SOI fabrication process. ETPL VLSI-042 Low Latency Systolic Montgomery Multiplier for Finite Field GF(2^{m}) Based on Pentanomials Abstract: In this paper, we present a low latency systolic Montgomery multiplier over GF(2m) based on irreducible pentanomials. An efficient algorithm is presented to decompose the multiplication into a number of independent units to facilitate parallel processing. Besides, a novel so-called “pre-computed addition” technique is introduced to further reduce the latency. The proposed design involves significantly less area-delay and power-delay complexities compared with the best of the existing designs. It has the same or shorter critical-path and involves nearly one-fourth of the latency of the other in case of the National Institute of Standards and Technology recommended irreducible pentanomials. ETPL VLSI-043 Power-Up Sequence Control for MTCMOS Designs Abstract: Power gating is effective for reducing standby leakage power as multi-threshold CMOS (MTCMOS) designs have become popular in the industry. However, a large inrush current and dynamic IR drop may occur when a circuit domain is powered up with MTCMOS switches. This could in turn lead to improper circuit operation. We propose a novel framework for generating a proper power-up sequence of the switches to control the inrush current of a power-gated domain while minimizing the power-up time and reducing the dynamic IR drop of the active domains. We also propose a configurable domino- delay circuit for implementing the sequence. Experimental results based on state-of-the-art industrial designs demonstrate the effectiveness of the proposed framework in limiting the inrush current, minimizing the power-up time, and reducing the dynamic IR drop. Results further confirm the efficiency of the framework in handling large-scale designs with more than 80 K power switches and 100 M transistors. ETPL VLSI-044 Architecture and Design Flow for a Highly Efficient Structured ASIC Abstract: As fabrication process technology continues to advance, mask set costs have become prohibitively expensive. Structured application specific integrated circuits (sASICs) offer a middle ground in price and performance between ASICs and field-programmable gate arrays (FPGAs) by sharing masks across different designs. In this paper, two sASIC architectures are proposed, the first being based on three-input lookup-tables, and the second on AOI22 gates. The sASICs are programmed using a standard- cell compatible design flow. They are customized using a minimum of three masks, i.e., two metals and one via. The area and delay of the sASIC are compared with ASICs and FPGAs. Results over a set of benchmark circuits show that our AOI22-based sASIC had an average of 1.76x/1.41x increase in area/delay compared to ASICs, a considerable improvement compared with the 26.56x/5.09x increase for FPGAs. This is, to the best of our knowledge, the best performance reported in the literature for a practical sASIC. A prototype using the sASIC was fabricated using a universal machine control 0.13-μm mixed-mode/RF process. It was fully verified using scan and functional tests, and used in a demonstration system. ETPL VLSI-045 Secure Dual-Core Cryptoprocessor for Pairings Over Barreto-Naehrig Curves on FPGA Platform, Abstract: This paper is devoted to the design and the physical security of a parallel dual-core flexible cryptoprocessor for computing pairings over Barreto-Naehrig (BN) curves. The proposed design is specifically optimized for field-programmable gate-array (FPGA) platforms. The design explores the in- built features of an FPGA device for achieving an efficient cryptoprocessor for computing 128-bit secure
  18. 18. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com pairings. The work further pinpoints the vulnerability of those pairing computations against side-channel attacks and demonstrates experimentally that power consumptions of such devices can be used to attack these ciphers. Finally, we suggest a suitable countermeasure to overcome the respective weaknesses. The proposed secure cryptoprocessor needs 1 730 000, 1 206 000, and 821 000 cycles for the computation of Tate, ate, and optimal-ate pairings, respectively. The implementation results on a Virtex-6 FPGA device shows that it consumes 23 k Slices and computes the respective pairings in 11.93, 8.32, and 5.66 ms. ETPL VLSI-046 In-Situ Method for TSV Delay Testing and Characterization Using Input Sensitivity Analysis Abstract: In this paper, we propose a method and the required architecture for characterizing the propagation delays of the through Silicon vias (TSVs) in a 3-D IC. First of all, every two TSVs are paired up to form an oscillation ring with some peripheral circuits. Their joint performance can thus be measured roughly by the oscillation period of the ring. Next, we utilize a technique called sensitivity analysis to further derive the propagation delay of each individual TSV participating in an oscillation ring-a distilling process. In this process, we perturb the strength of the two TSV drivers, and then measure their effects in terms of the change of the oscillation ring's period. By some following analysis, the propagation delay of each TSV can be revealed. On top of scheme, we also present an architecture that can activate the performance characterization process of each test unit - that consists of two TSVs - one at a time in a proper sequence. The area overhead is only 18.97 equivalent two-input NAND gate per TSV, by which one can gain the ability to profile the capacitances and the propagation delays of the TSVs on a 3-D IC. ETPL VLSI-047 Low-Resolution DAC-Driven Linearity Testing of Higher Resolution ADCs Using Polynomial Fitting Measurements Abstract: A low-cost linearity test methodology for high-resolution analog-to-digital converters (ADCs) is presented in this paper. Linearity testing of ADCs requires high-precision digital-to-analog conversion (DAC) capability, commonly 3-bit higher resolution than the ADC under test. Further, a large number of ADC output data samples must be collected making conventional histogram testing impractical for high- resolution ADCs with 18-24 bit precision. In the proposed test methodology, two low-precision and low- cost DACs are used to generate a high-resolution ADC test stimulus. Significant reductions in test cost and test time are achieved by using low-cost instrumentation and by making fewer measurements than required for conventional histogram test. A least-squares-based polynomial fitting approach is used to determine the transfer function of the ADC under test. The generated transfer function is used to compute the non-linearity of the ADC accurately. No assumption is made regarding the linearity of the lower precision signal generators (DACs) used in the testing procedure. Software simulations and hardware experiments are performed to validate the proposed test methodology ETPL VLSI-048 Low-Cost Error Tolerance Scheme for 3-D CMOS Imagers Abstract: This paper presents an error tolerance scheme for 3-D CMOS imagers that are constructed by stacking a pixel array of imager sensors, an analog-to-digital converter (ADC) array, and an image signal processor (ISP) array using microbumps (μbumps) and through silicon vias (TSVs). To deliver high- quality images in the presence of single or multiple μbump, ADC, or TSV failures, we propose to interleave the connections from pixels to ADCs and recover the corrupted data in the ISPs. Key design parameters, such as the interleaving stride and the grouping ratio are determined by analyzing the employed error correction algorithm. Architectural simulation results demonstrate that the error tolerance
  19. 19. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com scheme enhances the effective yield of an exemplar 3-D imager from 44% to 97%. ETPL VLSI-049 Computing Two-Pattern Test Cubes for Transition Path Delay Faults Abstract: Considering full-scan circuits, incompletely-specified tests, or test cubes, are used for test data compression. When considering path delay faults, certain specified input values in a test cube are needed only for determining the lengths of the paths associated with detected faults. Path delay faults, and therefore, small delay defects, would still be detected if such values are unspecified. The goal of this paper is to explore the possibility of increasing the number of unspecified input values in a test set for path delay faults by unspecifying such values in order to make the test set more amenable to test data compression. Experimental results indicate that significant numbers of such values exist. The proposed procedure unspecifies them gradually to obtain a series of test sets with increasing numbers of unspecified values and decreasing path lengths. Experimental results also indicate that filling the unspecified values randomly (as with some test data compression methods) recovers some or all of the path lengths associated with detected path delay faults. The procedure uses a matching of the sets of detected faults for the comparison of path lengths ETPL VLSI-050 Integrated Energy-Harvesting Photodiodes With Diffractive Storage Capacitance Abstract: Integrating energy-harvesting photodiodes with logic and exploiting on-die interconnect capacitance for energy storage can enable new, ultraminiaturized wireless systems. Unlike CMOS imager pixels, the proposed photodiode designs utilize p-diffusion fingers and are implemented in a conventional logic process. Also unlike specialized solar cell processes, the designs utilize the on-chip metal interconnect to form a diffraction grating above the p-diffusion fingers which also provides capacitive energy storage. To explore the tradeoffs between optical efficiency and energy storage for integrated photodiodes, an array of photovoltaics with various diffractive storage capacitors was designed in a 90- nm CMOS logic process. The diffractive effects can be exploited to increase the photodiodes' response to off-axis illumination. Transient effects from interfacing the photodiodes with switched-capacitor DC-DC converters were examined, with measurements indicating a 50% reduction in the output voltage ripple due to the diffractive storage capacitance. A quantitative comparison between 90-nm and 0.35-μm CMOS logic processes for energy-harvesting capabilities was carried out. Measurements show an increase in power generation for the newer CMOS technology, however at the cost of reduced output voltage. One potential application for the integrated photodiodes is harvesting energy for a subdermal biomedical device. ETPL VLSI-051 Fast Fixed-Outline 3-D IC Floorplanning With TSV Co-Placement Abstract: Through-silicon vias (TSVs) are used to connect inter-die signals in a 3-D IC. Unlike conventional vias, TSVs occupy device area and are very large compared to logic gates. However, most previous 3-D floorplanners only view TSVs as points. As a result, whitespace redistribution is necessary for TSV insertion after the initial floorplan is computed, which leads to suboptimal layouts. In this paper, we propose a very efficient 3-D floorplanner to simultaneously floorplan the functional modules and place the TSVs and to optimize the total wirelength under fixed-outline constraint. Compared to the state-
  20. 20. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com of-the-art 3-D floorplanner with TSV planning, our design consistently produces better floorplans with 15% shorter wirelength and 31% fewer TSVs on average. Our algorithm is extremely fast and only takes a few seconds to floorplan benchmarks with hundreds of modules compared to hours as required by the previous state-of-the-art floorplanner. ETPL VLSI-052 Reactivation Noise Suppression With Sleep Signal Slew Rate Modulation in MTCMOS Circuits Abstract: Multi-threshold CMOS (MTCMOS) is commonly used for suppressing leakage currents in idle integrated circuits. Power and ground distribution network noise produced during SLEEP to ACTIVE mode transitions is an important reliability concern in MTCMOS circuits. Sleep signal slew rate modulation techniques for suppressing mode-transition noise are explored in this paper. A triple-phase sleep signal slew rate modulation (TPS) technique with a novel digital sleep signal generator is proposed. Reactivation time, mode-transition energy consumption, leakage power consumption, and layout area of different MTCMOS circuits are characterized under an equal-noise constraint. Influences of within-die and die-to-die parameter variations on the reactivation noise, time, and energy consumption of sleep signal slew rate modulated MTCMOS circuits are evaluated with a process imperfections aware robustness metric. The proposed triple-phase sleep signal slew rate modulation technique enhances the tolerance to process parameter fluctuations by up to 183.1× as compared to various alternative MTCMOS noise suppression techniques in a UMC 80-nm CMOS technology. ETPL VLSI-053 Sub-mW LC Dual-Input Injection-Locked Oscillator for Autonomous WBSNs Abstract: This paper presents a sub-mW, current-reused first-harmonic LC injection-locked oscillator (ILO) using in-phase dual-input injection technique. It can be used as a power oscillator in the injection- locked transmitter of wireless biomedical sensor nodes (WBSNs) integrated into a wireless body area network. A prototype chip, implemented in a standard 0.13-μm CMOS process occupying 200 × 380 μm, operates in the medical implantable communications service (MICS) band for medical implants. Measurement results show that the proposed ILO features a wide locking range of 800 MHz (150-950 MHz) at an input power of 0 dBm. More importantly, it has a high input sensitivity of -30 dBm to lock the 3-MHz bandwidth of the MICS band, while consuming only 660 μW at 1-V supply. This ultralow power consumption enables autonomous WBSNs ETPL VLSI-054 Constant Delay Logic Style Abstract: A constant delay (CD) logic style is proposed in this paper, targeting at full-custom high-speed applications. The CD characteristic of this logic style regardless of the logic type makes it suitable in implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage is ready. This feature offers performance advantage over static and dynamic domino logic styles in a single-cycle multistage circuit block. Several design considerations including timing window width adjustment and clock distribution are discussed. Using 65-nm general-purpose CMOS technology, the proposed logic demonstrates an average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different logic gates. Simulation results of 8-bit ripple carry adders show that CD logic is 39% and 23% faster than the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64%
  21. 21. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com (22%) energy-delay product (EDP) reduction from static logic at 100% (10%) data activity in 32-bit carry lookahead adders. For 8-bit Wallace tree multiplier, CD logic achieves a similar speedup with at least 50% EDP reduction across all data activities. ETPL VLSI-055 A Compact Clock Generator for Heterogeneous GALS MPSoCs in 65-nm CMOS Technology Abstract: This paper presents an all-digital phase-locked loop (ADPLL) clock generator for globally asynchronous locally synchronous (GALS) multiprocessor systems-on-chip (MPSoCs). With its low power consumption of 2.7 mW and ultra small chip area of 0.0078 mm2 it can be instantiated per core for fine-grained power management like DVFS. It is based on an ADPLL providing a multiphase clock signal from which core frequencies from 83 to 666 MHz with 50% duty cycle are generated by phase rotation and frequency division. The clock meets the specification for DDR2/DDR3 memory interfaces. Additionally, it provides a dedicated high-speed clock up to 4 GHz for serial network-on-chip data links. Core frequencies can be changed arbitrarily within one clock cycle for fast dynamic frequency scaling applications. The performance including statistical analysis of mismatch has been verified by a prototype in 65-nm CMOS technology. ETPL VLSI-056 A Colpitts CMOS Quadrature VCO Using Direct Connection of Substrates for Coupling Abstract: A new low-phase noise low-power quadrature voltage-controlled oscillator (QVCO) using differential Colpitts oscillator is presented. The proposed QVCO is composed of two identical current- switching differential Colpitts VCOs in which the first core VCO is coupled to the second in an in-phase manner, and the second core VCO is coupled to the first in an anti-phase manner. To couple the two core VCOs, the substrates of the cross-connected transistors as well as the substrates of MOS varactors are used; alleviating the need for any extra elements for coupling, which could add noise and increase power dissipation. A linear (sinusoidal) analysis is presented that confirms that the proposed circuit generates quadrature waveforms. The proposed coupling technique can be generalized to N differential Colpitts VCOs for multiphase signals generation ETPL VLSI-057 A Self-Calibrated DLL-Based Clock Generator for an Energy-Aware EISC Processor Abstract: This paper describes a low-jitter delay-locked loop (DLL)-based clock generator for dynamic frequency scaling in the extendable instruction set computing (EISC) processor. The DLL-based clock generator provides the system clock with frequencies of 0.5× to 8× of the reference clock, according to the workload of the EISC processor. The proposed analog self-calibration method and a phase detector with an auxiliary charge pump can effectively reduce the delay mismatch between delay cells in the voltage-controlled delay line and the static phase offset due to the current mismatch in the charge pump, respectively. The self-calibrated output waveform exhibits 9.7 ps of RMS jitter and 73.7 ps of peak-to- peak jitter at 120 MHz. The prototype clock generator implemented in a 0.18-μm CMOS process occupies an active area of 0.27 mm2 and consumes 15.56 mA ETPL VLSI-058 Clamping Virtual Supply Voltage of Power-Gated Circuits for Active Leakage Reduction and Gate-Oxide Reliability
  22. 22. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Abstract: In an integrated circuit (IC) adopting a power-gating (PG) technique, the virtual supply voltage (VVDD) is susceptible to: 1) negative-bias temperature instability (NBTI) degradation that weakens the PG device over time and 2) temporal temperature variation that affects active leakage current (thus total current) of the IC. The PG device is sized to guarantee a minimum VVDD level over the chip lifetime. Thus, the NBTI degradation and the worst-case total current at high-temperature must be considered for sizing the PG device. This leads to higher VVDD (thus active leakage power) than necessary in early chip lifetime and/or at low temperature, negatively impacting the gate-oxide reliability of transistors. To reduce active leakage power increase and improve the gate-oxide reliability due to these effects, we propose two techniques that adjust the strength of a PG device based on its usage and IC's temperature at runtime. We demonstrate the efficacy of these techniques with an experimental setup using a 32-nm technology model in the presence of within-die spatial process and temperature variations. On an average of 100 die samples, they can reduce dynamic and active leakage power by up to 3.7% and 10% in early chip lifetime. Finally, these techniques also reduce the oxide failure rate by up to 5% across process corners over a period of 7 years. ETPL VLSI-059 10-bit 30-MS/s SAR ADC Using a Switchback Switching Method Abstract: This brief presents a 10-bit 30-MS/s successive-approximation-register analog-to-digital converter (ADC) that uses a power efficient switchback switching method. With respect to the monotonic switching method, the input common-mode voltage variation reduces which improves the dynamic offset and the parasitic capacitance variation of the comparator. The proposed switchback switching method does not consume any power at the first digital-to-analog converter switching, which can reduce the power consumption and design effort of the reference buffer. The prototype was fabricated in a 90-nm 1P9M CMOS technology. At 1-V supply and 30 MS/s, the ADC achieves an sequenced neighbor double reservation of 56.89 dB and consumes 0.98 mW, resulting in a figure-of-merit (FOM) of 57 fJ/conversion-step. The ADC core occupies an active area of only 190 × 525 μm2. ETPL VLSI-060 Spur-Reduction Frequency Synthesizer Exploiting Randomly Selected PFD Abstract: This brief presents a low-spur phase-locked loop (PLL) system for wireless applications. The low-spur frequency synthesizer randomizes the periodic ripples on the control voltage of the voltage- controlled oscillator to reduce the reference spur at the output of the PLL. A novel random clock generator is presented to perform the random selection of the phase frequency detector control for the charge pump in locked state. The proposed frequency synthesizer was fabricated in a TSMC 0.18-μm CMOS process. The proposed PLL achieved phase noise of -93 dBc/Hz with a 600-kHz offset frequency and reference spurs below -72 dBc. ETPL VLSI-061 Gain-Enhanced Monolithic Charge Pump With Simultaneous Dynamic Gate and Substrate Control Abstract: This brief presents a gain-enhanced complimentary metal-oxide-semiconductor (CMOS) charge pump (CP) circuit via dynamically controlling the gate and substrate terminals of each pMOS pass transistor. The proposed control strategy enables the CP circuit free of the threshold-voltage drops, the body effect, and the floating substrate terminals of pass devices. The on-resistance of each pass device is
  23. 23. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com also reduced to improve the gain and the power efficiency of the CP circuit. Implemented in a 0.35-μm single n-well CMOS process, the proposed four-stage monolithic CP circuit can operate with a supply voltage down to 0.9 V and deliver a maximum output current of about 100 μA. The proposed CP circuit also achieves a high voltage gain of 4 with two complementary-phase nonoverlapping clock signals. ETPL VLSI-062 Embedding Repeaters in Silicon IPs for Cross-IP Interconnections Abstract: During systems-on-a-chip (SoC) integration, silicon intellectual properties (IPs) are generally regarded as blockages to long interconnections that connect different IPs. With this constraint, conventional designs are forced to place those repeaters that drive long interconnections outside the IP. These designs either lead to a longer interconnection distance requiring more repeaters or result in a longer signal delay, since the interconnection wire is not appropriately segmented by the repeaters. To solve these problems, we designed the IPs such that designers can embed the repeaters in the IP for the SoC integration. In other words, it allows the cross-IP interconnections to be routed over the IP using repeaters inserted in the IP. The design concept, physical implementation, and application examples of the embedded repeaters are described in this brief ETPL VLSI-063 RATS: Restoration-Aware Trace Signal Selection for Post-Silicon Validation Abstract: Post-silicon validation is one of the most important and expensive tasks in modern integrated circuit design methodology. The primary problem governing post-silicon validation is the limited observability due to storage of a small number of signals in a trace buffer. The signals to be traced should be carefully selected in order to maximize restoration of the remaining signals. Existing approaches have two major drawbacks. They depend on partial restorability computations that are not effective in restoring maximum signal states. They also require long signal selection time due to inefficient computation as well as operating on gate-level netlist. We have proposed a signal selection approach based on total restorability at gate-level, which is computationally more efficient (10 times faster) and can restore up to three times more signals compared to existing methods. We have also developed a register transfer level signal selection approach, which reduces both memory requirements and signal selection time by several orders-of-magnitude. ETPL VLSI-064 Test Patterns of Multiple SIC Vectors: Theory and Application in BIST Schemes Abstract: This paper proposes a novel test pattern generator (TPG) for built-in self-test. Our method generates multiple single-input change (MSIC) vectors in a pattern, i.e., each vector applied to a scan chain is an SIC vector. A reconfigurable Johnson counter and a scalable SIC counter are developed to generate a class of minimum transition sequences. The proposed TPG is flexible to both the test-per-clock and the test-per-scan schemes. A theory is also developed to represent and analyze the sequences and to extract a class of MSIC sequences. Analysis results show that the produced MSIC sequences have the favorable features of uniform distribution and low input transition density. The performances of the designed TPGs and the circuits under test with 45 nm are evaluated. Simulation results with ISCAS benchmarks demonstrate that MSIC can save test power and impose no more than 7.5% overhead for a scan design. It also achieves the target fault coverage without increasing the test length.
  24. 24. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com ETPL VLSI-065 Effective and Efficient Approach for Power Reduction by Using Multi-Bit Flip-Flops Abstract: Power has become a burning issue in modern VLSI design. In modern integrated circuits, the power consumed by clocking gradually takes a dominant part. Given a design, we can reduce its power consumption by replacing some flip-flops with fewer multi-bit flip-flops. However, this procedure may affect the performance of the original circuit. Hence, the flip-flop replacement without timing and placement capacity constraints violation becomes a quite complex problem. To deal with the difficulty efficiently, we have proposed several techniques. First, we perform a co-ordinate transformation to identify those flip-flops that can be merged and their legal regions. Besides, we show how to build a combination table to enumerate possible combinations of flip-flops provided by a library. Finally, we use a hierarchical way to merge flip-flops. Besides power reduction, the objective of minimizing the total wirelength is also considered. The time complexity of our algorithm is $Theta({rm n}^{1.12})$ less than the empirical complexity of $Theta({rm n}^{2})$. According to the experimental results, our algorithm significantly reduces clock power by 20–30% and the running time is very short. In the largest test case, which contains 1 700 000 flip-flops, our algorithm only takes about 5 min to replace flip-flops and the power reduction can achieve 21%. ETPL VLSI-066 Reconfigurable Accelerator for the Word-Matching Stage of BLASTN Abstract: BLAST is one of the most popular sequence analysis tools used by molecular biologists. It is designed to efficiently find similar regions between two sequences that have biological significance. However, because the size of genomic databases is growing rapidly, the computation time of BLAST, when performing a complete genomic database search, is continuously increasing. Thus, there is a clear need to accelerate this process. In this paper, we present a new approach for genomic sequence database scanning utilizing reconfigurable field programmable gate array (FPGA)-based hardware. In order to derive an efficient structure for BLASTN, we propose a reconfigurable architecture to accelerate the computation of the word-matching stage. The experimental results show that the FPGA implementation achieves a speedup around one order of magnitude compared to the NCBI BLASTN software running on a general purpose computer. ETPL VLSI-067 Architecturally Homogeneous Power-Performance Heterogeneous Multicore Systems Abstract: Dynamic voltage and frequency scaling (DVFS), a widely adopted technique to ensure safe thermal characteristics while delivering superior energy efficiency, is rapidly becoming inefficient with technology scaling due to two critical factors: 1) inability to scale the supply voltage due to reliability concerns and 2) dynamic adaptations through DVFS cannot alter underlying power hungry circuit characteristics, designed for the nominal frequency. In this paper, we show that DVFS scaled circuits substantially lag in energy efficiency, by 22%–86%, compared to ground up designs for target frequency levels. We propose architecturally homogeneous power-performance heterogeneous multicore systems, a fundamentally alternate means to design energy efficient multicore systems. Using a system level computer-aided design (CAD) approach, we seamlessly integrate architecturally identical cores, designed for different voltage-frequency domains. We use a combination of standard cell library based CAD flow and full system architectural simulation to demonstrate 11%–22% improvement in energy efficiency using our design paradigm.
  25. 25. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com ETPL VLSI-068 Active Filter-Based Hybrid On-Chip DC–DC Converter for Point-of-Load Voltage Regulation Abstract: An active filter-based on-chip DC–DC voltage converter for application to distributed on-chip power supplies in multivoltage systems is described in this paper. No inductor or output capacitor is required in the proposed converter. The area of the voltage converter is therefore significantly less than that of a conventional low-dropout (LDO) regulator. Hence, the proposed circuit is appropriate for point- of-load voltage regulation for noise sensitive portions of an integrated circuit. The performance of the circuit has been verified with Cadence Spectre simulations and fabricated with a commercial 110 nm complimentary metal oxide semiconductor (CMOS) technology. The area of the voltage regulator is 0.015 ${rm mm}^{2}$ and delivers up to 80 mA of output current. The transient response with no output capacitor ranges from 72 to 192 ns. The parameter sensitivity of the active filter is also described. The advantages and disadvantages of the active filter-based, conventional switching, linear, and switched capacitor voltage converters are compared. The proposed circuit is an alternative to classical LDO voltage regulators, providing a means for distributing multiple local power supplies across an integrated circuit while maintaining high current efficiency and fast response time within a small area. ETPL VLSI-069 CusNoC: Fast Full-Chip Custom NoC Generation Abstract: We propose a full-chip synthesis methodology to construct custom network-on-chips (CusNoCs) for NoC-based systems. The proposed scheme generates irregular network topologies for application-specific designs with known communication demands. In this method, processors and the communication architecture can be synthesized simultaneously in the floorplanning process, and thus it is called CusNoC. CusNoC synthesizes CusNoC in two steps. The target network topology is first generated based on communication analysis. Processing elements are partitioned into groups such that the utility of routers will be maximized if a router is assigned to each group. In this way, the number of routers passed by a packet, or hops, is minimized, and so is the power consumption in the network. The final network topology is formed by properly connecting these groups. A wirelength-aware floor planning is then carried out to optimize circuit size as well as wirelength. Experimental results show that CusNoC produces custom NoCs with better performance than previous methods while the computation time is significantly shorter. This method is also more scalable, which makes it ideal for complicated systems. ETPL VLSI-070 Cooperating Virtual Memory and Write Buffer Management for Flash-Based Storage Systems Abstract: Flash memory is becoming the preferred choice of secondary storage in mobile devices and embedded systems. The performance of Flash memory is dictated by asymmetric speeds of read and write, limited number of erase times, and the absence of in-place updates. To improve the performance of Flash-based storage systems, the write buffer has been provided in Flash memories recently. At the same time, new virtual memory management strategies have been proposed in recent studies that consider the characteristics of Flash memory. Currently, approaches on these two memory layers are considered separately, which fail to explore the full potential of these two layers. In this paper, we propose cooperative management schemes for virtual memory and write buffer to maximize the performance of Flash-memory-based systems. Management on virtual memory is designed to exploit write buffer status via reordering of the write sequences. The proposed write buffer management scheme works seamlessly with the proposed virtual memory management scheme. Experimental results show that significant
  26. 26. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com improvement in I/O performance and reduction of the number of erase and write operations can be achieved compared to the state-of-art approaches. ETPL VLSI-071 MDC FFT/IFFT Processor With Variable Length for MIMO-OFDM Systems Abstract: This paper presents an multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output- orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-$N_{s}$ butterflies at each stage, where $N_{s}$ is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let $N_{s}=4$ and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with an UMC 90-nm CMOS technology with a core area of 3.1 ${rm mm}^{2}$. The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption. ETPL VLSI-072 Current-Reused 2.4-GHz Direct-Modulation Transmitter With On-Chip Automatic Tuning Abstract: This paper presents the design, analysis, and experimental verification of a self-calibrating current-reused 2.4-GHz direct-modulation transmitter for short-range wireless applications. The key contributions are the design/analysis of a stacked power amplifier (PA)/voltage-controlled oscillator (VCO) architecture, the nonlinear frequency-dependent analysis of a Gilbert-cell-based root-mean-square detector, and an on-chip $LC$-tank calibration circuit that needs no analog-to-digital convertor (ADC)/digital signal processor. The stacked architecture reduces the number of required regulators, utilizes supply headroom effectively, and allows for an “ADC-less” calibration loop that can dynamically tune the PA center frequency by sensing the transmitted signal. The very nature of direct-modulation architecture obviates additional high-purity signal generators, reducing complexity and allowing online calibration. The system was implemented in TSMC 0.18 $mu{rm m}$ CMOS, occupies 0.7 ${rm mm}^{2}~({rm TX})+0.1~{rm mm}^{2}$ (self-tuning), and was measured in a QFN48 package on an FR4 PCB. Automatically correcting PA/VCO tank misalignment in this case yielded ${>}{rm 4}~{rm dB}$ increase in output power. With the automatic tuning active, the transmitter delivers a measured output power ${>}{rm 0}~{rm dBm}$ to a 100-$Omega$ differential load, and the system consumes 22.9 mA from a 1.8-V core-circuit supply.
  27. 27. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com ETPL VLSI-073 Reconfigurable Adaptive Singular Value Decomposition Engine Design for High- Throughput MIMO-OFDM Systems Abstract: Singular value decomposition (SVD) is an optimal method to obtain spatial multiplexing gain in multi-input multi-output (MIMO) channels. However, the high cost of implementation and high decomposing latency of the SVD restricts its usage in current wireless communication applications. In this paper, we present a complete adaptive SVD algorithm and a reconfigurable architecture for high- throughput MIMO-orthogonal frequency division multiplexing systems. There are several proposed architectural design techniques: reconfigurable scheme, division-free adaptive step size scheme, early termination scheme, and data interleaving scheme. The reconfigurable scheme can support all antenna configurations in a MIMO system. The division-free adaptive step size and early termination schemes are used to effectively reduce the decomposing latency and improve hardware utilization. The data interleaving scheme helps to deal with several channel matrices concurrently. Besides, we propose an orthogonal reconstruction scheme to obtain more accurate SVD outputs, and then the system performance will be greatly enhanced. We apply our SVD design to the IEEE 802.11 n applications. This design is implemented and fabricated in UMC 90 nm 1P9M CMOS technology. The maximum operating frequency is measured to be at 101.2 MHz, and the corresponding power dissipation is at 125 mW. The core size is 2.17 ${rm mm}^{2}$ and the die size occupies 4.93 ${rm mm}^{2}$. The chip result shows that the average latency is only 0.33% of the wireless local area network coherence time. Hence, the proposed reconfigurable adaptive SVD engine design is very suitable for high-throughput wireless communication applications. ETPL VLSI-074 The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures Abstract: Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resource–quality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches. ETPL VLSI-075 Exploring the Use of Emerging Nonvolatile Memory Technologies in Future FPGAs, Abstract: As new nonvolatile memory technologies become increasingly mature, there has been a growing interest on investigating their use in future field-programmable gate arrays (FPGAs). Similar to existing FPGAs with embedded Flash memory, future FPGAs can embed these new nonvolatile memories to persistently store configuration data. By comparing with prior work, we first propose the more appropriate design style for new nonvolatile configuration data storage memory. Moreover, this brief studies a dynamic random-access memory (DRAM)-based FPGA design strategy enabled by high-
  28. 28. Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com density embedded nonvolatile memory. Existing FPGAs do not use on-chip DRAM cells for configuration data storage mainly because DRAM self-refresh involves destructive DRAM read. This problem can be solved, if we use embedded nonvolatile memory as primary FPGA configuration data storage and externally refresh on-chip DRAM cells. Analysis and simulations have been carried out to demonstrate the potential advantages of such a design strategy. ETPL VLSI-076 Broadside and Skewed-Load Tests Under Primary Input Constraints Abstract: Tester limitations may impose certain constraints on the primary input vectors applicable as part of a two-pattern test for delay faults. Under these constraints, the primary input vectors may be held constant, or the second primary input vector of a test may be obtained by a single shift of a scan chain relative to the first. The goal of this brief is to study the differences in achievable transition fault coverage between various primary input constraints that are similar to the commonly used ones of holding or shifting primary input vectors. This brief also studies the possibility of combining the constraints in order to increase the transition fault coverage. The combination requires a fixed and circuit-independent hardware structure similar to the case where shifting of primary input vectors is used. This study is done using test sets that consist of both broadside and skewed-load tests in order to maximize the transition fault coverage. ETPL VLSI-078 Supply Noise Suppression by Triple-Well Structure Abstract: This brief discusses the impact of twin- and triple-well structures on power supply noise, and a substrate model for simulating the power supply noise. We observed $V_{rm ss}$ noise reduction by the resistive network of the p-substrate and $V_{rm dd}$ noise reduction by the junction capacitance of a triple-well structure on a 90-nm test chip. Measurement results also showed that the total noise reduction of a triple-well structure is superior to that of a twin-well structure. The measurement results correlate well with the results obtained from the power supply noise simulation using a hierarchical resistive mesh model. Our simulation-based verification indicates that in common CMOS design, a triple-well structure can reduce the power supply drop by 10%–40% or the decoupling capacitance area by 5%–10%. We also verified that supply drop sensitivity to variation of the well junction capacitance is sufficiently small and that supply noise reduction using a triple-well structure is robust to process variation. ETPL VLSI-079 Software-Based Self Test Methodology for On-Line Testing of L1 Caches in Multithreaded Multicore Architectures Abstract: The flexibility that allows the application of different March tests is a critical requirement for on-line testing of memory arrays. In a previous study, we have introduced a low-cost software-based self test (SBST) program development methodology for on-line periodic testing of L1 caches that utilizes direct cache access (DCA) instructions and exploits the native monitoring hardware available in modern architectures. In this brief, we discuss a multithreaded optimization of this SBST methodology that exploits the thread level parallelism of multithreaded multicore architectures in order to speed up March test execution by elaborating the low level multiple sub-bank cache organization. The effectiveness of the methodology and its multithreaded optimization is demonstrated on the L1 caches of OpenSPARC T1 processor. Our results showed a speedup of more than 1.7 when the multithreaded optimization is applied and an acceptable performance overhead (less than 11%), even in intensive periodic test scenarios.

×