This is my Master's oral-defense presentation;
the title is "Towards Algorithmic Multi-ported Memory: Techniques and Design Trade-offs".
In the summary, we list design guidelines for future work.
The AXI protocol specification describes an advanced bus architecture with burst-based transactions using separate address/control and data phases over independent channels. It supports features like out-of-order transaction completion, exclusive access for atomic operations, cache coherency, and a low power interface. The AXI protocol is commonly used in System-on-Chip designs for high performance embedded processors and peripherals.
Semiconductor engineering is becoming a more dynamic field as technology scaling takes place. Power reduction techniques are attractive solutions to the performance, area, and power trade-off. Power reduction in VLSI designs is therefore critical.
The document discusses ring oscillators, which are electronic circuits that produce continuous waveforms without input. Ring oscillators use an odd number of inverters connected in a feedback loop. The document describes the design and simulation of 3-stage, 5-stage, and 9-stage ring oscillators using 180nm CMOS technology. Simulation results show frequency decreases and power consumption increases as the number of stages increases. The 5-stage ring oscillator has the lowest power consumption of the three.
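As a rough sanity check on the stage-count trend described above, the oscillation frequency of an N-stage ring is often approximated as f = 1 / (2 * N * t_pd), where t_pd is the per-stage propagation delay. The sketch below uses an assumed 50 ps per-stage delay, not the document's 180nm simulation figures:

```python
def ring_osc_freq_hz(stages: int, stage_delay_s: float) -> float:
    """Approximate ring oscillator frequency: f = 1 / (2 * N * t_pd)."""
    if stages % 2 == 0 or stages < 3:
        raise ValueError("a ring oscillator needs an odd number (>= 3) of inverters")
    return 1.0 / (2 * stages * stage_delay_s)

t_pd = 50e-12  # assumed 50 ps per-stage delay, an illustrative value
for n in (3, 5, 9):
    print(f"{n}-stage: {ring_osc_freq_hz(n, t_pd) / 1e9:.2f} GHz")
```

Consistent with the simulation results above, the estimated frequency falls as the number of stages grows.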
http://www.fpga4student.com/2017/04/verilog-code-for-16-bit-risc-processor.html
Verilog code for 16-bit RISC processor on fpga4student.com
The instruction set is in the doc file.
VLSI power estimation is a vital component of modern electronic design. Rapid changes in the electronics infrastructure have made power a paramount concern in VLSI designs.
This includes digital signal data transmission, and baseband and bandpass transmission. Also detailed are PAM, PPM, PWM, PCM, DPCM, DM, ADM, ASK, PSK, and FSK.
This document discusses engineering change orders (ECOs) used to fix timing, functional, power, and clock issues after physical design and sign-off. It describes the motivation for ECOs due to tool limitations and differences between implementation and sign-off. Common ECO techniques are listed for timing (driver upsizing, buffer insertion, etc.), power (vt-swapping, downsizing, etc.), and metal-only ECOs. Timing ECO tools from Synopsys, Cadence, and other vendors are also mentioned. Upcoming ECO technologies like dynamic power optimization and automatic legalization are noted.
The document describes a project on designing a 4-bit linear feedback shift register (LFSR). It discusses how an LFSR works by shifting bits left and applying an XOR operation to the last two bits. It then provides details on the circuit components including D flip-flops and XOR gates. Applications mentioned include generating pseudo-random numbers for cryptography, digital communications systems, and testing systems.
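The shift-and-XOR behavior described above can be sketched in a few lines of Python. The tap positions (the two most significant bits, i.e. polynomial x^4 + x^3 + 1) are an assumed maximal-length choice and may differ from the project's actual circuit:

```python
def lfsr4_sequence(seed: int = 0b1000):
    """Yield successive states of a 4-bit Fibonacci LFSR.

    The register shifts left each step; the feedback bit (XOR of the
    two most significant bits, an assumed maximal-length tap choice)
    is shifted in on the right.
    """
    state = seed & 0xF
    while True:
        fb = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | fb) & 0xF
        yield state

gen = lfsr4_sequence()
states = [next(gen) for _ in range(15)]
print(states)  # visits all 15 non-zero states before repeating
```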
This document discusses FPGA based system design. It begins with an introduction to digital system design approaches, including using discrete logic gates on a board versus using a single programmable chip. It then covers the evolution of programmable logic devices from simple PLDs like PLA and PAL, to more complex CPLDs, and finally modern FPGAs. FPGAs contain logic blocks, programmable routing switches, and I/O pads. Commercial FPGA products from companies like Xilinx and Altera are also mentioned.
This document summarizes topics related to test generation for combinational and sequential circuits, including:
- ATPG algorithms for combinational circuits like Boolean difference, single-path sensitization, D-algorithm, and PODEM.
- Problems with testing sequential circuits and approaches like time-frame expansion, simulation-based testing, and scan-based testing.
- Key concepts for ATPG algorithms like fault cones, forward and backward implication, essential prime implicants, and singular covers.
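To make the goal of these ATPG algorithms concrete, the sketch below finds a test vector for a single stuck-at fault on a tiny made-up circuit by brute force; real algorithms such as the D-algorithm and PODEM reach the same answer without exhaustive search:

```python
from itertools import product

# Brute-force test generation for a single stuck-at fault on an
# illustrative circuit y = (a AND b) OR c. A test vector is any
# input that distinguishes the good circuit from the faulty one.

def good(a, b, c):
    return (a & b) | c

def faulty(a, b, c):
    # the internal net "a AND b" is stuck-at-0
    return 0 | c

tests = [v for v in product((0, 1), repeat=3) if good(*v) != faulty(*v)]
print(tests)
```

For this fault the only detecting vector is (1, 1, 0): the fault effect propagates only when a AND b would be 1 and c does not mask it.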
These filters have properties that lie between those of the Butterworth and Chebyshev filters, so it is appropriate to call them transitional Butterworth-Chebyshev filters.
Shift and Rotate Instructions
Shift and Rotate Applications
Multiplication and Division Instructions
Extended Addition and Subtraction
ASCII and Packed Decimal Arithmetic
CHI is an evolution of the ACE protocol and part of the AMBA architecture. It was designed to improve performance and scalability for applications in mobile, networking, automotive and data center systems. CHI uses a layered architecture with protocol, network and link layers. It supports coherency across processor clusters and memory with topologies like ring, mesh and crossbar. Key nodes include request nodes, home nodes and subordinate nodes. The system address map routes transactions between nodes using unique node IDs.
Raspberry Pi - Lecture 3: Embedded Communication Protocols (Mohamed Abdallah)
The document discusses various embedded communication protocols. It begins by defining communication in embedded systems and examples of common protocols including UART, I2C, SPI, CAN and LIN. It then explains key concepts such as bit rate, baud rate, serial vs parallel communication and synchronous vs asynchronous communication. The document proceeds to provide detailed explanations of the UART, I2C and SPI protocols, including their frame formats, data validity rules, arbitration mechanisms and usage examples. It concludes by noting some key characteristics of each protocol.
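As an illustration of the frame formats the lecture covers, here is a sketch of a standard 8N1 UART frame (one start bit, eight data bits sent LSB-first, no parity, one stop bit); the byte value is an arbitrary example:

```python
def uart_8n1_frame(byte: int) -> list:
    """Return the bit sequence for one 8N1 UART frame.

    The line idles high; a frame is a start bit (0), eight data bits
    LSB-first, and a stop bit (1).
    """
    data_bits = [(byte >> i) & 1 for i in range(8)]  # LSB first
    return [0] + data_bits + [1]

print(uart_8n1_frame(0x55))
```

For 0x55 this prints [0, 1, 0, 1, 0, 1, 0, 1, 0, 1].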
VLSI lab manual Part A, VTU 7th sem, KIT Tiptur (Pramod Kumar S)
This document provides procedures for digital and analog VLSI design experiments using CAD tools. It includes the VLSI design flow, prerequisites for using the tools, verification using simulation, and synthesis. Experiments cover digital logic gates, flip-flops, adders, counters, and analog circuits like inverters, amplifiers, and data converters. The goal is to introduce students to computer-aided design of digital and analog VLSI systems.
This document describes the GPRS service (General Packet Radio Service), introduced by ETSI to allow access to packet networks through protocols such as TCP/IP without intermediate circuit connections. GPRS enables higher data transfer rates and an asymmetric transmission mode that shares channels among users to improve efficiency. This makes access to the internet, email, and mobile applications faster and cheaper than over GSM.
This is mainly intended for young faculty who teach ARM processor architecture. It may also be useful to those keen on understanding the inner workings of the ARM architecture. Good luck!
HBM stands for high-bandwidth memory, a memory interface used with 3D-stacked DRAM (dynamic random access memory) in GPUs, as well as in the server, machine-learning DSP, high-performance computing, networking, and client spaces.
This document summarizes key data-transmission concepts, including modulation, transmitter and receiver circuits, telegraph adapters, multiplexing, concentrators, information coding, circuit- and packet-switched networks, and the basic elements of data transmission networks.
2019-2 Testing and Verification of VLSI Design: Verification (Usha Mehta)
This document provides an introduction to verification of VLSI designs and functional verification. It discusses sources of errors in specifications and implementations, ways to reduce human errors through automation and mistake-proofing techniques. It also covers the reconvergence model of verification, different verification methods like simulation, formal verification and techniques like equivalence checking and model checking. The document then discusses verification flows, test benches, different types of test cases and limitations of functional verification.
The document discusses pass transistor logic circuits. It describes how nMOS pass transistors can transfer logic 1 and 0 signals. Transmission gates are introduced which use both nMOS and pMOS pass transistors to pass strong signals in both directions. Applications of transmission gates include multiplexers, XOR gates, D latches, and D flip-flops. Clock skew management and different pass transistor logic families are also covered.
The document discusses VLSI design methodologies and limitations using CAD tools. It provides an overview of different VLSI design methodologies such as full custom design, semi-custom design, gate array design, standard cell design, FPGA-based design and CPLD-based design. It also discusses the evolution of VLSI design flows from past to present technologies. Furthermore, it describes the complexities in VLSI design and how CAD tools help manage these complexities and automate the design process. Finally, it summarizes different types of VLSI CAD tools and compares various open source and licensed CAD tool vendors.
This document discusses clock distribution networks in integrated circuits. It describes how clock generators produce timing signals to synchronize a system's operation using resonant circuits and amplifiers. As process technologies allow for higher integration and larger die sizes, clock networks must support higher frequencies while minimizing skew and jitter. Various clock distribution topologies are presented, including unconstrained trees, balanced trees, central spines, grids, and hybrid distributions, each with advantages and disadvantages depending on the design.
Placement is the process of determining the locations of circuit devices on a die surface. It is an important stage in the VLSI design flow because it affects routability, performance, heat distribution, and, to a lesser extent, power consumption of a design.
A typical design flow follows the structure below and can be broken down into multiple steps. Some of these phases happen in parallel and some sequentially.
Requirements
A customer of a semiconductor firm is typically another company that plans to use the chip in its own systems or end products, so the customer's requirements play an important role in deciding how the chip should be designed.
The first step is to collect the requirements, estimate the end product's market value, and estimate the resources required to do the project.
Specifications
The next step is to write specifications that abstractly describe the functionality, interfaces, and overall architecture of the chip to be designed. These can be along the lines of:
Requires computational power to run imaging algorithms to support virtual reality.
Requires two ARM A53 processors with coherent interconnect and should run at 600 MHz.
Requires USB 3.0, Bluetooth, and PCIe 2nd gen interfaces.
It should support 1920x1080 pixel displays with an appropriate controller.
Digital Design
Because of the complex nature of modern chips, it's impractical to build everything from scratch, and in many cases components are reused.
For example, company A requires a FlexCAN module to interact with other modules in an automobile. They can either buy the FlexCAN design from another company to save time and effort or spend resources to build one.
It's not practical to design such a system from basic building blocks such as flip-flops and CMOS transistors.
Instead, a behavioral description is developed to analyze the design in terms of functionality, performance, and other high-level issues using a Hardware Description Language such as Verilog or VHDL.
This is usually done by a digital designer and is similar to a high-level computer programmer equipped with digital electronics skills.
Verification
Once the RTL design is ready, it needs to be verified for functional correctness.
For example, a DSP processor is expected to issue bus transactions to fetch instructions from memory, and we need confidence that this happens as expected.
Functional verification is required at this point; it is done with the help of EDA simulators that can model the design and apply different stimuli to it. This is the job of a pre-silicon verification engineer.
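The simulate-and-compare idea can be sketched as driving random stimulus into a model of the design and checking it against a golden reference model. The 8-bit adder below is a hypothetical example, not a design from this document:

```python
import random

# Simulation-based functional verification in miniature: apply random
# stimulus to a device-under-test model and compare every result
# against a golden reference model.

def dut_adder8(a: int, b: int) -> int:
    return (a + b) & 0xFF  # stands in for the RTL's wrap-around adder

def reference_adder8(a: int, b: int) -> int:
    return (a + b) % 256  # golden model

random.seed(0)  # reproducible stimulus
for _ in range(1000):
    a, b = random.randrange(256), random.randrange(256)
    assert dut_adder8(a, b) == reference_adder8(a, b), (a, b)
print("all stimuli passed")
```

Real testbenches add constrained-random stimulus, coverage collection, and assertions, but the compare-against-reference loop is the core of the method.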
Logic Synthesis
Now we convert this design into a hardware schematic with real elements such as combinational gates and flip-flops. This step is called synthesis.
Logic synthesis tools convert the RTL description in an HDL into a gate-level netlist. This netlist is a description of the circuit in terms of gates and the connections between them.
Logic synthesis tools ensure that the netlist meets timing, area, and power specifications. Typically, they have access to different technology node libraries.
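The notion of a netlist as "gates and the connections between them" can be illustrated with a toy data structure; the half-adder circuit below is a made-up example, not a synthesis tool's actual output format:

```python
# A toy gate-level netlist: each output net names a gate type and its
# fan-in nets. Evaluation walks the connections recursively. The
# circuit is a half adder: sum = a XOR b, carry = a AND b.
NETLIST = {
    "sum":   ("XOR", ["a", "b"]),
    "carry": ("AND", ["a", "b"]),
}

GATE_FN = {
    "XOR": lambda x, y: x ^ y,
    "AND": lambda x, y: x & y,
}

def evaluate(netlist, inputs):
    def net_value(net):
        if net in inputs:          # primary input
            return inputs[net]
        gate, fanin = netlist[net]  # gate output
        return GATE_FN[gate](*(net_value(n) for n in fanin))
    return {out: net_value(out) for out in netlist}

print(evaluate(NETLIST, {"a": 1, "b": 1}))  # {'sum': 0, 'carry': 1}
```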
The document outlines the generalized ASIC design flow, including high level design, RTL design, system and timing verification, physical design involving floorplanning, placement and routing, and performance and manufacturability verification through extraction, timing analysis, and design rule checking. Key steps involve specification capture, logic design and verification through simulation, RTL synthesis, gate-level mapping to a target library, placement and routing, and post-layout verification.
MODULE II: Control Unit, I/O Systems and Pipelining (15 Hours)
CPU control unit design: hardwired and micro-programmed design approaches. Peripheral devices and their characteristics: input-output subsystems, I/O device interface, I/O transfers (program-controlled, interrupt-driven, and DMA), privileged and non-privileged instructions, software interrupts and exceptions. Programs and processes: role of interrupts in process state transitions. I/O device interfaces: SCSI, USB. Basic concepts of pipelining, throughput and speedup, pipeline hazards.
This document provides information on the course EC8552 Computer Architecture and Organization. The objectives of the course are to understand MIPS instruction set architecture, arithmetic and logic units, data and control paths, memory and I/O organization, and parallel processing architectures. The outcomes are that students will be able to analyze computer system performance, illustrate arithmetic operations, describe pipelining and hazards, explain memory and I/O, and interpret parallel architectures. Assessments include tests, quizzes, assignments, and tutorials. The course will use an online Canvas platform.
Tony Gibbs gave a presentation on Amazon Redshift covering its history, architecture, concepts, and parallelism. The presentation included details on Redshift's cluster architecture, node components, storage design, data distribution styles, and terminology. It also provided a deep dive on parallelism in Redshift, explaining how queries are compiled and executed through streams, segments, and steps to enable massively parallel processing across nodes.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and use work load management.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Who Should Attend:
• Data Warehouse Developers, Big Data Architects, BI Managers, and Data Engineers
Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in H... (Ishan Thakkar)
This paper presents a novel, energy-efficient DRAM refresh technique called massed refresh that simultaneously leverages bank-level and subarray-level concurrency to reduce the overhead of distributed refresh operations in the Hybrid Memory Cube (HMC). In massed refresh, a bundle of DRAM rows in a refresh operation is composed of two subgroups mapped to two different banks, with the rows of each subgroup mapped to different subarrays within the corresponding bank. Both subgroups of DRAM rows are refreshed concurrently during a refresh command, which greatly reduces the refresh cycle time and improves bandwidth and energy efficiency of the HMC. Our experimental analysis shows that the proposed massed refresh technique achieves up to 6.3% and 5.8% improvements in throughput and energy-delay product on average over JEDEC standardized distributed per-bank refresh and state-of-the-art scattered refresh techniques.
This document provides an introduction and overview of the TMS system architecture and MongoDB. It discusses the TMS modules, CDR processing flow, MongoDB features, data model, queries, replica sets, user roles, and monitoring tools. The presentation aims to explain how MongoDB is used in the TMS system and demonstrate common operations.
This document provides guidance on database sizing including:
1. Reasons to size a database initially and continually such as selecting hardware, storage requirements, and understanding data characteristics.
2. Common data types and their storage sizes in bytes.
3. How to calculate average row size and the number of rows that fit in a database block.
4. How to calculate the number of blocks needed to store a table based on its number of rows and the rows per block.
5. Differences in sizing indexes compared to tables.
6. The process of sizing all major database objects and adding them to determine total disk space needs.
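The arithmetic in steps 3 and 4 can be sketched as follows; the block size, per-block overhead, and per-row overhead are assumed illustrative values, not figures from the document:

```python
import math

# Sketch of row/block sizing: how many rows fit in one block, and
# how many blocks a table needs. All constants are assumed
# illustrative values.
BLOCK_SIZE = 8192      # bytes per database block (assumed)
BLOCK_OVERHEAD = 192   # bytes reserved per block for headers (assumed)
ROW_OVERHEAD = 24      # bytes of per-row header (assumed)

def rows_per_block(avg_row_bytes: int) -> int:
    usable = BLOCK_SIZE - BLOCK_OVERHEAD
    return usable // (avg_row_bytes + ROW_OVERHEAD)

def blocks_for_table(row_count: int, avg_row_bytes: int) -> int:
    return math.ceil(row_count / rows_per_block(avg_row_bytes))

# e.g. a table of 10 million rows averaging 120 bytes each
avg_row = 120
print(rows_per_block(avg_row), "rows/block")
print(blocks_for_table(10_000_000, avg_row), "blocks")
```

Repeating this for every table and index, then summing, gives the total disk estimate described in step 6.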
The document discusses the need for a W3C community group on RDF stream processing. It notes there is currently heterogeneity in RDF stream models, query languages, implementations, and operational semantics. The speaker proposes creating a W3C community group to better understand these differences, requirements, and potentially develop recommendations. The group's mission would be to define common models for producing, transmitting, and continuously querying RDF streams. The presentation provides examples of use cases and outlines a template for describing them to collect more cases to understand requirements.
This document discusses optimizing columnar data stores. It begins with an overview of row-oriented versus column-oriented data stores, noting that column stores are well-suited for read-heavy analytical loads as they only need to read relevant data. The document then covers the history of columnar stores and notable features like data encoding, compression, and lazy decompression. It provides examples of run length and dictionary encoding. The document also discusses columnar file formats like RCFile, ORC, and Parquet, providing more details on ORC. It concludes with a case study where optimizations to a petabyte-scale data warehouse including sorting, changed compression, and other configuration changes improved query performance significantly through reduced data size.
This document discusses optimizing columnar data stores. It begins with an overview of row-oriented versus column-oriented data stores, noting that column stores are well-suited for read-heavy analytical loads as they only need to read relevant data. The document then covers the history of columnar stores and notable features like data encoding, compression techniques like run-length encoding, and lazy decompression. Specific columnar file formats like RCFile, ORC, and Parquet are mentioned. The document concludes with a case study describing optimizations made to a 1PB Hive table that resulted in a 3x query performance improvement through techniques like explicit sorting, improved compression, increased bucketing, and stripe size tuning.
The document discusses vector computers and their instruction sets. It begins by defining supercomputers and describing the CDC 6600, one of the earliest supercomputers. It then covers vector instruction sets, how vector instructions are executed in parallel across functional units and memory banks, and techniques like vector chaining and stripmining that improve performance. Overall, the document provides an overview of the design of vector computers and their vector processing capabilities.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this webinar, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to design schemas and load data efficiently
• Learn best practices for workload management, distribution and sort keys, and optimizing queries
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
The document describes C-Store, a column-oriented database management system. Some key points:
- C-Store stores data by column rather than by row to optimize for analytics queries that access a small number of columns from large tables.
- It uses column compression techniques, big disk blocks, and materialized views over columns rather than secondary indexes to improve read performance.
- Updates are handled by a write-optimized column store that periodically merges data into the read-optimized main store using a "tuple mover." This provides a hybrid approach between update-heavy row stores and read-heavy column stores.
Sap technical deep dive in a column oriented in memory databaseAlexander Talac
The document describes a lecture on column-oriented in-memory databases. The lecture covers the status quo of enterprise computing, database storage techniques like row and column storage layouts, in-memory database operators like scanning and aggregation, and advanced storage techniques like dictionary encoding and tuple reconstruction. The goal is to provide a deep technical understanding of column-oriented in-memory databases and their application in enterprise systems.
Similar to Algorithmic Multi-ported Memory(MEM) - Comprehensive Techniques Guideline (20)
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
1. Student: Chun-Feng Chen
Advisor: Bo-Cheng Charles Lai
Mar. 8th, 2018
NCTU Institute of Electronics – Parallel Computing System Lab (NCTU PCS) Hsinchu, R.O.C
Towards Algorithmic Multi-ported Memory: Techniques and Design Trade-offs
4. Chun-Feng Chen NCTU_IEE - PCS Lab
Introduction
• Algorithmic Multi-ported Memory (AMM)
• Multi-ported memories are important functional modules in modern digital systems
• E.g. shared cache in multi-core processors, routing tables of switches, etc.
• AMM composes simple SRAMs and logic to support multiple reads and writes
• Potential to attain better performance than circuit-based multi-ported memory (CMM)
• Most of the previous works target FPGAs
• Laforest, Charles Eric, et al. "Efficient Multi-ported Memories for FPGAs" [ACM 2010]
• Charles Eric Laforest, et al. "Multi-ported Memories for FPGAs via XOR" [ACM 2012]
• Charles Eric Laforest, et al. "Composing Multi-Ported Memories on FPGAs" [ACM 2014]
• Jiun-Liang Lin, et al. "BRAM Efficient Multi-ported Memory on FPGAs" [VLSI-DAT 2015]
• Jiun-Liang Lin, et al. "Efficient Designs of Multi-ported Memory on FPGAs" [TVLSI 2016]
• Kun-Hua Huang, et al. "An Efficient Hierarchical Banking Structure for Algorithmic Multi-
ported Memory on FPGAs" [TVLSI 2017]
• Sundar Iyer, et al. "Algorithmic Memory Brings an Order of Magnitude Performance Increase
to Next Generation SoC Memories" [DesignCon 2012]
5. Motivation – FPGA Limits AMM Exploration
• The limited resources on FPGAs constrain AMM exploration
• Limited number of BRAMs, F/Fs, slice LUTs, etc.
• Unable to explore important design factors of AMM
• Number of ports, memory depth, banking structures
• AMM has more significant benefits at greater depth
• E.g. from 512K to 16M depth, AMM 4R1W attains 1.25% to 36.47% shorter latency than circuit-based approaches
• BRAM size is fixed
• 1K depth with 32-bit data width
• Unable to explore the impact of different bank sizes
• E.g. choosing a proper banking structure for 2R1W can enhance latency/area/power by up to 9.53%/56.86%/33.39%
• BRAM port configuration is fixed
• dual-port 2RW mode
• Unable to explore the impact of different bank port configurations (4R1W, 2R2W, etc.)
• E.g. choosing a proper building memory cell for 8R4W can enhance area/power by up to 70.0%/6.37x
6. Our Contributions
• Implement all the AMM designs on 45nm technology
• Use SRAM as building memory cell
• Explore important design factors of AMM
• Different AMM designs, memory depth, port configurations, banking
structures, building memory cells, etc.
• Extensive experiments and comprehensive analysis
• Summarize observations into design guidelines
8. Algorithmic Multi-ported Memory (AMM) Techniques Categorization
• Non-table-based schemes
• Duplicate the memory module
• E.g. NTRep-Rd [ACM 2010]
• Table-based schemes
• Adopt lookup tables to track the address of the most up-to-date stored data
• E.g. TBLVT [ACM 2010]
9. AMM – Previously Proposed Designs
• Non-table-based approaches
• Use multiple banks to support multiple accesses
• Store parity data to support multiple reads and enable multiple writes [VLSI-DAT'15, TVLSI'16, TVLSI'17]
• HB-NTX-RdWr can scale the number of ports with a systematic flow [TVLSI'17]
• Table-based approaches
• Use multiple memory modules to support multiple accesses
• Use lookup tables to avoid module conflicts and track the most up-to-date values [VLSI-DAT'15, TVLSI'16, TVLSI'17]
12. Non-Table-Based Replication Multiple Reads (NTRep-Rd) – [ACM'10, TRETs'14]
• An mR1W memory module of the NTRep-Rd technique
• Duplicates the memory module to support multiple read ports
• Only one write port connects to each memory module
[7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2010.
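As a behavioral sketch, the replication idea boils down to one full storage copy per read port. This is illustrative Python of our own (the class and names are not from the thesis, which works at the RTL/SRAM level):

```python
class NTRepRd:
    """Sketch of NTRep-Rd: an mR1W memory built from m replicas of a 1R1W bank."""

    def __init__(self, depth, num_read_ports):
        # One full copy of the storage per read port.
        self.banks = [[0] * depth for _ in range(num_read_ports)]

    def write(self, addr, data):
        # The single write port broadcasts to every replica,
        # keeping all copies identical.
        for bank in self.banks:
            bank[addr] = data

    def read(self, port, addr):
        # Each read port owns a dedicated replica, so reads never conflict.
        return self.banks[port][addr]
```

Replication makes reads trivially parallel but multiplies storage by the number of read ports, which is exactly the cost the XOR-based and table-based schemes below try to avoid.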
16. Hierarchical Banking Non-Table XOR-Based Multiple Reads (HB-NTX-Rd) – [TVLSI'17]
• 4-read and 1-write memory
• Scales B-NTX-Rd to more reads in a hierarchical structure
• Uses 2R1W/3R modules as building blocks
• Case 1: 5R (no write request)
• Three reads access BANK0
• The other two reads access the other banks
• Case 2: 4R1W (one write request)
• W0 and two reads access BANK0
• The other two reads access the other banks
• W0 stores directly, and BANK1 is read to update the XOR-BANK
[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
18. Non-Table-Based XOR Multiple Writes (NTX-Wr) – [ACM'12, TRETs'14]
• A 1R2W memory module of the NTX-Wr technique
• Duplicates memory modules and stores XOR-encoded values to support multiple reads and writes
• (W0 ⊕ D0) ⊕ D0 = W0 and (W1 ⊕ D1) ⊕ D1 = W1: XORing with the same data twice recovers the written value
[9] LaForest, Charles Eric, et al. "Multi-ported Memories for FPGAs via XOR." Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2012.
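Behaviorally, the XOR trick can be sketched as follows (illustrative Python with our own naming, not the paper's RTL): each write port owns one bank, a write stores its data XORed with the other banks' contents at that address, and a read XORs all banks together to recover the latest value:

```python
class NTXWr:
    """Sketch of XOR-based multiple writes: one bank per write port."""

    def __init__(self, depth, num_write_ports):
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def write(self, port, addr, data):
        # XOR-encode against the other banks so that the XOR of all
        # banks at `addr` equals `data` after this write.
        others = 0
        for i, bank in enumerate(self.banks):
            if i != port:
                others ^= bank[addr]
        self.banks[port][addr] = data ^ others

    def read(self, addr):
        # XOR-decode across all banks to recover the latest value.
        value = 0
        for bank in self.banks:
            value ^= bank[addr]
        return value
```

The "read the other banks" step in `write` is what forces extra internal read ports (or replicas) in hardware; this sketch also ignores the case of two simultaneous writes to the same address, which the real designs must arbitrate.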
22. Hierarchical Banking Non-Table XOR-Based Multiple Reads and Writes (HB-NTX-RdWr) – [TVLSI'17]
• 2-read and 2-write memory
• Integrates HB-NTX-Rd and HB-NTX-Wr to enable multiple reads and writes
• Uses HB-NTX-Rd 4R1W/5R as building memory modules
(a) Non-Conflict-Write case; (b) Conflict-Write case
[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
23. HB-NTX-RdWr Systematic Flow – [TVLSI'17]
• The top-down flow increases read ports with HB-NTX-Rd, while the left-right flow increases write ports with HB-NTX-Wr
24. Table-Based Approaches
Multiple Reads and Writes:
(1) Table-Based Live Value Table (TBLVT)
(2) Table-Based Remap Table (TBRemap)
(3) Enhancing Table-Based Designs with Reduced Lookup Tables (TBLVT_HB-NTX-RdWr)
26. Table-Based Live Value Table (TBLVT) – [ACM'10, TRETs'14]
• Write request
• Dedicates each write to a certain memory module
• A lookup table (the LVT) tracks the latest location
• A read request queries the LVT first, then accesses the data from the correct memory location
• Design of the LVT size:
• log2(#NumModules) × MemoryDepth
[7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2010.
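The LVT mechanism can be sketched behaviorally (illustrative Python; the class and field names are ours): each write port owns a module, and a per-address table records which module holds the live value:

```python
class TBLVT:
    """Sketch of a table-based multi-write memory with a Live Value Table."""

    def __init__(self, depth, num_write_ports):
        # One building module per write port (a plain array stands in
        # for an mR1W module such as NTRep-Rd).
        self.modules = [[0] * depth for _ in range(num_write_ports)]
        # LVT: for each address, the module written most recently.
        self.lvt = [0] * depth

    def write(self, port, addr, data):
        self.modules[port][addr] = data
        self.lvt[addr] = port

    def read(self, addr):
        # Query the LVT first, then fetch from the live module.
        return self.modules[self.lvt[addr]][addr]
```

In hardware the LVT itself must sustain all the read and write ports, which is why its size and routing show up on the latency-critical path in the results below.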
27. Table-Based Approaches
Multiple Reads and Writes:
(1) Table-Based Live Value Table (TBLVT)
(2) Table-Based Remap Table (TBRemap)
(3) Enhancing Table-Based Designs with Reduced Lookup Tables (TBLVT_HB-NTX-RdWr)
28. Table-Based Remap (TBRemap) – [VLSI-DAT'15, TVLSI'16]
• Remap functions:
• Apply banking structure designs
• All reads and writes check the remap table to determine which memory bank to access
• A HWC distributes the multiple writes across banks, and a remap table tracks the latest location
• Design of the remap table size:
• ([log2(#DataBanks + 1)] – 1) × MemoryDepth
[11] Lai, Bo-Cheng Charles, and Jiun-Liang Lin. "Efficient Designs of Multi-ported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.1 (2017): 139-150.
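A much-simplified behavioral sketch of the remap idea (illustrative Python; the real HWC and banking logic are considerably more involved): concurrent writes are steered to distinct banks so they never collide, and the remap table remembers where each address's latest value went:

```python
class TBRemap:
    """Sketch of a remap-table memory: concurrent writes spread across banks."""

    def __init__(self, depth, num_banks):
        self.banks = [[0] * depth for _ in range(num_banks)]
        # Remap table: logical address -> bank holding the latest data.
        self.remap = [0] * depth

    def write_many(self, writes):
        # Stand-in for the HWC: give each concurrent write its own bank
        # (requires num_banks >= number of simultaneous writes).
        used = set()
        for addr, data in writes:
            bank = next(b for b in range(len(self.banks)) if b not in used)
            used.add(bank)
            self.banks[bank][addr] = data
            self.remap[addr] = bank

    def read(self, addr):
        # Every access consults the remap table before touching a bank.
        return self.banks[self.remap[addr]][addr]
```

Stale copies of an address may linger in other banks, but the remap table always points at the latest one, mirroring how the slide's design tracks the latest location.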
29. Table-Based Approaches
Multiple Reads and Writes:
(1) Table-Based Live Value Table (TBLVT)
(2) Table-Based Remap Table (TBRemap)
(3) Enhancing Table-Based Designs with Reduced Lookup Tables (TBLVT_HB-NTX-RdWr)
30. Enhancing Table-Based Designs with Reduced Lookup Tables
• The NTRep-Rd mR1W modules are replaced by HB-NTX-RdWr mRnW modules
• This reduces the lookup table size while using fewer modules, alleviating the routing complexity on the latency-critical path
• Example: a 2R4W 8K-depth memory
• The original TBLVT needs four NTRep-Rd 2R1W building modules: LVT size = 2 bits × 8K depth
• The enhanced TBLVT needs two HB-NTX-RdWr 2R2W building modules: LVT size = 1 bit × 8K depth
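The size reduction in the example follows directly from the LVT formula on slide 26; a quick check (plain Python, helper name is ours):

```python
import math

def lvt_bits(num_modules, depth):
    # LVT size = ceil(log2(#modules)) bits per entry, one entry per address.
    return math.ceil(math.log2(num_modules)) * depth

depth = 8 * 1024
assert lvt_bits(4, depth) == 2 * depth  # four 2R1W modules -> 2-bit entries
assert lvt_bits(2, depth) == 1 * depth  # two 2R2W modules -> 1-bit entries
```

Halving the modules halves the LVT entry width here, which is the whole point of building on higher-port modules.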
31. Performance and Impact of Design Factors
A. Experiment Setup
B. Circuit-level vs. Algorithmic Schemes
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Trade-off Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules
33. A. Experiment Setup – Read/Write-Path Logic
• The block diagram of an AMM architecture includes write-path logic, read-path logic, and the building memory cells
• The write path performs the data manipulation of a design, e.g. replication, lookup-table updates, etc.
• The read path retrieves data from the memory cells and decodes the correct value
• The read/write-path logic is synthesized from RTL with Design Compiler on TSMC 45nm
34. A. Experiment Setup – Memory Cells (SRAM)
• The block diagram of an AMM architecture includes write-path logic, read-path logic, and the building memory cells
• The memory cells are composed of SRAMs, e.g. in single-port or dual-port mode
• The CACTI integrated memory model estimates the performance of the different SRAM modules on TSMC 45nm
35. A. Experiment Setup – Algorithmic Multi-ported Memory Architecture
• By combining the synthesis results of the read-path and write-path logic with the CACTI estimates, we can evaluate the overall performance and cost of an AMM design
• Read/write-path logic + memory cells = overall performance of an AMM design
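The composition of the two tool flows can be pictured as a simple sum of the logic and SRAM contributions along the access path (illustrative Python; the function name, dictionary keys, and numbers are hypothetical, not measured results from the thesis):

```python
def amm_metrics(logic, sram):
    """Combine synthesized logic results with CACTI SRAM estimates."""
    return {
        "latency_ns": logic["latency_ns"] + sram["latency_ns"],  # serial path
        "area_mm2": logic["area_mm2"] + sram["area_mm2"],
        "power_mw": logic["power_mw"] + sram["power_mw"],
    }

# Hypothetical inputs, just to show how the pieces are combined.
logic = {"latency_ns": 0.5, "area_mm2": 0.01, "power_mw": 2.0}
sram = {"latency_ns": 1.2, "area_mm2": 0.30, "power_mw": 10.0}
overall = amm_metrics(logic, sram)
```

Keeping the logic and SRAM contributions separate is also what lets the later slides report how much of each metric the memory cells account for.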
36. Performance and Impact of Design Factors
A. Experiment Setup
B. Circuit-level vs. Algorithmic Schemes
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Trade-off Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules
37. B. Circuit-level vs. Algorithmic Schemes – 2RnW (Latency)
• CMM has shorter latency at shallow memory depths (< 64K); AMM has shorter latency at greater depths (> 128K)
• AMM is more scalable as memory depth increases: for 2R8W, from 16K to 16M, latency grows by 1.52x for HB-NTX-RdWr and 2.85x for TBLVT_B-NTX-Rd, but 23.46x for CMM
• Non-table designs have shorter latency than table-based ones: for 2R8W, HB-NTX-Rd attains 4.23% to 95.09% shorter latencies than TBLVT_B-NTX-Rd from 16K to 16M
38. B. Circuit-level vs. Algorithmic Schemes – mR1W (Area)
• Non-table-based AMM attains smaller area than CMM:
• 2R1W attains 6.38% to 36.2% smaller area
• 4R1W attains 15.27% to 67% smaller area
• 8R1W attains 59.33% to 3.01x smaller area when compared with CMM
39. B. Circuit-level vs. Algorithmic Schemes – 2RnW (Area)
• Non-table-based designs lose their area advantage over CMM as the write-port count grows
• Table-based designs still attain smaller area than CMM: for 2R8W, TBLVT_B-NTX-Rd attains 2.01% to 22.79% smaller area from 64K to 16M
• Table-based designs need fewer memory cells, so table-based area is smaller than non-table-based
40. B. Circuit-level vs. Algorithmic Schemes – 2RnW (Power)
• AMM has lower power (even for non-table-based designs), since an access only activates the involved banks rather than the whole memory
• For 2R8W HB-NTX-RdWr: 45.59% to 2.1x lower power from 512K to 16M
• For 2R8W TBLVT_B-NTX-Rd: 3.29% to 4.71x lower power from 1K to 16M
41. Performance and Impact of Design Factors
A. Experiment Setup
B. Circuit-level vs. Algorithmic Schemes
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Trade-off Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules
42. C. Overall Performance and Cost – Non-Table-Based vs. Table-Based (Latency)
• Non-table-based schemes have shorter access latencies thanks to their simple logic operations; table-based schemes are impacted by the routing path to the lookup table
• At greater memory depths, AMM latency is mainly determined by the SRAM modules; for example, for 1R2W from 1K to 1M, memory cells account for 28.82% to 71.55% of overall latency in NTX-Wr and 26.94% to 69.61% in TBLVT
43. C. Overall Performance and Cost – Non-Table-Based vs. Table-Based (Area)
• The area of AMM is mainly determined by the SRAM modules; for example, for 1R2W from 1K to 1M, memory cells account for 92.52% to 99.99% of overall area in NTX-Wr and 93.11% to 99.99% in TBLVT
• Table-based designs need fewer memory modules, so table-based area is smaller than non-table-based
44. C. Overall Performance and Cost – Non-Table-Based vs. Table-Based (Power)
• The power of AMM is mainly determined by the SRAM modules; for example, for 1R2W from 1K to 1M, memory cells account for 89.27% to 99.97% of overall power in NTX-Wr and 88.9% to 99.97% in TBLVT
• Table-based designs need fewer memory modules, so table-based power is lower than non-table-based
45. Performance and Impact of Design Factors
A. Experiment Setup
B. Circuit-level vs. Algorithmic Schemes
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Trade-off Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules
46. D. Impact of Banking Structure – Non-Table-Based (Latency)
• The banking structure is a trade-off between memory cells and logic
• The best banking structure differs from design to design
• For example, for 2R1W, the 32-bank structure has the shortest latency among all banking structures
47. D. Impact of Banking Structure – Non-Table-Based (Area)
• Area efficiency: (area)/(data size); aggressive banking inflates the memory cell area per stored bit
• The 8-bank structure has the smallest area among all the designs (normalized area across 1 to 256 banks: 2, 1.502, 1.253, 1.13, 1.17, 1.246, 1.346, 1.566, 1.80)
• The memory cell area is the dominant factor of overall area: logic occupies only 0.0782% (1 bank) to 8.87% (256 banks) of overall area
48. D. Impact of Banking Structure – Non-Table-Based (Power)
• Power efficiency: (power)/(data size); aggressive banking inflates the power per access
• The 8-bank structure has the lowest power among all the designs (normalized power across 1 to 256 banks: 1.68, 1.29, 1.17, 1.13, 1.27, 1.46, 1.74, 1.866, 2.01)
• The memory cell power is the dominant factor of overall power: logic occupies only 0.201% (1 bank) to 10.653% (256 banks) of overall power
49. Performance and Impact of Design Factors
A. Experiment Setup
B. Circuit-level vs. Algorithmic Schemes
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Trade-off Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules
50. E. Scalability with Memory Depths and Number of Ports – Latency
• The AMM performance is mainly determined by the SRAM modules
• For 2R4W HB-NTX-RdWr, latency stays nearly flat from 64K to 128K and from 256K to 512K, where the same SRAM module is reused
• For 2R4W HB-NTX-RdWr, latency rises from 128K to 256K, where the building module changes from a 2K-depth 2RW to a 4K-depth 2RW SRAM
51. E. Scalability with Memory Depths and Number of Ports – Area
• The AMM performance is mainly determined by the SRAM modules
• HB-NTX-RdWr attains smaller area than NTX-Wr_B-NTX-Rd
• For 2R8W, HB-NTX-RdWr attains 11.21% to 2.33x smaller area than NTX-Wr_B-NTX-Rd from 16K to 1M
52. E. Scalability with Memory Depths and Number of Ports – Power
• The AMM performance is mainly determined by the SRAM modules
• HB-NTX-RdWr attains lower power than NTX-Wr_B-NTX-Rd
• For 2R8W, HB-NTX-RdWr attains 9.51% to 55.39% lower power than NTX-Wr_B-NTX-Rd from 16K to 1M
53. Performance and Impact of Design Factors
A. Experiment Setup
B. Circuit-level vs. Algorithmic Schemes
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Trade-off Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules
54. F.(1) Different Basic SRAM Modules – Latency
• For a wide range of sizes, all these basic SRAM modules pose very similar performance and cost
55. F.(1) Different Basic SRAM Modules – Area
• For a wide range of sizes, all these basic SRAM modules pose very similar performance and cost
56. F.(1) Different Basic SRAM Modules – Power
• For a wide range of sizes, power consumption ranks: 2RW > 1R1W/2R > 1R1W
• An AMM built purely from 2RW modules consumes more power than one mixing 2RW and 1R1W modules
57. F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules – Latency
• CMM can outperform AMM in certain configurations
• Can we attain better performance by properly choosing the SRAM modules?
• We apply three different SRAM modules (2RW, 2R2W, 2R4W), taking 4R2W as an example
• For 4R2W, AMM with 2RW SRAMs attains 6.08% to 2.032x shorter latencies from 1K to 16M than AMM with 2R2W
• This is because the latency of HB-NTX-RdWr is mainly determined by the SRAM module, and 2RW is faster than 2R2W and 2R4W
58. F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules – Area
• But using a more complex SRAM does provide benefits in area and power
• AMM with higher-port SRAMs needs far fewer building modules (e.g. for 4R2W: 54 vs. 2)
• For 4R2W, AMM with 2R2W SRAMs attains 30.28% to 71.43% smaller area from 1K to 16M than AMM with 2RW
59. F.(2) AMM Designs with Higher-Port-Count Circuit-level Modules – Power
• But using a more complex SRAM does provide benefits in area and power
• AMM with higher-port SRAMs needs far fewer building modules (e.g. for 4R2W: 54 vs. 2)
• For 4R2W, AMM with 2R2W SRAMs attains 2.52x to 3.42x lower power from 1K to 16M than AMM with 2RW
61. Summary of Experiments on AMM Studies
• AMM does attain superior performance (latency/area/power) over CMM, and the benefits become more significant for designs with more ports and greater depth
• The performance of AMM is mainly determined by the algorithmic logic when the memory depth is shallow; the building SRAM modules become the main performance factor at great depth
• Non-table-based AMMs have shorter latencies than table-based designs, while table-based AMMs pose smaller area and lower power consumption than non-table-based AMMs
• A proper banking structure enhances performance, while excessively aggressive banking can induce significant overhead and a performance hit
• Choosing a proper SRAM with higher port counts as the building module can enhance the performance (area/power) of AMM designs
62. Conclusions
• Most of the previous works on AMM were conducted on FPGA-based platforms, implemented by composing multiple BRAMs and logic slice LUTs (lookup tables)
• This thesis provides a comprehensive analysis and exploration of algorithmic multi-ported memory on ASICs:
• Different basic SRAM modules
• Scalability with memory depths and number of ports for AMM designs
• Applying banking structures to AMM designs
• Circuit-level schemes vs. algorithmic schemes for different port configurations
• Choosing proper SRAM modules with higher port counts can enhance the performance of AMM designs
65. References
•[1] Abdel-Hafeez, Saleh M., and Anas S. Matalkah. "CMOS eight-transistor memory cell for low-dynamic-power high-speed embedded
SRAM." Journal of Circuits, Systems, and Computers 17.05 (2008): 845-863.
•[2] Bhagyalakshmi, I. V., Ravi Teja, and Madhan Mohan. "Design and VLSI Simulation of SRAM Memory Cells for Multi-ported SRAM’s." (2014).
•[3] Rivest, Ronald L., and Lance A. Glasser. A Fast-Multiport Memory Based on Single-Port Memory Cells. No. MIT/LCS/TM-455. MASSACHUSETTS
INST OF TECH CAMBRIDGE LAB FOR COMPUTER SCIENCE, 1991.
•[4] Park, Seon-yeong, et al. "CFLRU: a replacement algorithm for flash memory." Proceedings of the 2006 international conference on Compilers,
architecture and synthesis for embedded systems. ACM, 2006.
•[5] Synopsys Design Compiler User Guide Version X-2005.09. [Online] Available:
http://beethoven.ee.ncku.edu.tw/testlab/course/VLSIdesign_course/course_96/Tool/Design_Compiler%20_User_Guide.pdf
•[6] Synopsys Design Compiler Optimization Reference Manual Version D-2010.03. [Online] Available: http://cleroux.vvv.enseirb-
matmeca.fr/EN219/doc/dcrmo.pdf
•[7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th annual ACM/SIGDA
international symposium on Field programmable gate arrays (FPGA), pp. 41-50, ACM, 2010.
•[8] Charles Eric LaForest, Ming Gang Liu, Emma Rae Rapati, and J. Gregory Steffan. "Multi-ported Memories for FPGAs via XOR," In Proceedings of the
20th annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), pp. 209–218, ACM, 2012.
•[9] Charles Eric Laforest, Zimo Li, Tristan O'rourke, Ming G. Liu, and J. Gregory Steffan. "Composing Multi-Ported Memories on FPGAs," in
Proceedings of the ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol.7, issue 3, article no. 16, 2014.
•[10] Lin, Jiun-Liang, and Bo-Cheng Charles Lai. "BRAM Efficient Multi-ported Memory on FPGA." VLSI Design, Automation and Test (VLSI-DAT), 2015
International Symposium on. IEEE, 2015.
•[11] Lai, Bo-Cheng Charles, and Jiun-Liang Lin. "Efficient Designs of Multiported Memory on FPGA." IEEE Transactions on Very Large Scale Integration
(VLSI) Systems (2016).
65
66. Chun-Feng Chen NCTU_IEE - PCS Lab
References
•[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGA." IEEE
Transactions on Very Large Scale Integration (VLSI) Systems (2017).
•[13] S. Iyer and D. Chuang. (Jan. 2012) “Algorithmic Memory Brings an Order of Magnitude Performance Increase to Next Generation SoC Memories
“DesignCon, accessed on Jun. 22, 2017. [Online] Available: http://www.yuba.stanford.edu/sundaes/Papers/DesignCon-AlgMem.pdf
•[14] Tse, David N. C., Pramod Viswanath, and Lizhong Zheng. "Diversity-multiplexing tradeoff in multiple-access channels." IEEE Transactions on
Information Theory 50.9 (2004): 1859-1874.
•[15] Ping, Li, et al. "Interleave division multiple-access." IEEE Transactions on Wireless Communications 5.4 (2006): 938-947.
•[16] Suhendra, Vivy, Chandrashekar Raghavan, and Tulika Mitra. "Integrated scratchpad memory optimization and task scheduling for MPSoC
architectures." Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. ACM, 2006.
•[17] Iyer, Sundar, and Shang-Tse Chuang. "High speed memory systems and methods for designing hierarchical memory systems." U.S. Patent
Application No. 12/806,631.
•[18] Wilton, Steven JE, and Norman P. Jouppi. "CACTI: An enhanced cache access and cycle time model." IEEE Journal of Solid-State Circuits 31.5
(1996): 677-688.
•[19] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norman P. Jouppi. "CACTI 6.0: A tool to model large caches." HP Laboratories (2009):
22-31.
•[20] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norm Jouppi. "Optimizing NUCA organizations and wiring alternatives for large caches
with CACTI 6.0." Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007.
•[21] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norman P. Jouppi. "Architecting efficient interconnects for large caches with CACTI
6.0." IEEE micro 28.1 (2008).
•[22] Thoziyoor, Shyamkumar, et al. CACTI 5.1. Technical Report HPL-2008-20, HP Labs, 2008.
•[23] Synopsys Design Compiler Standard Cell Library, including TSMC, UMC and SMIC. [Online] Available:
https://www.synopsys.com/dw/ipdir.php?ds=dwc_standard_cell
66
67. Chun-Feng Chen NCTU_IEE - PCS Lab
References
•[24] TSMC Standard Cell Library (including 45nm, 90nm advanced technology) Description Name. [Online] Available: http://www.europractice-
ic.com/libraries_TSMC.php
•[25] Bo-Cheng Charles Lai, Jiun-Liang Lin, Kun-Hua Huang, and Kuo-Cheng Lu. "Method for accessing multi-port memory module, method for
increasing write ports of memory module and associated memory controller." U.S. Patent Application No. 15/098,330.
•[26] Bo-Cheng Charles Lai, Jiun-Liang Lin, and Kuo-Cheng Lu. "Method for accessing multi-port memory module and associated memory controller."
U.S. Patent Application No. 15/098,336.
•[27] Tseng, Jessica H., and Krste Asanović. "Banked multiported register files for high-frequency superscalar microprocessors." ACM SIGARCH
Computer Architecture News. Vol. 31. No. 2. ACM, 2003.
•[28] Kim, John. "Low-cost router microarchitecture for on-chip networks." Proceedings of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture. ACM, 2009.
•[29] Gupta, Pankaj, Steven Lin, and Nick McKeown. "Routing lookups in hardware at memory access speeds." INFOCOM'98. Seventeenth Annual Joint
Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Vol. 3. IEEE, 1998.
•[30] Hughes, John H. "Routing table lookup implemented using M-trie having nodes duplicated in multiple memory Banks." U.S. Patent No.
6,308,219. 23 Oct. 2001.
•[31] McAuley, Anthony J., Paul F. Tsuchiya, and Daniel V. Wilson. "Fast multilevel hierarchical routing table lookup using content addressable
memory." U.S. Patent No. 5,386,413. 31 Jan. 1995.
•[32] Teitenberg, Tim, and Bikram Singh Bakshi. "Efficient memory management for channel drivers in next generation I/O system." U.S. Patent No.
6,421,769. 16 Jul. 2002.
•[33] Treleaven, Philip C., David R. Brownbridge, and Richard P. Hopkins. "Data-driven and demand-driven computer architecture." ACM Computing
Surveys (CSUR) 14.1 (1982): 93-143.
•[34] Peng, Zebo, and Krzysztof Kuchcinski. "Automated transformation of algorithms into register-transfer level implementations." IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems 13.2 (1994): 150-166.
67
68. Chun-Feng Chen NCTU_IEE - PCS Lab
References
•[35] Keshav, Srinivasan, and Rosen Sharma. "Issues and trends in router design." IEEE Communications magazine 36.5 (1998): 144-151.
•[36] Tullsen, Dean M., et al. "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor." ACM
SIGARCH Computer Architecture News. Vol. 24. No. 2. ACM, 1996.
•[37] Xilinx 7 Series FPGAs Configurable Logic Block User Guide. [Online] Available:
http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
•[38] Fetzer, E. S., Gibson, M., Klein, A., Calick, N., Zhu, C., Busta, E., & Mohammad, B. (2002). "A fully bypassed six-issue integer datapath and register
file on the Itanium-2 microprocessor." IEEE Journal of Solid-State Circuits Conference, vol. 1, Feb. 2002, pp. 420-478.
•[39] Bajwa, H., and X. Chen. "Low-Power High-Performance and Dynamically Configured Multi-port Cache Memory Architecture." Electrical
Engineering, 2007. ICEE'07. International Conference on. IEEE, April, 2007.
•[40] S. Ben-David, A. Borodin, R. Karp, G. Tardos, and A. Wigderson, “On the Power of Randomization in On-line Algorithms”, New York: Springer,
1994.
68