This document discusses architectural synthesis of DSP structured datapaths. It provides an overview of the architectural level synthesis problem and subtasks like scheduling, binding, and architecture optimization. The document describes using novel mathematical programming formulations to optimize performance and structural complexity for DSP synthesis. It also discusses techniques to improve the solution time for integer linear programming formulations, and provides results for typical high-level synthesis benchmarks.
Iaetsd multioperand redundant adders on FPGAs (Iaetsd Iaetsd)
This paper presents efficient implementations of redundant multi-operand adders on FPGAs. Previous work avoided redundant adders on FPGAs due to the efficient carry propagate adders (CPAs) and area overhead of redundant adders. The paper proposes carry-save compressor tree approaches that achieve fast critical paths independent of bit width with little to no area overhead compared to CPA trees. It presents a classic carry-save compressor tree and a novel linear array structure that efficiently uses fast carry chains. Compared to binary and ternary CPA trees, the approaches achieve speedups of up to 3.81 times for 64-bit width additions.
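The carry-save idea behind such compressor trees can be sketched in software. This is a minimal illustration for non-negative integers, not the paper's tree or linear-array structure; the function names `carry_save_step` and `multi_operand_add` are hypothetical, and the operands are reduced one compressor at a time rather than in a balanced tree.

```python
def carry_save_step(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: reduce three operands to a sum word and a carry word."""
    partial_sum = a ^ b ^ c                      # per-bit sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carries, moved up one position
    return partial_sum, carry


def multi_operand_add(operands: list[int]) -> int:
    """Reduce operands with 3:2 compressors, then do one carry-propagate add."""
    work = list(operands)
    while len(work) > 2:
        a, b, c = work.pop(), work.pop(), work.pop()
        work.extend(carry_save_step(a, b, c))    # three words in, two words out
    return sum(work)                             # the only carry-propagating addition
```

No compressor step propagates a carry across bit positions, which is why the critical path of the reduction does not depend on the bit width; only the final addition does.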
VLSI design process for low power design methodology using reconfigurable FPGA (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
An Offline Hybrid IGP/MPLS Traffic Engineering Approach under LSP Constraints (EM Legacy)
This document proposes a novel hybrid IGP/MPLS traffic engineering method based on genetic algorithms to handle long or medium-term traffic variations. The method treats the maximum number of hops an LSP may take and the number of LSPs applied solely to improve routing as constraints. Results comparing this hybrid approach to pure IGP routing and full mesh MPLS with and without flow splitting on the German scientific network and other networks are presented.
The document discusses the steps in the logic synthesis process from RTL to optimized gate-level netlist. It includes:
1) RTL description is converted to an internal representation
2) Logic is optimized to remove redundancy
3) Technology mapping implements the representation using cells from a technology library
The document also discusses floor planning, which determines routing areas by placing blocks/macros, and placement which places standard cells in rows to minimize area and interconnect cost.
Ternary content addressable memory for longest prefix matching based on rando... (TELKOMNIKA JOURNAL)
Conventional ternary content addressable memory (TCAM) provides access to stored data, which consists of '0', '1' and 'don't care', and outputs the matched address. Content lookup in TCAM can be done in a single cycle, which makes it very important in applications such as address lookup and deep-packet inspection. This paper proposes an improved TCAM architecture with fast update functionality. To support longest prefix matching (LPM), LPM logic must be added to the proposed TCAM. The latency of the proposed LPM logic depends on the number of matching addresses in the address prefix comparison. Parallel LPM logic is added to improve the throughput by 10× compared to the design without it. Although this adds resource overhead, the cost of throughput per bit is lower than in the design without parallel LPM logic.
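A software model may help make ternary matching and longest prefix selection concrete. This is a sketch under stated assumptions, not the proposed architecture: each entry is a hypothetical `(value, mask)` pair in which a mask bit of 0 means 'don't care', and the loop checks entries one by one where a hardware TCAM compares all of them in parallel.

```python
def tcam_lookup(entries, key):
    """Return the index of the matching entry with the longest prefix, or None."""
    best_index, best_len = None, -1
    for index, (value, mask) in enumerate(entries):
        if (key & mask) == (value & mask):       # masked-out bits are 'don't care'
            prefix_len = bin(mask).count("1")    # number of bits that must match
            if prefix_len > best_len:
                best_index, best_len = index, prefix_len
    return best_index


# 8-bit example: a /3 prefix, a more specific /4 prefix, and a default entry
table = [
    (0b10100000, 0b11100000),
    (0b10110000, 0b11110000),
    (0b00000000, 0b00000000),
]
```

With this table, a key that matches both the /3 and the /4 entries resolves to the /4 entry, which is the selection the LPM logic has to make among all matching addresses.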
Label encoding algorithm for MPLS Segment Routing - Nca2016 (Rabah GUEDREZ)
The document proposes algorithms to more efficiently encode Segment Routing paths in MPLS networks to address the limitation of maximum stack depth (MSD). It introduces the problem that strictly encoding paths as a label stack can result in some paths being unusable if they exceed the MSD. It then presents the Segment Routing Paths Label Encoding Algorithm (SR-LEA) which can encode paths using fewer labels by mapping multiple hops to a single label when possible. Simulation results show SR-LEA increases the percentage of usable paths. The document concludes the algorithms help mitigate the impact of the MSD limitation and enable wider deployment of Segment Routing.
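Mapping several hops to a single label can be illustrated with a greedy sketch. This is not SR-LEA as published; it is a simplified stand-in. The names `encode_path`, `fits_msd` and the precomputed `shortest_paths` table are hypothetical, and the sketch assumes that a sub-path equal to the IGP shortest path can be expressed by one node label while any single hop can always be expressed by one label.

```python
def encode_path(path, shortest_paths):
    """Greedy encoding: one label per maximal sub-path that equals the shortest path."""
    labels = []
    start = 0
    while start < len(path) - 1:
        end = start + 1  # assumption: a single hop is always encodable
        while end + 1 < len(path):
            candidate = path[start:end + 2]
            if shortest_paths.get((path[start], path[end + 1])) != candidate:
                break    # extending further would leave the shortest path
            end += 1
        labels.append(path[end])  # label of the node that ends this segment
        start = end
    return labels


def fits_msd(labels, msd):
    """A path is usable only if its label stack does not exceed the MSD."""
    return len(labels) <= msd
```

For a three-hop path whose first two hops follow the shortest path, the sketch produces two labels instead of three, so the path becomes usable on a device with an MSD of 2.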
OPTIMIZATION OF IP NETWORKS IN VARIOUS HYBRID IGP/MPLS ROUTING SCHEMES (EM Legacy)
The document discusses optimization of traffic engineering in hybrid IGP/MPLS networks using a genetic algorithm approach. It formulates the problem and introduces notation for the network topology, link capacities, traffic demands, and label switched paths (LSPs). It then describes three hybrid routing schemes - basic IGP shortcut, IGP shortcut, and overlay - that combine IGP routing with MPLS. The document proposes using a genetic algorithm to solve the optimization problem. It describes encoding potential solutions as chromosomes, where each value represents an LSP assignment for a traffic flow. The algorithm aims to minimize network congestion by evolving populations of chromosomes over iterations to find optimal LSP configurations. Results are presented for the German scientific network topology.
Study about Locator/Identifier Separation Protocol (LISP) (Assia Bakrim)
This document provides an overview of the Locator/Identifier Separation Protocol (LISP). LISP aims to address issues with internet mobility and scalability by separating a device's identifier and locator. It introduces Endpoint Identifiers (EIDs) that serve as a device's identity and Routing Locators (RLOCs) that indicate where the device is attached. The mapping system maps EIDs to RLOCs to allow packets to be routed correctly. LISP has seen growing deployment worldwide and offers advantages like incremental deployment, network virtualization, and reducing the size of global routing tables. However, challenges remain around managing the mapping system and ensuring reachability.
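The EID-to-RLOC mapping step can be sketched as a table lookup. This is only an illustration: the real mapping system resolves EIDs through protocol messages rather than a local dictionary, and the function `lookup_rloc` and the example prefixes and locators are hypothetical.

```python
import ipaddress


def lookup_rloc(mapping, eid):
    """Return the RLOC list for the most specific EID prefix covering `eid`, or None."""
    addr = ipaddress.ip_address(eid)
    best = None
    for prefix, rlocs in mapping.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, rlocs)              # keep the longest matching prefix
    return best[1] if best else None


# Hypothetical mapping database: EID prefix -> routing locators
eid_to_rloc = {
    "10.0.0.0/8": ["192.0.2.1"],
    "10.1.0.0/16": ["198.51.100.7"],
}
```

The endpoint keeps its EID when it moves; only the RLOC list stored for its prefix changes, which is the separation of identity and location the overview describes.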
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad... (IJMTST Journal)
This document compares the performance of four types of parallel prefix adders (Kogge-Stone, sparse Kogge-Stone, spanning tree, and Brent Kung) implemented on a Xilinx Spartan 3E FPGA. It finds that the parallel prefix adders have better performance than ripple carry and carry skip adders for widths above 56 bits. For the FPGA implementation, the Brent Kung adder requires the smallest area while the Kogge-Stone adder is largest. Simulation results show the Brent Kung adder has the fastest delay. Measurements using logic analysis equipment confirm the simulation results.
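The prefix computation that defines a Kogge-Stone adder can be modelled at the bit level. This is a behavioural sketch with word-wide bitwise operations, not the FPGA implementation compared in the document; the function name `kogge_stone_add` and the default `width` are assumptions, and the carry-in is taken to be zero.

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned `width`-bit values with a Kogge-Stone prefix network (mod 2**width)."""
    mask = (1 << width) - 1
    generate = a & b & mask          # bit i creates a carry by itself
    propagate = (a ^ b) & mask       # bit i passes an incoming carry on
    half_sum = propagate
    shift = 1
    while shift < width:             # log2(width) prefix stages
        generate = (generate | (propagate & (generate << shift))) & mask
        propagate = (propagate & (propagate << shift)) & mask
        shift <<= 1
    carries = (generate << 1) & mask  # carry into bit i comes from the group below it
    return half_sum ^ carries
```

Each loop iteration corresponds to one stage of the prefix tree, so an 8-bit addition needs three stages, where a ripple-carry adder would pass the carry through all eight bit positions in sequence.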
2 - Generation of PSK signal using non linear devices via MATLAB (presented i... (Youness Lahdili)
This document discusses generating a QPSK signal using MATLAB. It begins with an introduction to PSK modulation techniques and QPSK. It then describes the simulation design process in MATLAB, including representing the QPSK signal using I and Q components, and generating the signal using formulas programmed in MATLAB code. The code generates a QPSK modulated signal from binary input data along with plots of the original data and modulated signal. It also describes representing the generated QPSK signal using scatter plots in MATLAB to visualize the constellation.
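The I/Q representation of a QPSK signal can be sketched as follows. The document's own code is in MATLAB and is not reproduced here; this Python version is an independent illustration. The bit-to-symbol mapping (bit 0 to +1, bit 1 to -1), the function names, the carrier frequency and the samples-per-symbol value are all assumptions.

```python
import math


def qpsk_symbols(bits):
    """Map bit pairs to unit-energy (I, Q) points: bit 0 -> +1/sqrt(2), bit 1 -> -1/sqrt(2)."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    scale = 1 / math.sqrt(2)
    return [((1 - 2 * bits[i]) * scale, (1 - 2 * bits[i + 1]) * scale)
            for i in range(0, len(bits), 2)]


def qpsk_waveform(bits, carrier_hz=2.0, samples_per_symbol=16):
    """Sample s(t) = I*cos(2*pi*f*t) - Q*sin(2*pi*f*t), one symbol per unit of time."""
    samples = []
    for k, (i_amp, q_amp) in enumerate(qpsk_symbols(bits)):
        for n in range(samples_per_symbol):
            t = k + n / samples_per_symbol
            phase = 2 * math.pi * carrier_hz * t
            samples.append(i_amp * math.cos(phase) - q_amp * math.sin(phase))
    return samples
```

Plotting the pairs returned by `qpsk_symbols` gives the four-point constellation that the scatter plots visualize.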
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM (VLSICS Design)
This paper is devoted to the design of a dual-core cryptoprocessor that executes both prime field and binary field instructions. The proposed design is specifically optimized for the field-programmable gate array (FPGA) platform. The combined execution of instructions from two different fields (prime field GF(p) and binary field GF(2^m)) is analysed. The design is implemented on Spartan 3E and Virtex-5, and the performance results of both are compared. The implementation results show parallel execution using dual field instructions.
Implementation of High Throughput Radix-16 FFT Processor (IJMER)
This work extends the radix-4 algorithm to radix-16 to achieve a high throughput of 2.59 giga-samples/s for WPANs. The radix-16 algorithm is also reformulated to achieve low complexity, low area cost and high performance. The radix-16 FFT is obtained by cascading radix-4 butterfly units. This facilitates a low-complexity realization of the radix-16 butterfly operation and high operation speed due to its optimized pipelined structure. Besides, a new three-stage multiplier for twiddle factor multiplication is also proposed, which has lower area and power consumption than conventional complex multipliers.
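The radix-4 butterfly used as the building block can be written out directly. This is a sketch of the standard 4-point DFT only; the function name `radix4_butterfly` is hypothetical, and the paper's cascading into radix-16 and its three-stage twiddle multiplier are not modelled.

```python
def radix4_butterfly(x0, x1, x2, x3):
    """4-point DFT: X[k] = sum over n of x[n] * (-1j) ** (n * k)."""
    return (
        x0 + x1 + x2 + x3,
        x0 - 1j * x1 - x2 + 1j * x3,
        x0 - x1 + x2 - x3,
        x0 + 1j * x1 - x2 - 1j * x3,
    )
```

The only factors are 1, -1, 1j and -1j, so the butterfly itself needs additions, subtractions and real/imaginary swaps but no general complex multiplication; multipliers are only required for the twiddle factors between stages.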
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
High-Speed Implementation of Design and Analysis by Using Parallel Prefix Ad... (IOSRJECE)
The binary adder is the critical element in most digital circuit designs, including digital signal processors (DSP) and microprocessor datapath units. As such, extensive research continues to focus on improving the power and delay of the adder. The design and analysis of the parallel prefix adders (carry select adders) is to be implemented using Verilog. In VLSI implementations, parallel prefix adders offer very high speed. Binary adders are among the most essential logic elements within a digital system, so any improvement in binary addition can boost the performance of the entire computing system. Parallel-prefix adders (also known as carry-tree adders) are known to have the best performance in VLSI designs. This paper investigates the Kogge-Stone, sparse Kogge-Stone, Ladner-Fischer and Brent-Kung adders and compares them to the simple Ripple Carry Adder (RCA) for a high number of binary bits.
JPJ1433 Cost-Effective Resource Allocation of Overlay Routing Relay Nodes (chennaijp)
This document proposes an algorithmic framework to efficiently allocate resources for overlay routing. It formulates the overlay routing resource allocation problem to find the minimum number of overlay nodes needed to satisfy certain routing properties. The problem is shown to be NP-hard, and an approximation algorithm is presented. In experiments, the approach finds near-optimal placements of less than 100 nodes to enable shortest path routing between autonomous systems, reducing average path lengths by 40%. It can also improve TCP performance and reduce delays for voice applications.
Moolle fan-out control for scalable distributed data stores (SungJu Cho)
Many Online Social Networks horizontally partition data across data stores. This allows the addition of server nodes to increase capacity and throughput. For single key lookup queries such as computing a member's 1st degree connections, clients need to generate only one request to one data store. However, for multi key lookup queries such as computing a 2nd degree network, clients need to generate multiple requests to multiple data stores. The number of requests to fulfill the multi key lookup queries grows in relation to the number of partitions. Increasing the number of server nodes in order to increase capacity also increases the number of requests between the client and data stores. This may increase the latency of the query response time because of network congestion, tail-latency, and CPU bounding. Replication based partitioning strategies can reduce the number of requests in the multi key lookup queries. However, reducing the number of requests in a query can degrade the performance of certain queries where processing, computing, and filtering can be done by the data stores. A better system would provide the capability of controlling the number of requests in a query. This paper presents Moolle, a system of controlling the number of requests in queries to scalable distributed data stores. Moolle has been implemented in the LinkedIn distributed graph service that serves hundreds of thousands of social graph traversal queries per second. We believe that Moolle can be applied to other distributed systems that handle distributed data processing with a high volume of variable-sized requests.
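The way fan-out grows with the number of partitions can be shown with a small sketch. It illustrates the problem only, not Moolle's control mechanism; the function `plan_requests` is hypothetical and assumes integer keys partitioned by `key % num_partitions`.

```python
from collections import defaultdict


def plan_requests(keys, num_partitions):
    """Group keys by partition; each non-empty group is one request to one data store."""
    batches = defaultdict(list)
    for key in keys:
        batches[key % num_partitions].append(key)   # assumed modulo partitioning
    return dict(batches)
```

The same ten-key lookup needs two requests with two partitions and five requests with five partitions, so adding server nodes for capacity also raises the number of requests per query, while a single-key lookup always costs exactly one request.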
This document describes a multi-path routing algorithm for IP networks based on flow optimization. It presents an intra-domain routing algorithm that uses multi-commodity flow optimization to enable load-sensitive forwarding over multiple paths without being constrained by traditional routing protocols like OSPF. The key idea is to aggregate all traffic destined for the same egress node into one commodity during optimization, reducing the number of commodities significantly. This makes the computation tractable and allows forwarding based on destination addresses.
Implementation and Design of High Speed FPGA-based Content Addressable Memory (ijsrd.com)
CAM stands for content addressable memory. It is a special type of computer memory used in very high speed searching applications. A CAM is a memory that implements the high speed lookup-table function in a single clock cycle using dedicated comparison circuitry. It is also known as associative memory or associative array, although the latter term is more often used for a programming data structure. Unlike standard computer memory (RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed so that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a list of one or more storage addresses where the word was found. The design coding, simulation, logic synthesis and implementation will be done using various EDA tools.
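The contrast between RAM access and CAM search can be modelled in a few lines. This is a behavioural sketch, not the FPGA design; the class `SoftwareCAM` and its methods are hypothetical, and the search scans cells one after another where the hardware compares every cell in the same clock cycle.

```python
class SoftwareCAM:
    """Behavioural model: write by address like a RAM, search by content like a CAM."""

    def __init__(self, size):
        self.cells = [None] * size            # None marks an empty cell

    def write(self, address, word):
        self.cells[address] = word            # RAM-style: address in, word stored

    def search(self, word):
        """CAM-style: word in, list of every address holding that word out."""
        return [address for address, stored in enumerate(self.cells)
                if stored is not None and stored == word]


cam = SoftwareCAM(4)
cam.write(0, 0xAB)
cam.write(1, 0x10)
cam.write(2, 0xAB)
```

Searching for `0xAB` returns both addresses that hold it, matching the description of a CAM returning one or more storage addresses, and a word that is not stored yields an empty list.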
IRJET- A Survey on Reconstruct Structural Design of FPGA (IRJET Journal)
This document summarizes research on reconstructing the structural design of field-programmable gate arrays (FPGAs). It discusses how logic block complexity, interconnect structure, hardwired logic blocks, and data path implementation can impact the area and speed of FPGAs. The document analyzes past studies that examined how varying the number of lookup table (LUT) inputs, routing flexibility between blocks, and use of hardwired connections affected the routability and timing of mapped circuits. It also describes how FPGAs can be optimized for logic emulation applications by multiplexing logic units over time through memory-based architectures. In general, the document reviews FPGA architectural parameters and how researchers have iteratively improved designs through simulation and
This summarizes a fast re-route method to find an alternate path after a link failure, before the interior gateway protocol has reconverged. The method selects the next hop among a source node's neighbors based on which has the lowest number of visits (multiplicity) and shortest estimated distance to the destination. It is proven to always find an alternate path if one exists. The method improves over loop-free alternate approaches by not requiring tunnels. It can find paths for simple cases like a square topology where LFA fails.
This document proposes and analyzes two new C-RAN network architectures that utilize SDN and centralized baseband processing. The first architecture (D-MME-CRAN) distributes the mobility management entity (MME) function within each C-RAN, while the second (C-MME-CRAN) centralizes the MME. Both architectures are evaluated based on control signaling load across five procedures when varying cell area and tracking area size. Results show the D-MME-CRAN performs best for small tracking areas, while C-MME-CRAN is better for larger areas. Overall, the proposed architectures reduce signaling load compared to legacy networks and other SDN-based approaches.
PMU-Based Real-Time Damping Control System Software and Hardware Architecture... (Luigi Vanfretti)
Poster Presentation at the IEEE PES General Meeting. Low-frequency, electromechanically induced, inter-area oscillations are of concern in the continued stability of interconnected power systems. Wide Area Monitoring, Protection and Control (WAMPAC) systems based on wide-area measurements such as synchrophasor (C37.118) data can be exploited to address the inter-area oscillation problem. This work develops a hardware prototype of a synchrophasor-based oscillation damping control system. A Compact Reconfigurable Input Output (cRIO) controller from National Instruments is used to implement the real-time prototype. This paper presents the design process followed for the development of the software architecture. The design method followed a three-step process of design proposal, design refinement and finally attempted implementation. The goals of the design, the challenges faced and the refinements necessary are presented. The design implemented is tested and validated on OPAL-RT's eMEGASIM real-time simulation platform and a brief discussion of the experimental results is included.
This document summarizes a presentation given at the 4th International Conference on Advances in Energy Research titled "Pinch Analysis for MultiDimensional Sustainable Energy Systems Planning". It discusses how pinch analysis, a process integration technique, can be applied to model multi-objective optimization problems in sustainable energy system planning by considering factors like energy return on investment, cost, and carbon emissions. A case study applying this approach to the energy system in the Philippines is presented, showing a Pareto optimal front of solutions balancing these objectives.
The document explores design processes used by architectural and engineering organizations. It begins by stating that all such organizations have design processes, whether documented or informal. It then indicates it will examine some common design processes from simple to more professional approaches. Finally, the document requests feedback from relevant disciplines like architecture, structure, electrical, and more on a tender stage design process for building projects that the author has created in Microsoft Project to integrate all necessary disciplines.
This document describes a multi-path routing algorithm for IP networks based on flow optimization. It presents an intra-domain routing algorithm that uses multi-commodity flow optimization to enable load-sensitive forwarding over multiple paths without being constrained by traditional routing protocols like OSPF. The key idea is to aggregate all traffic destined for the same egress node into one commodity during optimization, reducing the number of commodities significantly. This makes the computation tractable and allows forwarding based on destination addresses.
Implementation and Design of High Speed FPGA-based Content Addressable Memoryijsrd.com
CAM stands for content addressable memory. It is a special type of computer memory used in very high speed search applications. A CAM implements the lookup-table function in a single clock cycle using dedicated comparison circuitry. It is also known as associative memory or associative array, although the latter term is more often used for a programming data structure. In standard computer memory (RAM), the user supplies a memory address and the RAM returns the data word stored at that address. In a CAM, the user supplies a data word and the CAM searches its entire memory to see whether that word is stored anywhere in it. If the data word is found, the CAM returns a list of one or more storage addresses where the word was found. The design coding, simulation, logic synthesis, and implementation will be done using various EDA tools.
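The contrast between address-based and content-based lookup can be illustrated with a small behavioral model. This is a sketch only: the class and method names are hypothetical, and the software loop stands in for the comparison circuitry that a hardware CAM would run on all cells in parallel.

```python
class ContentAddressableMemory:
    """Behavioral model of a CAM; not a description of the paper's design."""

    def __init__(self, depth: int):
        # None marks an empty cell.
        self.cells = [None] * depth

    def write(self, address: int, word: int) -> None:
        """RAM-style write: store a word at a given address."""
        self.cells[address] = word

    def read(self, address: int):
        """RAM-style read: address in, data word out."""
        return self.cells[address]

    def search(self, word: int) -> list:
        """CAM-style search: data word in, list of matching addresses out."""
        return [
            addr
            for addr, stored in enumerate(self.cells)
            if stored is not None and stored == word
        ]


cam = ContentAddressableMemory(depth=8)
cam.write(2, 0xBEEF)
cam.write(5, 0xBEEF)
cam.write(6, 0x1234)
matches = cam.search(0xBEEF)  # addresses 2 and 5
```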
IRJET- A Survey on Reconstruct Structural Design of FPGAIRJET Journal
This document summarizes research on reconstructing the structural design of field-programmable gate arrays (FPGAs). It discusses how logic block complexity, interconnect structure, hardwired logic blocks, and data path implementation can impact the area and speed of FPGAs. The document analyzes past studies that examined how varying the number of lookup table (LUT) inputs, routing flexibility between blocks, and use of hardwired connections affected the routability and timing of mapped circuits. It also describes how FPGAs can be optimized for logic emulation applications by multiplexing logic units over time through memory-based architectures. In general, the document reviews FPGA architectural parameters and how researchers have iteratively improved designs through simulation and
This summarizes a fast re-route method to find an alternate path after a link failure, before the interior gateway protocol has reconverged. The method selects the next hop among a source node's neighbors based on which has the lowest number of visits (multiplicity) and shortest estimated distance to the destination. It is proven to always find an alternate path if one exists. The method improves over loop-free alternate approaches by not requiring tunnels. It can find paths for simple cases like a square topology where LFA fails.
This document proposes and analyzes two new C-RAN network architectures that utilize SDN and centralized baseband processing. The first architecture (D-MME-CRAN) distributes the mobility management entity (MME) function within each C-RAN, while the second (C-MME-CRAN) centralizes the MME. Both architectures are evaluated based on control signaling load across five procedures when varying cell area and tracking area size. Results show the D-MME-CRAN performs best for small tracking areas, while C-MME-CRAN is better for larger areas. Overall, the proposed architectures reduce signaling load compared to legacy networks and other SDN-based approaches.
PMU-Based Real-Time Damping Control System Software and Hardware Architecture...Luigi Vanfretti
Poster Presentation at the IEEE PES General Meeting. Low-frequency, electromechanically induced, inter-area oscillations are of concern in the continued stability of interconnected power systems. Wide Area Monitoring, Protection and Control (WAMPAC) systems based on wide-area measurements such as synchrophasor (C37.118) data can be exploited to address the inter-area oscillation problem. This work develops a hardware prototype of a synchrophasor-based oscillation damping control system. A Compact Reconfigurable Input Output (cRIO) controller from National Instruments is used to implement the real-time prototype. This paper presents the design process followed for the development of the software architecture. The design method followed a three step process of design proposal, design refinement and finally attempted implementation. The goals of the design, the challenges faced and the refinements necessary are presented. The design implemented is tested and validated on OPAL-RT’s eMEGASIM real-time simulation platform and a brief discussion of the experimental results is included.
This document summarizes a presentation given at the 4th International Conference on Advances in Energy Research titled "Pinch Analysis for MultiDimensional Sustainable Energy Systems Planning". It discusses how pinch analysis, a process integration technique, can be applied to model multi-objective optimization problems in sustainable energy system planning by considering factors like energy return on investment, cost, and carbon emissions. A case study applying this approach to the energy system in the Philippines is presented, showing a Pareto optimal front of solutions balancing these objectives.
The document explores design processes used by architectural and engineering organizations. It begins by stating that all such organizations have design processes, whether documented or informal. It then indicates it will examine some common design processes from simple to more professional approaches. Finally, the document requests feedback from relevant disciplines like architecture, structure, electrical, and more on a tender stage design process for building projects that the author has created in Microsoft Project to integrate all necessary disciplines.
This document describes a new approach for developing a high-level synthesis tool for low power VLSI designs called Gaut_w. Gaut_w is composed of low power modules that are used before an architectural synthesis tool to optimize designs at the behavioral and architectural levels for power savings. The key modules of Gaut_w are high level power estimation, module selection to choose optimal operators and supply voltages, optimization criteria to minimize area and power, and operator assignment to decrease switching activity. Experimental results on discrete wavelet transform algorithms show power savings from using Gaut_w.
This document discusses responsive web design and how it changes the design process. It recommends prioritizing content and identifying content chunks when designing for different screen sizes. Designers should decide on breakpoints and create grid templates for different device widths. The process involves wireframing and designing for both desktop and mobile simultaneously through iteration. Effective collaboration between designers and developers is important when screen sizes are considered.
Logic synthesis with synopsys design compilernaeemtayyab
This document provides an overview of logic synthesis with Synopsys Design Compiler. It discusses the ASIC design flow, logic synthesis process, the Design Compiler tool, and the steps to use Design Compiler including project setup, reading the design, setting constraints, optimizing the design, and analyzing results. The goals of logic synthesis are to convert HDL to an optimized gate-level design given a library and constraints. Design Compiler is used to perform logic synthesis and optimization for area, speed or power.
Episode 55 : Conceptual Process Synthesis-Design
Process Flowsheet Synthesis: Method to determine a process flowsheet that satisfies all product, operational and other requirements
SAJJAD KHUDHUR ABBAS
CEO, Founder & Head of SHacademy
Chemical Engineering, Al-Muthanna University, Iraq
Oil & Gas Safety and Health Professional – OSHACADEMY
Trainer of Trainers (TOT) - Canadian Center of Human Development
Building codes govern the design and construction of buildings to ensure safety and establish standards. Codes have existed for millennia and are updated regularly to reflect advances in technology and materials. The modern building code focuses on occupancy classifications, fire prevention, structural integrity, accessibility, and other life safety issues. Architects and engineers use the building code throughout the design process to ensure their designs meet all applicable requirements.
The document discusses various concepts and methodologies related to software design including design specification modules, design languages like use case diagrams and class diagrams, fundamental design concepts like abstraction and modularity, modular design methods and criteria for evaluation, control terminology, effective modular design principles of high cohesion and low coupling, design heuristics, and ten heuristics for user interface design.
Khaled Almusa is a senior software engineer with over 10 years of experience developing web and mobile applications. He has extensive experience building full-stack applications using technologies like React, Node.js, and MongoDB. Currently, he works at Anthropic where he focuses on AI safety research and the development of Constitutional AI techniques.
This document provides an overview and summary of a course on professional practice for architecture students. The summary includes:
1) An introduction to the course, including information on the instructor, time/location of lectures, prerequisites, and catalog description.
2) An outline of the course requirements, including attendance, assignments, exams, and evaluation methods.
3) A list of the key learning objectives covering topics like the architect's role, project documentation, ethics, licensing, the building process, contracts, economics/finance, and professional organizations.
Architect's Act 1972 of India, Registration of Architects, Practice of Architecture, Standards of Education & Training of an Architect, Council of Architecture
1) The document discusses various fire safety design principles including fire avoidance, detection, growth restriction, containment, control and smoke control.
2) Key elements of fire avoidance include fire zoning, limiting combustible materials and fire load. Fire detection focuses on manual and automatic detection methods. Growth restriction methods center around manual firefighting equipment like extinguishers and sprinklers.
3) Fire containment principles involve compartmentalizing buildings using fire-rated walls and doors to confine fires. Fire control ensures firefighter access to buildings and hydrants.
This document provides an analysis of a proposed development site in Bandar Penawar, Johor, Malaysia. It includes summaries of the site conditions, surrounding land uses, accessibility, and development potential. A concept plan is proposed with clustered residential neighborhoods integrated with commercial areas and recreational parks. The overall theme is "Cluster Garden Living" to promote a balanced living environment that is safe, high quality, integrated with nature, vibrant, and convenient.
Architectural Design 1 Lectures by Dr. Yasser Mahgoub - ProcessGalala University
The document discusses the architectural design process. It describes the typical phases as:
1) Pre-design phase which involves programming to understand user needs.
2) Site analysis to understand the site context and how it relates to the user needs.
3) Schematic design phase where the main concepts of form and space are generated to address the user needs within the site context.
1. Recent advances in silicon technology have enabled more complex system-on-chip designs by allowing for higher densities and frequencies. However, verifying these complex, mixed-signal designs is challenging with traditional methodologies.
2. Currently, designers can simulate mixed-signal designs at various levels of abstraction from transistor-level up to system-level using languages like VHDL-AMS. However, transistor-level simulations are slow while higher-level languages do not support synthesis and have synchronization issues.
3. The document proposes a novel mixed-signal design methodology using a user-defined floating-point library in VHDL compatible with IEEE 754 to model analog operations digitally. This allows modeling and verifying
Devdut Pawaskar is seeking opportunities in VLSI Systems Design starting in December 2016. He has a Master's degree in Electrical and Computer Engineering from Georgia Tech with a focus on VLSI Systems and Digital Design. He has experience with circuit design tools like Synopsys and Cadence. His projects include designing a digital compensator for a Fully Integrated Voltage Regulator in 130nm and 28nm processes, and implementing a 6T-SRAM array and adder in 45nm. He also designed a noise tolerant low power dynamic NOR gate in 45nm.
Cockatrice: A Hardware Design Environment with ElixirHideki Takase
Cockatrice is a hardware design environment that allows designing hardware circuits from Elixir code. It synthesizes Elixir code that follows the "Zen style" of using enumerations and pipelines to describe dataflow, producing a hardware description language representation of a dataflow circuit. The synthesis flow analyzes the Elixir code, generates hardware modules from functions, connects them as a dataflow circuit, and outputs the final circuit description along with an interface driver for communication between the generated hardware and an Elixir software application. This allows accelerating parts of Elixir code by offloading processing to customized hardware circuits designed from the Elixir code.
Flexible dsp accelerator architecture exploiting carry save arithmeticIeee Xpert
This document proposes a novel flexible accelerator architecture comprising computational units (FCUs) that can efficiently perform DSP operations using carry-save arithmetic. Each FCU operates directly on carry-save operands and can be configured to perform templates of common DSP operations like multiplication and addition/subtraction. By keeping operands in carry-save format throughout the FCU, intermediate conversions are avoided, improving performance compared to prior approaches. The proposed architecture aims to achieve high computational density while reducing area and power compared to existing inflexible accelerator designs.
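The benefit of keeping operands in carry-save form can be seen in a short arithmetic sketch. The code below is not the FCU design from the document; the function names are hypothetical, and it models only the general technique on non-negative integers: a 3:2 compressor reduces three operands to a sum word and a carry word with no carry propagation, and a single carry-propagate addition is deferred to the very end.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple:
    """3:2 compressor: reduce three operands to (sum_word, carry_word).

    Every bit position is computed independently, so no carry ripples
    across the word. The invariant is sum_word + carry_word == x + y + z.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def sum_operands(operands) -> int:
    """Accumulate many operands in carry-save form, then convert once."""
    sum_word, carry_word = 0, 0
    for operand in operands:
        # The running value stays in carry-save format between steps.
        sum_word, carry_word = carry_save_add(sum_word, carry_word, operand)
    # Only this final step needs a carry-propagate adder.
    return sum_word + carry_word
```

Chaining several operations this way is what avoids the intermediate conversions back to ordinary binary form.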
1. The document describes Glacier, a component library and compiler for implementing continuous queries on FPGAs.
2. Glacier includes common streaming operators as well as specialized building blocks for the FPGA context. It can implement a variety of streaming queries by composing these components.
3. The paper evaluates the performance of queries implemented on an FPGA using Glacier, finding they can process over 1 million tuples per second directly from the network interface.
The document provides a summary of various physical design problems Lee Johnson has solved using Tcl scripting in different EDA tool environments over several projects from the late 1990s to recent years. It lists solutions by project, including mesh clock routing flows in Cadence Innovus from 2015-2016, floorplanning and routing scripts for IBM chips from 1997-2015, and scripts addressing problems like pin placement, bus routing, and macro placement for other ASIC projects during the same period. It also provides examples of general utilities developed.
Iaetsd pipelined parallel fft architecture through folding transformationIaetsd Iaetsd
This document presents a new VLSI architecture for a real-time pipeline FFT processor using fused floating point operations. It proposes high radix floating point butterflies implemented with two fused operations: a two-term dot product and add-subtract unit. Both discrete and fused radix processors are compared in terms of area. Higher throughput is achieved using a proposed architecture with conflict-free memory access and a new addressing scheme for radix-16 FFT.
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021Deepak Shankar
The document discusses comparing the performance and power of ARM Cortex and RISC-V processors for AI applications. It outlines a methodology for modeling systems from the microarchitecture to SoC level using different instruction sets. Examples are provided to demonstrate how the methodology can be used to improve the accuracy of comparisons between architectures.
The document proposes a new framework for hardware/software co-design that raises the abstraction level to allow uniform system description and faster simulation. The framework uses parameterizable hardware cores along with estimation to explore design spaces and generate optimized implementations. It has been applied to the RoadRunner project and focuses on further developing the hardware aspects. Future work includes expanding to software domains and refining system simulation and estimation techniques.
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdlIaetsd Iaetsd
This document proposes a flexible accelerator architecture that exploits carry-save arithmetic to efficiently implement digital signal processing kernels. The architecture includes flexible computational units that can be configured to perform chained addition, multiplication, and addition operations directly on carry-save formatted data without intermediate conversions. Experimental results show the proposed architecture delivers average gains of 61.91% in area-delay product and 54.43% in energy consumption compared to state-of-the-art flexible data paths.
RAMSES: Robust Analytic Models for Science at Extreme ScalesIan Foster
This document discusses the RAMSES project, which aims to develop a new science of end-to-end analytical performance modeling of science workflows in extreme-scale science environments. The RAMSES research agenda involves developing component and end-to-end models, tools to provide performance advice, data-driven estimation methods, automated experiments, and a performance database. The models will be evaluated using five challenge workflows: high-performance file transfer, diffuse scattering experimental data analysis, data-intensive distributed analytics, exascale application kernels, and in-situ analysis placement.
Vlsi design process for low power design methodology using reconfigurable fpgaeSAT Journals
Abstract
Modern digital processing applications have an increasing demand for computational power while needing to preserve low power dissipation and high flexibility. For many applications, the growth of algorithmic complexity is already faster than the growth of computational power provided by discrete general-purpose processors. A typical approach to address this problem is the combination of a processor core with dedicated accelerators. Since changes in standards or algorithms can change the demands on the accelerators, an attractive alternative to highly customized VLSI macros is suggested with the usage of reconfigurable embedded FPGAs (eFPGAs). Keyword: embedded FPGA, Fast computing, Hybrid design.
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
Oak Ridge National Lab is home of Titan, the largest GPU accelerated supercomputer in the world. This fact alone can be an intimidating experience for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools to successfully port applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Logic synthesis is the process of converting a high-level design description into an optimized gate-level representation using a standard cell library and design constraints. The process involves translating the RTL description into an unoptimized internal representation, optimizing the logic, technology mapping, and producing an optimized gate-level netlist. An example logic synthesis flow is described for a 4-bit magnitude comparator design from RTL to optimized gates.
How to use Parquet as a Basis for ETL and AnalyticsDataWorks Summit
Parquet is a columnar storage format that provides efficient compression and querying capabilities. It aims to store data efficiently for analysis while supporting interoperability across systems. Parquet uses column-oriented storage with efficient encodings and statistics to enable fast querying of large datasets. It integrates with many query engines and frameworks like Hive, Impala, Spark and MapReduce to allow projection and predicate pushdown for optimized queries.
Flexible dsp accelerator architecture exploiting carry save arithmeticNexgen Technology
The document proposes a novel flexible accelerator architecture comprising computational units (FCUs) that support the execution of various digital signal processing (DSP) operation templates. The FCUs perform computations using carry-save (CS) arithmetic, allowing intermediate results to be reused without conversion to binary. This enables more aggressive CS optimizations than previous approaches. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2. Each FCU can be configured to perform addition, subtraction, and multiplication operations in a pipelined fashion to fuse computations and improve performance.
- Service chaining provides a common way to deliver multiple services in a specific order, decoupling network topology from services and enabling dynamic service insertion.
- It has both a data plane, using a common service header (NSH) to build service chains, and a control plane for policy and mapping overlay addresses to the physical network.
- Work has included implementing NSH encapsulation/decap in OVS and adding WireShark support, with ongoing work on LISP integration and control plane functionality.
The document discusses multiprocessor system-on-chip (MPSoC) communication fabrics and network-on-chip (NoC) interconnect architectures. It describes how MPSoCs are used in applications like cell phones and digital TV. It then discusses challenges in MPSoC design and why NoC approaches are better than bus-based or symmetric multiprocessor designs. Finally, it summarizes some example NoC implementations like IBM CoreConnect and the xPipes Lite application-specific NoC.
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
Mirabilis Design provides the VisualSim Versal Library, which enables system architects and algorithm designers to quickly map signal processing algorithms onto the Versal FPGA and define the fabric based on performance. The Versal IP supports all of the heterogeneous resources.
Similar to Architectural_Synthesis_for_DSP_Structured_Datapaths (20)
This document describes a speed-up technique for a Windows image scalar algorithm. It involves detecting when an output pixel generation cycle will be immediately followed by an input pixel consumption cycle. In this case, the cycles can be merged to improve performance. Specifically:
- During an output cycle, the algorithm checks if the remaining input fragment after subtracting the output fragment is less than the inverse scale factor.
- If so, the input pixel is fully consumed in this merged cycle. The accumulator is updated, the output pixel is produced, and a new input pixel is fetched.
- This avoids retaining the input pixel for an extra cycle and improves efficiency, especially for decimation cases where an input pixel often contributes to multiple
This document describes software for 2D block scaling and rotation control. It includes a top level function for scaling and rotating images and describes the dependencies and sub-functions. It focuses on vertical block scaling control, explaining how it determines the number of vertical blocks, initializes starting/ending rows for input/output blocks, and adjusts these values based on scaling factors and scan direction.
The document analyzes the performance of single BLT (bit blit) operations for clearing blackness on images of varying heights from 100 to 600 pixels. It finds that the total time for BLT operations increases linearly with image height. On average, each BLT operation takes approximately 1.35 3D GPU clocks or 12.3 nanoseconds per pixel, with some variation depending on the image height.
The document discusses color processing using the CIECAM02 color appearance model. It begins with an agenda that covers challenges, color spaces like RGB, XYZ, LMS, and CIECAM02. It then explains CIECAM02 and its inverse, how they model human color perception and account for viewing conditions. The document discusses color processing techniques like contrast enhancement, saturation adjustment, hue manipulation, and gamut mapping to handle out-of-gamut colors. It aims to perform color processing and management across the color reproduction chain from capture to display in a perceptually accurate manner.
The document discusses post-processing deblocking filters used in video coding standards like H.264 and MPEG-2. It describes how blocking artifacts can occur during video compression due to quantization and motion compensation. It then explains that deblocking filters help reduce blocking artifacts by applying filtering to block boundaries in the decoded video. Specifically, it discusses the differences between post-processing and in-loop deblocking filters, and provides details on how deblocking is implemented in standards like H.263+, H.264, MPEG-2, and JPEG.
The document proposes approximating the logarithm function log2 through piecewise linear interpolation over intervals of the input domain. It evaluates the approximation error for varying numbers of intervals over two ranges, [0.5, 1] and [1, 2], and shows that the error decreases as the number of intervals increases. Plots of the true log2, approximated log2, and approximation error support this finding. The approximation achieves high accuracy with over 64 intervals.
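A piecewise linear approximation of this kind can be sketched as a lookup table of breakpoints plus linear interpolation between neighbors. The code below is an illustration under assumptions: uniform interval spacing, the function names, and the sampling-based error estimate are choices made here, not details taken from the document.

```python
import math


def make_log2_table(num_intervals: int, lo: float = 1.0, hi: float = 2.0) -> list:
    """Exact log2 values at num_intervals + 1 uniformly spaced breakpoints."""
    step = (hi - lo) / num_intervals
    return [math.log2(lo + i * step) for i in range(num_intervals + 1)]


def approx_log2(x: float, table: list, lo: float = 1.0, hi: float = 2.0) -> float:
    """Linear interpolation between the two breakpoints around x."""
    n = len(table) - 1
    pos = (x - lo) / (hi - lo) * n
    i = min(int(pos), n - 1)   # clamp so that x == hi uses the last interval
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def max_error(num_intervals: int, lo: float = 1.0, hi: float = 2.0,
              samples: int = 10000) -> float:
    """Largest absolute error observed over a dense sweep of [lo, hi]."""
    table = make_log2_table(num_intervals, lo, hi)
    worst = 0.0
    for k in range(samples + 1):
        x = lo + (hi - lo) * k / samples
        worst = max(worst, abs(approx_log2(x, table, lo, hi) - math.log2(x)))
    return worst
```

Calling `max_error` with increasing interval counts on either [0.5, 1] or [1, 2] shows the error shrinking as the table grows, which matches the trend the document reports.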
The document describes a video noise reduction system that uses an adaptive recursive filter. It averages a portion of the input frame with a delayed frame to reduce noise while preserving edges and details where there is no motion. The amount of noise reduction depends on the number of frames averaged and a parameter k that adapts to the average noise level. It also uses adaptive coring thresholds based on measured noise levels to determine whether pixels are filtered, bypassing the filter for large differences likely due to motion rather than noise. The system architecture includes components for YC separation, noise measurement, filtering, and output formatting. Performance results show improved noise reduction over time as more frames are averaged while minimizing ghosting artifacts from motion.
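The recursive averaging and coring behavior described above can be sketched per pixel as follows. This is a simplified model under stated assumptions: `k` and `coring_threshold` are fixed hypothetical parameters here (the document's system adapts them to the measured noise level), `k` is taken to weight the new input, and frames are plain lists of luma values.

```python
def recursive_noise_filter(frames, k: float = 0.25, coring_threshold: float = 20):
    """Temporal recursive filter with a coring bypass for large differences.

    frames: iterable of equal-length lists of pixel values.
    k: fraction of the new frame blended into the delayed (filtered) frame.
    coring_threshold: differences above this are treated as motion, not noise.
    """
    previous = None
    filtered_frames = []
    for frame in frames:
        if previous is None:
            current = list(frame)  # nothing to average with yet
        else:
            current = []
            for new_pixel, old_pixel in zip(frame, previous):
                diff = new_pixel - old_pixel
                if abs(diff) > coring_threshold:
                    # Large change: likely motion, so bypass the filter.
                    current.append(new_pixel)
                else:
                    # Small change: likely noise, so blend with the history.
                    current.append(old_pixel + k * diff)
        filtered_frames.append(current)
        previous = current
    return filtered_frames


result = recursive_noise_filter([[100, 100], [104, 200]])
```

In the second frame of this example, the first pixel changes by 4 and is smoothed toward the history, while the second pixel changes by 100 and passes through unfiltered.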
This document describes a video color processing algorithm that aims to improve color accuracy and image quality on mobile devices. It discusses developing algorithms to enable color enhancements without distortions, adapting to viewing conditions like ambient light, and accurately reproducing colors on wide gamut displays. The algorithm uses the CIECAM02 perceptual color model and involves offline computation of various parameters to transform color spaces and enable color and contrast processing.
Inertial sensors use a mass-spring system where a proof mass is suspended by a spring and responds to input forces. The displacement of the mass is measured to sense the force. Forces can be applied through electrostatic transduction. Capacitive sensing is commonly used to measure the displacement of the mass. The system acts as a second-order dynamical system where the input force is transduced to mass displacement which is then transduced to an output charge. Key parameters that impact sensor performance include the transduction gain and damping forces.
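As a worked form of the second-order system mentioned above, the standard mass-spring-damper equation is shown below. The symbols are assumptions introduced here rather than notation from the document: $m$ is the proof mass, $b$ the damping coefficient, $k$ the spring constant, $x$ the displacement, and $F$ the input force.

```latex
m\,\ddot{x}(t) + b\,\dot{x}(t) + k\,x(t) = F(t),
\qquad
\omega_0 = \sqrt{\frac{k}{m}},
\qquad
Q = \frac{\sqrt{k\,m}}{b}
```

Here $\omega_0$ is the undamped natural frequency and $Q$ the quality factor; well below resonance the displacement is approximately $x \approx F/k$, which is the quantity the capacitive readout senses.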
- Earth's magnetic field is normally uniform, but can be distorted by hard and soft iron distortions.
- Hard iron distortions are caused by permanent magnets adding a constant offset, while soft iron distortions are caused by magnetically permeable materials distorting the field.
- To compensate for these distortions, hard iron offsets are subtracted from readings and soft iron scale factors are multiplied to readings based on data from rotating the sensors.
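The compensation step in the list above can be sketched with a simple min/max calibration. This is one common approach and an assumption here, not necessarily the document's method: hard iron offsets are taken as the midpoint of each axis range, and soft iron scale factors equalize the per-axis ranges. The function names are hypothetical, and the sketch assumes the rotation data covers a nonzero range on every axis.

```python
def calibrate(samples):
    """Estimate hard iron offsets and soft iron scale factors.

    samples: list of (x, y, z) readings collected while rotating the sensor.
    """
    mins = [min(s[axis] for s in samples) for axis in range(3)]
    maxs = [max(s[axis] for s in samples) for axis in range(3)]
    # Hard iron: constant offset at the center of each axis range.
    offsets = [(hi + lo) / 2 for hi, lo in zip(maxs, mins)]
    # Soft iron (diagonal approximation): rescale each axis to the mean radius.
    radii = [(hi - lo) / 2 for hi, lo in zip(maxs, mins)]
    mean_radius = sum(radii) / 3
    scales = [mean_radius / r for r in radii]
    return offsets, scales


def compensate(reading, offsets, scales):
    """Subtract the hard iron offset, then multiply by the soft iron scale."""
    return [(v - o) * s for v, o, s in zip(reading, offsets, scales)]


rotation_data = [
    (12, 5, -3), (8, 5, -3),
    (10, 9, -3), (10, 1, -3),
    (10, 5, -2), (10, 5, -4),
]
offsets, scales = calibrate(rotation_data)
corrected = compensate((12, 5, -3), offsets, scales)
```

A full soft iron correction generally uses a 3x3 matrix; the per-axis scale factors here handle only distortion aligned with the sensor axes.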
MP3 Audio Decoding involves perceptual audio encoding using psychoacoustic analysis and quantization. It uses a filter bank to split audio into 32 subbands and a hybrid filter bank combining MDCT and traditional filter banks. Quantization and encoding involves bit allocation across scalefactor bands based on masking thresholds from the psychoacoustic model. The decoder reconstructs audio using inverse quantization and filtering.
The document describes the android::Fusion class which performs sensor fusion to estimate attitude and gyro bias from gyroscope, accelerometer, and magnetometer sensors. The Fusion class contains public and private member functions for initialization, sensor data handling, prediction, updating the state estimate, and retrieving results. It uses quaternions to represent attitude and a Kalman filter to fuse the sensor data.
Gyroscope sensors measure angular velocity by detecting the Coriolis effect on a vibrating mass. They have specifications including measurement range, number of sensing axes, nonlinearity, temperature range, and noise parameters. MEMS gyroscopes typically use a vibrating proof mass driven electrostatically while rotation is detected via sense electrodes measuring the Coriolis-induced deflection perpendicular to the drive mode. The Coriolis effect causes an apparent deflection in a rotating reference frame due to inertial forces.
The 2D composition engine provides the following key capabilities in 3 sentences or less:
It performs 2D graphics operations like block copy, rotation, scaling, color space conversion, alpha blending, and ROP operations. It supports various image formats and color spaces. The architecture includes a core processing unit with functional blocks for scaling, rotation, Porter-Duff compositing, and ROP, and it interfaces with external memory and clients through a VPDMA unit.
The document describes an algorithm for block-scaling control during vertical resizing of images. It involves dividing the target image into vertical blocks, and computing the corresponding input blocks based on the scaling ratio and scan direction. For each target block, it determines the start and end rows of the corresponding input block. It also tracks the start rows of subsequent blocks to account for cases where a block maps to a whole number of input rows. This ensures accurate mapping between input and output blocks during upscaling and downscaling in both vertical up and down scan directions.
The document compares the 2DBitBlt resampling scaler architecture to other scaling architectures. 2DBitBlt resampling uses a hardware efficient algorithm adapted from image warping with weighted resampling and no power of 2 limitation. It performs anti-aliasing as part of the algorithm and has potential for parallel processing. Charts show 2DBitBlt resampling outperforming polyphase and bicubic scaling in terms of aliasing, while being simpler with a single line buffer. While images may be softer than bicubic, it has advantages of guaranteed anti-aliasing and better performance for higher decimation ranges.
This document discusses the xvYCC color space, which provides better gamut coverage than sRGB. It explains that the color gamut of an RGB system can be visualized as a triangle in the xyY plane. It then describes how xvYCC represents an 8-bit color space and how its gamma correction differs from the standard sRGB gamma correction in order to accommodate its expanded gamut. Finally, it shows how xvYCC affects the R, G, and B color components both with and without gamma correction applied.
The document describes the Mismatch Noise Cancellation (MNC) architecture. The key components of the MNC architecture are:
1. A pseudo-random number generator that generates random binary sequences.
2. A mismatch estimation block that estimates mismatches.
3. A noise cancellation block that corrects the effects of mismatches.
4. Synchronization elements that synchronize data flow.
This document is a thesis submitted by Shereef B. M. Shehata to Concordia University in 1997 for the degree of Doctor of Philosophy in Electrical and Computer Engineering. The thesis proposes a technique for high level synthesis of digital signal processing cores targeting field programmable gate arrays (FPGAs). The technique aims to optimize the total execution time of the synthesized architecture using integer linear programming while accounting for the structural characteristics of FPGAs early in the synthesis process. This includes optimizing interconnect usage and estimating system clock duration.
1. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Architectural Synthesis of DSP Structured Datapaths
Shereef B. M. Shehata
2. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
OUTLINE
• An overview of the architectural-level synthesis problem.
• Subtasks of the high-level synthesis problem:
  - Scheduling
  - Binding
  - Architecture optimization
• NP-hard problems (heuristics versus mathematical programming techniques).
• Novel mathematical programming formulation of the synthesis problem:
  - Linearization of the quadratic nonlinear problem
  - Optimization of performance and structural complexity
  - Techniques to improve the solution time for the ILP formulation
  - Heuristics as bounds for mathematical programming
• Results for typical HLS benchmarks.
• Conclusion.
3. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Motivation
• To develop an architectural synthesis technique specific to DSP architectures targeting FPGA implementations.
• The technique is general enough to accommodate other technologies, such as new submicron technologies.
• To provide an accurate evaluation method for our high-level synthesis methodologies:
  - The total execution time, not the number of control steps, is the yardstick for performance comparison.
• To exploit important features of FPGA technology:
  - Large number of registers.
  - FPGA utilization is largely reduced with complex interconnections.
  - High multiplexer cost.
  - Wide difference between the delays of multiplications and additions.
  - Efficient RAM storage.
  - Dedicated high-speed carry-propagation circuit.
5. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Symmetrical Array FPGA Module (Xilinx)
• CLB routing is associated with each row and column of the CLB array.
• Global routing consists of dedicated networks primarily designed to distribute clocks throughout the device with minimum delay and skew. It can also be used to distribute high-fan-out signals throughout the device with minimum delay.
• The number of global nets and buffers has increased in more recent Xilinx 4000 generations to allow more flexibility in routing.
[Figure: symmetrical-array FPGA, with labels "Programmable Logic Block", "Programmable Connection Matrix", and "Programmable Switching Matrix".]
6. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
XC4000 family switch box architecture
• SRAM configuration cells imply reuse and prototyping: the hardware becomes reconfigurable and the designer can update the system on the fly.
• The total size of the SRAM configuration cell and the transistor switch that it drives is larger than the programming devices used in antifuse technologies.
• The horizontal and vertical single- and double-length lines intersect at a box called a programmable switch matrix. Each switch matrix consists of programmable pass transistors used to establish connections between the lines.
[Figure: switch matrix and its interconnect points on the data lines; six pass transistors per switch-matrix interconnect point.]
7. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Xilinx 4000 Configurable logic block (dedicated carry logic is not shown)
• The inputs C1-C4 can also be used to control the use of the F and G LUTs as 32 bits of SRAM.
• Mux control maps the four control inputs (C1-C4) into: LUT input H1, direct in (DIN), enable clock (EC), and set/reset for the flip-flops.
• The XC4000 CLB also has special fast dedicated carry logic hardwired between the CLBs.
[Figure: XC4000 CLB with inputs G1-G4, F1-F4, and C1-C4 (mapped to H1, DIN, and the flip-flop controls), three LUTs, a programmable multiplexer, two D flip-flops with set/reset (outputs Q1 and Q2), the clock input, and carry in/carry out lines to/from adjacent CLBs.]
8. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Carry propagation paths in Xilinx 4000 series
• The carry chain in the XC4000 can run either up or down. At the top or bottom of the columns, where there are no more CLBs, the carry is propagated to the right.
• The fast carry logic can be accessed by using Relationally Placed Macros (RPMs) that already include special library symbols for using the fast carry logic.
• The carry logic shares operands and control with the function generators.
[Figure: 4 x 4 array of CLBs with the dedicated carry path running through the columns.]
9. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Interconnect Overview for the XC4000 family
[Figure: interconnect resources around an XC4000 CLB: single-, double-, and quad-length lines, long lines, global clock lines, direct connects, and the carry chain.]
11. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Details of XC4000 dedicated carry logic.
• The two 4-input function generators can be configured as a 2-bit adder with built-in hidden carry that can be expanded to any length.
• This dedicated carry circuitry is so fast that conventional speed-up methods like carry generate/propagate have marginal benefit at the 32-bit level and almost no effect at the 16-bit level.
[Figure: one CLB as a 2-bit adder. The F function generator takes A_i, B_i, and carry C_i and produces S_i; the G function generator takes A_{i+1}, B_{i+1} and produces S_{i+1}; the carry out is C_{i+2}.]
12. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Details of a Logic Array Block (LAB) in FLEX 8000 family
• Eight LEs are stacked to form a Logic Array Block (LAB).
[Figure: LAB with eight LEs, the LAB local interconnect, LAB control signals, row and column interconnect, carry-in from the LAB on the left, and carry-out to the LAB on the right.]
13. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
FLEX 8000 Logic Element(LE)
• The FLEX LE uses a four-input LUT, a flip-flop, cascade logic, and carry logic.
[Figure: FLEX 8000 LE with data inputs DATA1-DATA4, control inputs LABCTRL1-LABCTRL4, the look-up table (LUT), carry chain (Carry-In, Carry-Out), cascade chain (Cascade-In, Cascade-Out), clear/preset logic, clock select, and a D flip-flop (PRN, CLRN) driving LE Out.]
14. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Flex 8000 device block diagram
[Figure: FLEX 8000 device block diagram: I/O elements (IOEs) around the periphery, logic elements grouped into Logic Array Blocks (LABs), and the FastTrack Interconnect.]
15. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
General Architecture Model
[Figure: general architecture model. FU modules (FU_i, FU_j) with optional register multiplexers, FU multiplexers, and sub-modules; chaining interconnect and chaining register R between FUs; register file (RAM) modules; FU outputs driving one of the pipelined tristate busses through a driver; and a control unit issuing interconnect control signals and function unit and register control signals.]
16. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
[Figure: synthesis flow. Inputs: the CDFG/DFG and technology data ("Tech").]
STEP-1: Scheduling Bounds
  - Heuristics to determine the lower bound on the number of cycles.
  - Heuristics to tighten the ASAP/ALAP values under the given resource constraints.
STEP-2: C++: Constraint Generation for ILP
  - DFG exploration.
  - Dynamic set generation for chaining.
  - ILP constraint generation.
STEP-3: ILP: Random Topology
  - Scheduling and binding.
  - Chaining of operations.
  - Interconnect minimization.
  - Clock cycle minimization + FU pipelining choice.
  - Minimization of the number of cycles, OR minimization of the total execution time (i.e. throughput maximization).
STEP-4: ILP: Bus Insertion
  - Bus transfer scheduling.
  - Bus allocation.
  - Interconnect minimization.
  - Bus loading minimization.
STEP-LAST: Register Allocation
  - Data storage assignment.
  - Storage minimization.
  - Bus loading minimization.
Output: VHDL generation of the datapath and the controller, passed to logic synthesis tools.
17. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Flow of the Back-End Tools
• Stage-2 uses Synopsys tools (logic synthesis and FPGA mapping), and Stage-3 uses the Xilinx XACT tools for partition, placement, and routing (PPR).
[Figure: back-end flow. Inputs: VHDL source files and Xilinx hard macros, with simulation. Stage-2 (Synopsys): read HDL and insert pads, then compile and optimize the datapath and controller, using the Xilinx library under area constraints, delay constraints, and FU pipelining (i.e. register balancing). Stage-3 (Xilinx): partition, placement, and routing, with the result passed to simulation.]
19. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Basic Definitions.
• A polyhedron $P \subseteq \mathbb{R}^n$ is the set of points that satisfy a finite number of linear inequalities, that is:
$$P = \{\, x \in \mathbb{R}^n : A \cdot x \le b \,\}$$
• A polytope is a bounded polyhedron, that is:
$$\exists\, w \in \mathbb{R}^1 : \quad P \subseteq \{\, x \in \mathbb{R}^n : -w \le x_j \le w, \ \forall j \ (j = 1 \dots n) \,\}$$
• A polyhedron face: the set $F = \{\, x \in P : \pi \cdot x = \pi_0 \,\}$ is called a face of $P$, and the valid inequality $\pi \cdot x \le \pi_0$ is said to define the face $F$.
20. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• The convex hull: given a set $S \subseteq \mathbb{R}^n$ and a point $x \in \mathbb{R}^n$, the convex hull of $S$, denoted by Conv(S), is the set of points that can be written as a convex combination of a finite number of points in $S$:
$$Conv(S) = \Big\{\, x \in \mathbb{R}^n_+ : \ x = \sum_{i=1}^{t} \lambda_i \cdot x^i, \quad \sum_{i=1}^{t} \lambda_i = 1, \quad \lambda \in \mathbb{R}^t_+ \,\Big\}$$
• where $x^1, x^2, \dots, x^t$ are any finite set of points in $S$. The convex hull Conv(S) can be described by a finite set of linear inequalities.
21. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• A partially ordered set $(X, B)$, or poset, is a non-empty set $X$ and a binary relation $B$ on $X$ which is reflexive, anti-symmetric, and transitive. The elements of $X$ are called points and the binary relation $B$ is called a partial ordering on $X$.
• A strict partially ordered set $(X, \tilde{B})$, or Sposet, is a non-empty set $X$ and a binary relation $\tilde{B}$ on $X$ which is irreflexive, anti-symmetric, and transitive.
• We use $x B y$ to denote that $(x, y) \in B$, and $x \tilde{B} y$ to denote that $(x, y) \in \tilde{B}$.
• A Hasse diagram of a poset $(X, P)$ is a drawing in which the points of $X$ are placed so that if $y$ covers $x$, then $y$ is placed at a higher level than $x$ and joined to $x$ by a line segment. The corresponding graph is called a Hasse graph of the poset.
• A clique in a graph $G = (V, E)$ is a set $C \subseteq V$ with the property that every pair of nodes in $C$ is joined by an edge.
• A subset $A \subseteq V$ of the vertices of the graph $G = (V, E)$ is an r-clique if it induces a complete subgraph, i.e. $G_A \cong K_r$.
• A stable set (or independent set) of vertices is a subset $X$ of the vertex set of a graph $G$, no two of which are adjacent.
22. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• A comparability graph is an undirected graph that is transitively orientable. That is, each edge can be assigned a one-way direction such that the resulting directed graph $G = (V, E)$ satisfies the following condition: $(a, b) \in E$ and $(b, c) \in E$ imply $(a, c) \in E$, $\forall a, b, c \in V$.
• A graph $G$ is a triangulated graph if every simple cycle of length strictly greater than 3 possesses a chord.
• The stability number $\alpha(G)$ of $G$ is the number of vertices in a stable set of maximum cardinality.
• The chromatic number $\gamma(G)$ of $G$ is the smallest possible $k$ for which there exists a proper k-coloring of $G$.
• The clique number $\omega(G)$ of $G$ is the number of vertices in a clique of maximum cardinality.
• The clique cover number $\theta(G)$ is the smallest number of complete subgraphs needed to cover the vertices of $G$, i.e. the size of the smallest possible clique cover of the graph $G$.
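The four graph parameters defined above can be illustrated by exhaustive search on a very small graph. The sketch below is only an illustration: the example graph (a triangle 1-2-3 with a pendant vertex 4) and all function names are assumptions, not taken from the slides, and the brute-force search is only practical for a handful of vertices. It uses the standard fact that the clique cover number of a graph equals the chromatic number of its complement.

```python
from itertools import combinations, product

def all_pairs(nodes, edges, adjacent):
    """True if every pair in `nodes` is adjacent (adjacent=True) or non-adjacent (adjacent=False)."""
    return all((frozenset(p) in edges) == adjacent for p in combinations(nodes, 2))

def largest_subset(vertices, edges, adjacent):
    """Size of the largest clique (adjacent=True) or stable set (adjacent=False)."""
    return max(
        len(sub)
        for r in range(len(vertices) + 1)
        for sub in combinations(vertices, r)
        if all_pairs(sub, edges, adjacent)
    )

def chromatic_number(vertices, edges):
    """Smallest k admitting a proper k-coloring, found by exhaustive search."""
    for k in range(1, len(vertices) + 1):
        for colors in product(range(k), repeat=len(vertices)):
            color = dict(zip(vertices, colors))
            if all(color[u] != color[v] for u, v in (tuple(e) for e in edges)):
                return k

# Hypothetical example graph: triangle 1-2-3 plus vertex 4 attached to vertex 3.
V = [1, 2, 3, 4]
E = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
E_complement = {frozenset(p) for p in combinations(V, 2)} - E

alpha = largest_subset(V, E, adjacent=False)   # stability number
omega = largest_subset(V, E, adjacent=True)    # clique number
gamma = chromatic_number(V, E)                 # chromatic number
theta = chromatic_number(V, E_complement)      # clique cover number
print(alpha, omega, gamma, theta)
```

For this example graph the stability number equals the clique cover number and the clique number equals the chromatic number.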
23. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• A vertex packing on a graph $G = (V, E)$ is a set of vertices $U \subseteq V$ with the property that no pair of vertices in $U$ is joined by an edge.
• The fractional vertex packing polytope of a graph $G = (V, E)$ is
$$P = \{\, x \in \mathbb{R}^n_+ : \kappa \cdot x \le 1 \,\}$$
where $n = |V|$ and $\kappa$ is the maximal clique matrix of the graph $G$.
25. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Simultaneous Performance Optimization and Interconnect minimization
• Exploration of a much larger solution space, guided by a highly selective objective function that rejects architectures whose heavier interconnection makes them unsuitable for FPGA implementation.
• Developing an ILP formulation that incorporates:
  - Multilevel chaining of operations and deeply pipelined functional units, which are effective for FPGAs.
  - Optimal scheduling and binding of operations while minimizing interconnections.
  - Determination of the system clock duration.
  - Minimization of the total execution time rather than the number of control steps.
26. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Details of the Integer Linear Programming Formulation
• Operation Assignment Constraints
• This constraint assigns every operation of the DFG to exactly one control step and one FU:
$$\sum_{s \in Range(op)} \ \sum_{n=1}^{N_t} X_{op,n,s} = 1 \qquad \forall\, op$$
[Figure: the variables $X_{i,n,s}$ of Op i (FUs n = 1, 2, 3; c-steps ASAP(op_i) = 1 to ALAP(op_i) = 4) and $X_{j,n,s}$ of Op j (FUs n = 1, 2; c-steps ASAP(op_j) = 2 to ALAP(op_j) = 5), with OPi preceding OPj. The variables in the shaded region add up to 1.]
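The operation assignment constraint can be generated mechanically from the ASAP/ALAP range of each operation and the number of FU instances of its type. The following is a minimal sketch under stated assumptions: the function name, the textual `X[op,n,s]` naming, and the example data (chosen to mirror the figure, with Op i on three FUs over c-steps 1 to 4 and Op j on two FUs over c-steps 2 to 5) are hypothetical and are not the author's constraint generator.

```python
from itertools import product

def assignment_constraints(ranges, n_units):
    """Return {op: [variable names]}; the variables of each list must sum to exactly 1.

    ranges  : {op: (asap, alap)}  -> Range(op) is asap..alap inclusive
    n_units : {op: N_t}           -> number of FU instances of the type executing op
    """
    rows = {}
    for op, (asap, alap) in ranges.items():
        steps = range(asap, alap + 1)          # s in Range(op)
        units = range(1, n_units[op] + 1)      # n = 1..N_t
        rows[op] = [f"X[{op},{n},{s}]" for s, n in product(steps, units)]
    return rows

# Hypothetical data mirroring the figure on this slide.
ranges = {"i": (1, 4), "j": (2, 5)}
n_units = {"i": 3, "j": 2}
rows = assignment_constraints(ranges, n_units)

for op, terms in rows.items():
    print(" + ".join(terms), "= 1")
```

Each printed row is one equality constraint; a solver-specific front end would translate the names into its own variable objects.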
27. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Details of the Integer Linear Programming Formulation
• Function Unit Assignment Constraint
• Each FU has at most one operation assigned at a given time:
$$\sum_{op \in FU_t} \ \sum_{p = s - L(op) + 1}^{s} X_{op,n,p} \le 1 \qquad \forall n, \ \forall s$$
[Figure: the variables $X_{op,n,s}$ of Op i, Op j, and Op k (FUs n = 1, 2 each) over c-steps 1 to 5, with OPi preceding OPj. The summation of the highlighted variables is at most 1.]
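The function unit assignment constraint can be checked against a candidate schedule by evaluating its left-hand side for every FU instance and control step. The sketch below assumes a single FU type and reads L(op) as the number of control steps for which an operation occupies its unit; that reading, the function name, and the sample schedules are assumptions made for illustration.

```python
def fu_conflicts(schedule, latency, n_steps):
    """Return a list of (fu_instance, step, load) where the constraint is violated.

    schedule : {op: (fu_instance, start_step)}
    latency  : {op: L(op)}, steps for which op occupies its unit (assumed meaning)
    """
    conflicts = []
    for n in sorted({unit for unit, _ in schedule.values()}):
        for s in range(1, n_steps + 1):
            # Left-hand side: operations on unit n started in steps s-L(op)+1 .. s.
            load = sum(
                1
                for op, (unit, p) in schedule.items()
                if unit == n and s - latency[op] + 1 <= p <= s
            )
            if load > 1:
                conflicts.append((n, s, load))
    return conflicts

latency = {"i": 2, "j": 2, "k": 1}                      # hypothetical latencies
ok_schedule = {"i": (1, 1), "j": (1, 3), "k": (2, 2)}   # i and j use unit 1 back to back
bad_schedule = {"i": (1, 1), "j": (1, 2), "k": (2, 2)}  # i and j overlap at step 2

print(fu_conflicts(ok_schedule, latency, n_steps=5))
print(fu_conflicts(bad_schedule, latency, n_steps=5))
```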
28. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Details of the Integer Linear Programming Formulation
• Scheduling of partially ordered operations has to follow the precedence order (no chaining):
$$\sum_{p = s - D(op_i) + 1}^{ALAP(op_i)} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = ASAP(op_j)}^{s} \ \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$ASAP(op_j) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s, \ \forall (op_i \to op_j)$$
[Figure: the variables $X_{i,n,s}$ of Op i (FUs n = 1, 2, 3) and $X_{j,n,s}$ of Op j (FUs n = 1, 2) over c-steps 1 to 6, with OPi preceding OPj and markers for ASAP(op_i), ALAP(op_i), ASAP(op_j), and the current c-step. The variables in the shaded region add up to at most 1.]
29. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
To Determine the Total length of the schedule
• The following constraint determines the total number of steps $T$ from the schedule of the operations in the set $W$, where $W$ is the set of operations without successors in the DFG:
$$\sum_{s \in Range(op)} \ \sum_{n=1}^{N_t} s \times X_{op,n,s} \ - \ T \le -\big(D(op) - 1\big) \qquad \forall\, op \in W$$
• The variable $T$ has both an upper and a lower bound (determined from heuristics):
$$T \ge T_{cr}$$
$$T \le T_{cr} + \Delta T$$
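For an integral schedule, the constraint above says that every sink operation starting at step s with duration D(op) must finish by step T, i.e. s + D(op) - 1 <= T. A minimal sketch, with hypothetical function names and example data:

```python
def min_schedule_length(start, duration, sinks):
    """Smallest T satisfying start[op] - T <= -(duration[op] - 1) for every sink op."""
    return max(start[op] + duration[op] - 1 for op in sinks)

def within_bounds(t, t_cr, delta_t):
    """Check the heuristic bounds T_cr <= T <= T_cr + delta_T."""
    return t_cr <= t <= t_cr + delta_t

# Hypothetical schedule: 'b' and 'c' have no successors in the DFG.
start = {"a": 1, "b": 3, "c": 4}
duration = {"a": 1, "b": 2, "c": 1}
T = min_schedule_length(start, duration, sinks=["b", "c"])
print(T, within_bounds(T, t_cr=4, delta_t=2))
```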
30. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Constraints to minimize the structural complexity of the synthesized Architecture
• Counting the number of motifs:
$$\sum_{\substack{s \in Range(op_i) \\ op_i \in FU_t}} X_{op_i,n,s} \ + \sum_{\substack{s \in Range(op_j) \\ op_j \in FU_{t'}}} X_{op_j,n',s} \ - \ Motif(FU_t, n, FU_{t'}, n') \le 1$$
$$\forall (op_i \to op_j), \quad \forall n \ (n = 1 \dots N_t), \quad \forall n' \ (n' = 1 \dots N_{t'})$$
• A corresponding term to minimize the MOTIFSUM is included in the objective function to increase the utilization of the already assigned interconnect between different function units.
[Figure: the variables $X_{i,n,s}$ of Op i (FUs n = 1, 2, 3) and $X_{j,n,s}$ of Op j (FUs n = 1, 2) over c-steps 1 to 5, with markers for ASAP and ALAP of op_i and op_j. The summation of the highlighted variables sets the value of Motif(A, 2, M, 1).]
31. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Constraints to minimize the structural complexity of the synthesized Architecture
• Counting the number of chaining motifs:
$$\sum_{op_i \in FU_t} X_{op_i,n,s} \ + \sum_{op_j \in FU_{t'}} X_{op_j,n',s} \ - \ CMotif(FU_t, n, FU_{t'}, n') \le 1$$
$$\forall s, \quad \forall (op_i \to op_j), \quad \forall n \ (n = 1 \dots N_t), \quad \forall n' \ (n' = 1 \dots N_{t'})$$
• A corresponding term to minimize the CMOTIFSUM is included in the objective function to increase the utilization of the already assigned chaining interconnect between different function units.
[Figure: the variables $X_{i,n,s}$ of op i (FUs n = 1, 2, 3) and $X_{j,n,s}$ of op j (FUs n = 1, 2) over c-steps 1 to 5, with op_i preceding op_j. The summation of the highlighted variables sets the value of CMotif(A, 2, M, 1).]
32. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• Counting incompatible motifs:
  - The idea is to minimize the number of motifs that terminate on the same function unit. This decreases the number of multiplexers in the synthesized architecture.
$$\sum_{FU_t} \ \sum_{n=1}^{N_t} Motif(FU_t, n, FU_{t'}, n') \ - \ Incomp(FU_{t'}) \le 0 \qquad \forall n', \ \forall FU_{t'}$$
[Figure: schedules and motifs, and the resulting architecture with its I/O.]
33. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• Minimizing the maximum number of edges with the same FU destination type (incompatible motifs).
Introducing an integer variable to count the number of incompatible motifs:
$$\sum_{FU_t} \ \sum_{n=1}^{N_t} Motif(FU_t, n, FU_{t'}, n') \ - \ Incomp(FU_{t'}) \le 0 \qquad \forall FU_{t'}, \ \forall n'$$
[Figure: panels (a), (b), and (c).]
35. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Minimizing the Maximum Number of Edge Overlap
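$$K \times \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \notin wrap}} \left( \sum_{p = ASAP(op_i)}^{s} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ - \sum_{p = ASAP(op_j)}^{s} \ \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \right)$$
$$+ \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \in wrap}} \left( \sum_{p = ASAP(op_i)}^{s} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = s + 1}^{ALAP(op_j)} \ \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \right) - \ Maxovlap(FU_t) \le 0 \qquad \forall s, \ \forall FU_t$$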
K 1
36. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Formulation for chaining of Two operations per control step
• The destination operation cannot be scheduled "before" the source operation.
  - However, they can share the same control step.
$$\sum_{p = s - D(op_i) + 1}^{ALAP(op_i)} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = ASAP(op_j)}^{s - 1} \ \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$ASAP(op_j) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s, \ \forall (op_i \to op_j)$$
[Figure: the variables $X_{i,n,s}$ of Op i (FUs n = 1, 2, 3) and $X_{j,n,s}$ of Op j (FUs n = 1, 2) over c-steps 1 to 6, with OPi preceding OPj and markers for ALAP(op_i), ASAP(op_j), and the current c-step. The summation of the variables in the shaded regions is at most 1.]
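The only difference between this constraint and the no-chaining precedence constraint of the earlier slide is the upper limit of the destination sum (s - 1 instead of s). The sketch below enumerates every integral pair of start steps for one precedence edge and lists the pairs that satisfy the whole family of constraints over s. The ASAP/ALAP ranges and the duration D(op_i) = 2 are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
def violated(p_i, p_j, d_i, asap_j, alap_i, chaining):
    """True if some instance (over s) of the precedence constraint is violated
    when op_i starts at step p_i and op_j starts at step p_j."""
    # s runs over ASAP(op_j) .. ALAP(op_i) + D(op_i) - 1
    for s in range(asap_j, alap_i + d_i):
        # First sum: op_i started in steps s - D(op_i) + 1 .. ALAP(op_i).
        x_i = 1 if s - d_i + 1 <= p_i <= alap_i else 0
        # Second sum: op_j started in steps ASAP(op_j) .. s (or s - 1 with chaining).
        last = s - 1 if chaining else s
        x_j = 1 if asap_j <= p_j <= last else 0
        if x_i + x_j > 1:
            return True
    return False

def feasible_pairs(chaining, asap_i=1, alap_i=3, d_i=2, asap_j=2, alap_j=6):
    """All (p_i, p_j) start-step pairs allowed by the constraint family."""
    return {
        (p_i, p_j)
        for p_i in range(asap_i, alap_i + 1)
        for p_j in range(asap_j, alap_j + 1)
        if not violated(p_i, p_j, d_i, asap_j, alap_i, chaining)
    }

print(sorted(feasible_pairs(chaining=False)))
print(sorted(feasible_pairs(chaining=True)))
```

In this example the no-chaining variant admits exactly the pairs with p_j >= p_i + D(op_i), while the chaining variant also admits p_j = p_i + D(op_i) - 1, i.e. op_j starting in the control step in which op_i finishes.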
37. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Formulation for chaining of Two operations per control step
• The source operation cannot be scheduled "after" the destination operation.
  - However, they can share the same control step. This constraint and the previous one are not redundant: they tighten the formulation.
$$\sum_{p = s - D(op_i) + 2}^{ALAP(op_i)} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = s} \ \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$ASAP(op_j) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s, \ \forall (op_i \to op_j)$$
[Figure: the variables $X_{i,n,s}$ of Op i (FUs n = 1, 2, 3) and $X_{j,n,s}$ of Op j (FUs n = 1, 2) over c-steps 1 to 6, with OPi preceding OPj and markers for ASAP(op_i), ALAP(op_i), ASAP(op_j), and the current c-step. The variables in the shaded region add up to at most 1.]
38. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Formulation for chaining of Two operations per control step
• The following constraint prevents chaining of more than two operations in the same control step:
$$\sum_{p = s - D(op_i) + 1} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = s} \ \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1 \qquad \forall s, \ \forall (op_i, op_k) \in \Re_2$$
$$ASAP(op_k) \le s \le ALAP(op_i) + D(op_i) - 1$$
[Figure: the variables of Op i (FUs n = 1, 2, 3) and Op k (FUs n = 1, 2) over c-steps 1 to 6, with the precedence chain OPi, OPj, OPk and markers for ASAP(op_i), ALAP(op_i), ASAP(op_k), and the current c-step. The variables in the shaded region add up to at most 1.]
39. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Multi- Level Chaining
• Patterns to look for in the DFG:
[Figure: four DFG patterns, labelled A to D, involving op_i and op_k with addition (+) and multiplication (*) nodes.]
• Formulation:
  - By generating the set $\Delta_M \subseteq O^2$, such that $(op_1, op_2) \in \Delta_M$ if $op_1 \to op_M$, $op_M \to op_2$, and $op_M$ is a multi-cycle operation (e.g. a multiply operation).
  - The following constraint will then apply to the members of this set:
$$\sum_{p = s - D(op_i) + 1} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = s} \ \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$ASAP(op_k) \le s \le ALAP(op_i) + D(op_i) - 1, \qquad \forall s, \ \forall (op_i, op_k) \in \Delta_M$$
40. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Delay model for an N-bit adder implemented in Xilinx FPGAs
$$T_A = T_{OPCY} + \frac{N - 4}{2} \times T_{carry} + T_{sum}$$
For the Xilinx 4000 series, $T_{carry}$ is 0.7/1 ns, and $T_{OPCY}$ is 4 ns.
• The delay is linear in the number of bits. The proportionality factor is $T_{carry}$, and as such the dedicated carry chains make the fastest possible carry-path circuits.
[Figure: N-bit adder from the LSB (inputs A0, B0; output S0) to the MSB (inputs A_{N-1}, B_{N-1}; output S_{N-1}), showing $T_{OPCY}$ at the LSB end, one $T_{carry}$ per CLB across (N-4)/2 CLBs, and $T_{sum}$ at the MSB end.]
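The delay model can be evaluated directly for a few word lengths. The sketch below is an illustration under assumptions: the 4 ns operand-to-carry delay and the 0.7/1 ns carry delay are read from this slide, the 5 ns sum delay is taken from the following slide on the chained multiplier, and the restriction to even widths of at least 4 bits is an assumption added so that (N - 4)/2 counts whole CLBs. The function name is hypothetical.

```python
def adder_delay_ns(n_bits, t_opcy=4.0, t_carry=1.0, t_sum=5.0):
    """T_A = T_OPCY + ((N - 4) / 2) * T_carry + T_sum, in nanoseconds."""
    if n_bits < 4 or n_bits % 2:
        # Assumption: the model is applied to even widths of at least 4 bits.
        raise ValueError("model assumes an even width of at least 4 bits")
    return t_opcy + (n_bits - 4) / 2 * t_carry + t_sum

for width in (8, 16, 32):
    slow = adder_delay_ns(width, t_carry=1.0)
    fast = adder_delay_ns(width, t_carry=0.7)
    print(f"{width:2d}-bit adder: {fast:.1f} ns to {slow:.1f} ns")
```

Because the delay grows by only half of one carry delay per extra bit, the spread between 16-bit and 32-bit adders stays small compared with the fixed terms.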
41. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Delay model for a pipelined-multiplier chained with an adder
$$T_{pd} = T_{pipe} + T_{sum}$$
For the Xilinx 4000 series, $T_{sum}$ is 5 ns.
[Figure: the last pipeline stage of a multiplier chained with an adder, from the LSB to the MSB (outputs S0 to S_{N-1}), with $T_{OPCY}$, $T_{carry}$ per CLB, and $T_{sum}$ marked on both stages.]
43. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Scheduling with Multi-level chaining and Interconnect minimization
[Figure: two schedules of an addition tree with inputs i1 to i13 and output "out", and the two corresponding architectures built from Adder 1, Adder 2, Adder 3 and registers R1 and R2.]
First architecture:
• Extra number of mux inputs: 2
• Number of CLBs: 128
• Execution time: 84 ns
• Number of registers: 2
Second architecture:
• Extra number of mux inputs: 8
• Number of CLBs: 180
• Execution time: 96 ns
• Number of registers: 2
44. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Delaying of interconnect optimization after scheduling
• Comparison of our results for an addition tree with methods that restrict the solution space, or do not minimize interconnect simultaneously with scheduling and binding of operations.
[Figure: schedule of the addition tree with inputs i1 to i13 and output "out", and the corresponding architecture built from Adder 1 to Adder 4 and registers R1 to R4.]
• Extra number of mux inputs: 7
• Number of CLBs: 168
• Execution time: 84 ns
• Number of registers: 4
45. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Scheduling and binding for the CDFG with non-pipelined multipliers and no chaining
• The schedule needs 7 control steps, with a clock duration of 150 ns.
[Figure: scheduled CDFG of additions (+) and multiplications (*) over c-steps 1 to 7.]
Clock cycle: 150 ns
Execution time: 7 * 150 = 1050 ns
Resources: 2 adders, 1 multiplier
No chaining
Non-pipelined multipliers
46. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Effect of increasing the resources to three adders and one non-pipelined multiplier on the
scheduling of the CDFG
• Increasing the resources by one adder does not affect the execution time for the CDFG.
[Figure: scheduled CDFG of additions (+) and multiplications (*) over c-steps 1 to 7.]
Clock cycle: 150 ns
Execution time: 7 * 150 = 1050 ns
Resources: 3 adders, 1 multiplier
No chaining
Non-pipelined multipliers
47. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Scheduling and binding for the CDFG with pipelined multipliers and no chaining.
• The schedule needs 8 control steps, with a clock duration of 80 ns.
[Figure: scheduled CDFG of additions (+) and multiplications (*) over c-steps 1 to 8.]
Clock cycle: 80 ns
Execution time: 8 * 80 = 640 ns
Resources: 2 adders, 1 multiplier
No chaining
Pipelined multipliers
48. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Scheduling and binding of the CDFG, using pipelined multiplier and chaining
• The schedule needs 5 control steps, with a clock duration of 90 ns.
[Figure: scheduled CDFG with two multiplications (*) over c-steps 1 to 5.]
Clock cycle: 90 ns
Execution time: 5 * 90 = 450 ns
Resources: 3 adders, 1 multiplier
Pipelined multipliers
2-level chaining allowed
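The four schedules above show why the total execution time, rather than the number of control steps, is used as the yardstick. The sketch below simply tabulates the step counts and clock durations quoted on these slides; the variable and function names are illustrative.

```python
# (description, control steps, clock duration in ns), as quoted on the slides.
configs = [
    ("2 adders, 1 non-pipelined multiplier, no chaining", 7, 150),
    ("3 adders, 1 non-pipelined multiplier, no chaining", 7, 150),
    ("2 adders, 1 pipelined multiplier, no chaining", 8, 80),
    ("3 adders, 1 pipelined multiplier, 2-level chaining", 5, 90),
]

def total_time_ns(steps, clock_ns):
    """Total execution time = number of control steps * clock duration."""
    return steps * clock_ns

totals = [total_time_ns(steps, clock) for _, steps, clock in configs]

# Among the three no-chaining schedules, the one with the most control steps is the fastest.
no_chaining = configs[:3]
fewest_steps = min(no_chaining, key=lambda c: c[1])
shortest_time = min(no_chaining, key=lambda c: total_time_ns(c[1], c[2]))

for (name, steps, clock), total in zip(configs, totals):
    print(f"{name}: {steps} steps x {clock} ns = {total} ns")
```

Ranking the no-chaining schedules by control steps would favor a 7-step schedule at 1050 ns, while ranking by execution time selects the 8-step pipelined schedule at 640 ns.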
49. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized architecture for the scheduling and binding using pipelining and chaining.
[Figure: synthesized datapath with a multiplier (*) and registers R1, R2, R3, and R4.]
50. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Minimization of the Total Execution time (Performance Optimization)
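• The following constraint sets the clock duration during the solution:
$$\delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega, \qquad \forall \psi_{ijk} \in \Psi$$
• The constraint to set the chaining variable $\psi_{MAA}$ is given below:
$$\sum_{p = s - D(op_i) + 1} \ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \ + \sum_{p = s} \ \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \ - \ \psi_{MAA} \le 1$$
$$ALAP(op_i) + D(op_i) - 1 \ge s \ge ASAP(op_k), \qquad \forall s, \ \forall (op_i, op_k) \in \Im_{2S}, \quad (op_i \in N_M) \ \text{and} \ (op_k \in N_A)$$
• The upper and lower limits that exist for the clock duration:
$$\Omega_{min} \le \Omega \le \Omega_{max}$$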
51. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
• The values of the upper and lower bounds are determined as follows:
$$\Omega_{max} = MAX\{\delta(\Psi)\}$$
$$\Omega_{min} = MIN\{\delta(\Psi)\}$$
• If the clock duration is allowed only discrete values:
$$\delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega_{relaxed}, \qquad \forall \psi_{ijk} \in \Psi$$
$$\Omega = \left\lceil \frac{\Omega_{relaxed}}{\Omega_{min}} \right\rceil \cdot \Omega_{min}$$
• $\Omega_{relaxed}$ is a relaxed version of the discrete-valued $\Omega$ that can assume any positive number.
53. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Minimization of the DFG Total Execution Time
• The number of control steps (an integer) can be represented in terms of binary variables:
$$T = \sum_{i=0}^{n-1} 2^i \cdot \beta_i$$
• The part of the objective function that minimizes the total execution time is nonlinear:
$$I_N = \sum_{i=0}^{n-1} \left( 2^i \cdot CLOCK \right) \cdot \beta_i$$
• The objective function can be conceptually presented as:
$$I = I_N + I_{L1}$$
54. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Minimization of the DFG Total Execution Time
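• Linearization of the nonlinear part of the objective function:
$$I_{L2} = \sum_{i=0}^{n-1} \left( 2^i \cdot CLKMIN \cdot \beta_i + \Theta_i \right)$$
• Linearization of the nonlinear part of the objective function (cont'd):
$$\Theta_i \ge 2^i \cdot CLOCK - 2^i \cdot CLKMIN \cdot \beta_i - 2^i \cdot CLKMAX \cdot (1 - \beta_i), \qquad i = 0, \dots, n-1$$
$$\Theta_i \ge 0, \qquad i = 0, \dots, n-1$$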
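The linearization can be checked numerically: when each auxiliary variable takes the smallest value its two lower bounds allow (which is what a minimizing objective drives it to), the linear expression reproduces the product of the step count and the clock duration exactly. The sketch below is a brute-force check under assumptions: the clock range of 80 to 150 ns, the four-bit step encoding, and the function names are illustrative choices, not values prescribed by the formulation.

```python
from itertools import product

def linearized_time(beta, clock, clk_min, clk_max):
    """I_L2 with each Theta_i at its smallest feasible value."""
    total = 0.0
    for i, b in enumerate(beta):
        w = 2 ** i
        # Theta_i >= 2^i*CLOCK - 2^i*CLKMIN*beta_i - 2^i*CLKMAX*(1 - beta_i), Theta_i >= 0
        theta = max(0.0, w * clock - w * clk_min * b - w * clk_max * (1 - b))
        total += w * clk_min * b + theta
    return total

def nonlinear_time(beta, clock):
    """I_N = T * CLOCK with T = sum_i 2^i * beta_i."""
    steps = sum(2 ** i * b for i, b in enumerate(beta))
    return steps * clock

CLK_MIN, CLK_MAX = 80.0, 150.0   # hypothetical clock bounds in ns
all_match = all(
    abs(linearized_time(beta, clock, CLK_MIN, CLK_MAX) - nonlinear_time(beta, clock)) < 1e-9
    for beta in product((0, 1), repeat=4)
    for clock in (80.0, 90.0, 120.0, 150.0)
)
print(all_match)
```

The check relies on CLOCK lying between CLKMIN and CLKMAX: for a zero bit the first lower bound is non-positive so the auxiliary variable drops to zero, and for a one bit it makes up the difference between CLKMIN and CLOCK.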
55. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Minimization of the DFG Total Execution Time
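• Linearization does not increase the complexity of the formulation:
$$\Theta_i \ge \begin{cases} 2^i \cdot (CLOCK - CLKMAX), & \text{if } \beta_i \text{ is } 0 \quad (\Theta_i \ge 0) \\ 2^i \cdot (CLOCK - CLKMIN), & \text{if } \beta_i \text{ is } 1 \quad (\Theta_i \ge 0) \end{cases}$$
$$I_{L2} = \sum_{i,\ \beta_i = 1} 2^i \cdot CLOCK$$
• where $n$ is the number of discrete variables added to the formulation:
$$n = \log(T) / \log(2)$$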
56. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Tree Height Reduction
• The performance of the architecture is bounded by the length of the critical path.
Before THR | After THR | Delay estimation
(A + B) + C + D | (A + B) + (C + D) | $\delta(\psi_{AAA}) = \delta(\psi_{AA})$
(A + B) - C + D | (A + B) + (D - C) | $\delta(\psi_{ASA}) = MAX\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A + B) + C - D | (A + B) + (C - D) | $\delta(\psi_{AAS}) = MAX\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
57. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
(A - B) + C + D | (A - B) + (C + D) | $\delta(\psi_{SAA}) = MAX\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A - B) - C + D | (A - B) + (D - C) | $\delta(\psi_{SSA}) = MAX\{\delta(\psi_{SA}), \delta(\psi_{SA})\}$
58. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
(A - B) - C - D | (A - B) - (C + D) | $\delta(\psi_{SSS}) = MAX\{\delta(\psi_{SS}), \delta(\psi_{AS})\}$
(A * B) + C + D | (A * B) + (C + D) | $\delta(\psi_{MAA}) = MAX\{\delta(\psi_{MA}), \delta(\psi_{AA})\}$
59. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
(A * B) - C + D | (A * B) + (D - C) | $\delta(\psi_{MSA}) = MAX\{\delta(\psi_{MA}), \delta(\psi_{SA})\}$
(A * B) - C - D | (A * B) - (C + D) | $\delta(\psi_{MSS}) = MAX\{\delta(\psi_{MS}), \delta(\psi_{AS})\}$
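Each tree height reduction rewrite listed on these slides replaces a three-operation chain by a two-level tree that computes the same value. The sketch below checks that arithmetic equivalence on random integers. It is only an illustration: the dictionary keys reuse the subscripts of the delay terms, the helper name is hypothetical, and unbounded Python integers are used, so finite word-length effects in hardware are outside its scope.

```python
import random

# name -> (expression before THR, expression after THR)
REWRITES = {
    "AAA": (lambda a, b, c, d: (a + b) + c + d, lambda a, b, c, d: (a + b) + (c + d)),
    "ASA": (lambda a, b, c, d: (a + b) - c + d, lambda a, b, c, d: (a + b) + (d - c)),
    "AAS": (lambda a, b, c, d: (a + b) + c - d, lambda a, b, c, d: (a + b) + (c - d)),
    "SAA": (lambda a, b, c, d: (a - b) + c + d, lambda a, b, c, d: (a - b) + (c + d)),
    "SSA": (lambda a, b, c, d: (a - b) - c + d, lambda a, b, c, d: (a - b) + (d - c)),
    "SSS": (lambda a, b, c, d: (a - b) - c - d, lambda a, b, c, d: (a - b) - (c + d)),
    "MAA": (lambda a, b, c, d: (a * b) + c + d, lambda a, b, c, d: (a * b) + (c + d)),
    "MSA": (lambda a, b, c, d: (a * b) - c + d, lambda a, b, c, d: (a * b) + (d - c)),
    "MSS": (lambda a, b, c, d: (a * b) - c - d, lambda a, b, c, d: (a * b) - (c + d)),
}

def first_mismatch(trials=1000, seed=0):
    """Return the name of the first rewrite that changes the value, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c, d = (rng.randint(-1000, 1000) for _ in range(4))
        for name, (before, after) in REWRITES.items():
            if before(a, b, c, d) != after(a, b, c, d):
                return name
    return None

print(first_mismatch())
```

After the rewrite, the second-level operation and the first operation no longer depend on each other, which is why the delay estimate becomes a maximum over two-operation chains.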
61. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Hasse Graph for scheduling with n-level chaining
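[Figure: Hasse graph with the operations op = 1, 2, 3, ..., n-1, n on one axis (labelled α1, α2, ..., αn-2, αn-1, αn) and the control steps cstep, s = 1, 2, ..., 7, ..., n+1 on the other. Legend: assignment edges and timing edges.]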
62. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Topological Sorting of the Hasse Graph can be modified to be used for Coloring
the Graph
• Nodes are numbered according to topological sorting.
[Figure: Hasse graph with op = 1, 2, 3 on one axis and cstep, s = 1, ..., 6 on the other; the nodes carry the topological-sort numbers 1 to 16.]
63. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Two different Colorings for the Hasse Graph for scheduling with 2-level chaining
• Nodes are numbered according to the corresponding color.
[Figure: two copies of the Hasse graph with op = 1, 2, 3 on one axis and cstep, s = 1, ..., 6 on the other, each showing a different coloring with colors 0 to 5.]
64. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Topological sorting of the Hasse graph can be modified to color the graph
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 6), with nodes numbered 1 to 22 in topological order.]
65. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Two different colorings of the Hasse graph for scheduling with 3-level chaining
• The graph has 22 nodes and 43 edges, so the number of maximal cliques cannot be greater than 22 (or even equal to 22).
• The transitive closure of the graph has 115 edges.
[Figure: two copies of the Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 6), each with nodes labeled by a color from 0 to 5.]
66. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
An Odd-Hole graph and a Wheel graph
[Figure: an odd-hole graph on nodes 1 to 5 and a wheel graph on nodes 1 to 6.]
An Odd-Hole Graph: $x_1 + x_2 + x_3 + x_4 + x_5 \le 2$
A Wheel Graph: $x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 \le 2$
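Both inequalities can be checked by brute force over all stable sets. The sketch below assumes the usual structure, which the extracted figure no longer shows: the rim is the 5-cycle 1-2-3-4-5-1 and node 6 is a hub adjacent to every rim node. The helper names are made up for the example.

```python
from itertools import combinations


def stable_sets(n, edges):
    """Yield every stable (independent) set of a graph on nodes 1..n."""
    edge_set = {frozenset(e) for e in edges}
    for r in range(n + 1):
        for subset in combinations(range(1, n + 1), r):
            if not any(frozenset(p) in edge_set for p in combinations(subset, 2)):
                yield set(subset)


def lhs(nodes, weights):
    """Left-hand side of the inequality; unlisted nodes have coefficient 1."""
    return sum(weights.get(v, 1) for v in nodes)


# Assumed structure: 5-cycle rim, plus hub node 6 for the wheel.
rim = [(i, i % 5 + 1) for i in range(1, 6)]
wheel = rim + [(6, i) for i in range(1, 6)]

odd_hole_max = max(lhs(s, {}) for s in stable_sets(5, rim))
wheel_max = max(lhs(s, {6: 2}) for s in stable_sets(6, wheel))
```

Under these assumptions the left-hand side never exceeds 2 on any stable set, so both inequalities are valid. They are also useful cuts: the fractional point with every rim variable at 1/2 satisfies each pairwise edge constraint yet sums to 2.5 on the odd hole.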
67. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Extended Wheel Graph
[Figure: an extended-wheel graph on nodes 1 to 7.]
An Extended-Wheel Graph: $x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 + 2 \cdot x_7 \le 2$
69. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
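An Extended Wheel Graph Constraint Class for 3-level chaining.
Example Constraint Class: constraint $(\alpha_2 \beta \alpha_1 \beta)$ for a 3-level chain
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 5).]
Example, for s = 3:
$$X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2$$
General form:
$$\sum_{a=1}^{N_{t_i}} X_{op_i,a,(s-D(op_i)+1)} + \sum_{p=s-1}^{s} \sum_{a=1}^{N_{t_k}} X_{op_k,a,p} + \sum_{p=s-1}^{s} \sum_{a=1}^{N_{t_l}} X_{op_l,a,p} + 2 \cdot \sum_{p=s-D(op_i)+2}^{\mathrm{ALAP}(op_i)} \sum_{a=1}^{N_{t_i}} X_{op_i,a,p} \le 2$$
$$\forall s \in (\mathrm{Range}(op_i) \cap \mathrm{Range}(op_l)), \quad (s - D(op_i) + 2) \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_k) \in \Im_{2S}, \quad \forall (op_i, op_l) \in \Im_{3S}$$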
71. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Exploring the Hasse diagram for schedules with 2-level chaining.
[Figure: state diagram beginning at "start", with transitions labeled α1, α2, α1/α2, and β that lead to constraint classes 1 to 5.]
72. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
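1- Clique Constraint Class for 2-level chaining: $\overset{\frown}{\beta}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 3 and 1 to 6).]
• The constraint class 1, $\overset{\frown}{\beta}$:
$$\sum_{s \in \mathrm{Range}(op)} \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \quad \forall\, op \in DFG$$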
73. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
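2-Clique Constraint Class for 2-level chaining: $\beta_s \alpha_2 \beta_s$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 3 and 1 to 6).]
• The constraint class 2, $\beta_s \alpha_2 \beta_s$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_k)}^{s} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in (\mathrm{Range}(op_i) \cap \mathrm{Range}(op_k)), \quad \forall (op_i, op_k) \in \Im_{2S}$$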
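To make the index ranges of the class 2 inequality concrete, the following sketch enumerates the variables on its left-hand side for one pair of operations and one control step s. The function names, the dictionary-based inputs, and the small example are all hypothetical; the sketch only mirrors the summation limits of the class 2 inequality.

```python
def class2_terms(op_i, op_k, s, asap, alap, delay, units):
    """List the variables (op, n, p) on the left-hand side of class 2."""
    terms = []
    # op_i started at step s - D(op_i) + 1 or later (up to ALAP).
    for p in range(s - delay[op_i] + 1, alap[op_i] + 1):
        terms += [(op_i, n, p) for n in range(1, units[op_i] + 1)]
    # op_k started at step s or earlier (from ASAP).
    for p in range(asap[op_k], s + 1):
        terms += [(op_k, n, p) for n in range(1, units[op_k] + 1)]
    return terms


def satisfied(terms, schedule):
    """schedule maps op -> (n, p); at most one listed variable may be 1."""
    return sum(1 for (op, n, p) in terms if schedule.get(op) == (n, p)) <= 1


# Illustrative data: single-cycle operations, one functional unit each.
asap = {"i": 1, "k": 2}
alap = {"i": 3, "k": 4}
delay = {"i": 1, "k": 1}
units = {"i": 1, "k": 1}
terms = class2_terms("i", "k", 2, asap, alap, delay, units)
```

For s = 2 the list contains the variables for op_i at steps 2 and 3 and for op_k at step 2, so a schedule that places both operations in step 2 violates the inequality, while placing op_i in step 1 and op_k in step 2 satisfies it.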
74. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
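3-Clique Constraint Class for 2-level chaining: $\beta_s \alpha_1 \beta_{(s-1)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 3 and 1 to 6).]
• The constraint class 3, $\beta_s \alpha_1 \beta_{(s-1)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \quad \forall (op_i, op_j) \in \Im_{1}$$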
75. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
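4-Clique Constraint Class for 2-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$
• The example illustrated in the figure for class 4 is for the case of $i' = 1$.
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 3 and 1 to 6).]
• The constraint class 4, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \quad \forall (op_i, op_j) \in \Im_{1}$$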
76. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
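5-Clique Constraint Class for 2-level chaining: $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 3 and 1 to 6).]
• The constraint class 5, $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j) \in \Im_{1S}, \quad \forall (op_j, op_k) \in \Im_{2S}$$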
77. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Exploring the Hasse diagram for schedules with 3-level chaining.
[Figure: state diagram beginning at "start", with transitions labeled α1, α2, α3, α1/α2/α3, and β that lead to constraint classes 1 to 14.]
78. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
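1- Clique Constraint Class for 3-level chaining: $\overset{\frown}{\beta}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 1, $\overset{\frown}{\beta}$:
$$\sum_{s \in \mathrm{Range}(op)} \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \quad \forall\, op \in DFG$$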
79. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
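2- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_3 \beta_s$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 2 for 3-level chaining, $\beta_s \alpha_3 \beta_s$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{s} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in (\mathrm{Range}(op_i) \cap \mathrm{Range}(op_l)), \quad \forall (op_i, op_l) \in \Im_{3S}$$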
80. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
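3- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_2 \beta_{(s-1)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 3, $\beta_s \alpha_2 \beta_{(s-1)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-1} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_k), \quad \forall (op_i, op_k) \in \Im_{2S}$$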
81. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
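4- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_2 \tilde{\beta}^{\,i'}_{(s-2),k,l} \alpha_1 \beta_{(s-2-i')}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 4, $\beta_s \alpha_2 \tilde{\beta}^{\,i'}_{(s-2),k,l} \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{(s-1)} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{(s-2-i')} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$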
82. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
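5- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_2 \alpha_1 \beta_{(s-2)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 5, $\beta_s \alpha_2 \alpha_1 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-2} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$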
83. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
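6- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \beta_{(s-1)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 6, $\beta_s \alpha_1 \beta_{(s-1)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=\mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-1) \in \mathrm{Range}(op_j), \quad \forall (op_i, op_j) \in \Im_{1}$$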
84. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
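7- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_2 \beta_{(s-2-i')}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 7, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_2 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{(s-2-i')} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_j, op_l) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$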
85. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
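8- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 8, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-2-i'} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2-i') \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}$$
$$\forall i' : 1 \le i' \le s - 2 - \mathrm{ASAP}(op_k)$$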
86. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
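9- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'-i'')}$
• The example illustrated in the figure for class 9 is for the case of both $i' = 1$ and $i'' = 1$.
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 9, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,l} \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'-i'')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{p=s-2-i'-i''}^{s-2-i'} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3-i'-i''} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i'-i'') \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : 1 \le i' \le s - 4 - \mathrm{ASAP}(op_l) \quad \text{and} \quad \forall i'' : 1 \le i'' \le s - 3 - \mathrm{ASAP}(op_l) - i'$$
$$\max(i' + i'') = s - 3 - \mathrm{ASAP}(op_l)$$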
87. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
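10- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \alpha_1 \beta_{(s-2-i')}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 10, $\beta_s \alpha_1 \tilde{\beta}^{\,i'}_{(s-2),j,k} \alpha_1 \alpha_1 \beta_{(s-2-i')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2-i')} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3-i'} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i') \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$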
88. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
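11- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 11, $\beta_s \alpha_1 \alpha_1 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_k)}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}$$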
89. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
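12- Clique Constraint Class for 3-level chaining: $\beta_s \alpha_1 \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'')}$
• The example illustrated in the figure for class 12 is for the case of $i' = 1$.
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 12, $\beta_s \alpha_1 \alpha_1 \tilde{\beta}^{\,i''}_{(s-2),k,l} \alpha_1 \beta_{(s-i'')}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{p=(s-2-i')}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3-i'} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3-i') \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$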
90. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
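13- Clique Constraint Class for 3-level chaining formulation: $\beta_s \alpha_1 \alpha_1 \alpha_1 \beta_{(s-3)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 13, $\beta_s \alpha_1 \alpha_1 \alpha_1 \beta_{(s-3)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-3} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-3) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_i, op_k) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$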
91. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
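14- Clique Constraint Class for 3-level chaining formulation: $\beta_s \alpha_1 \alpha_2 \beta_{(s-2)}$
[Figure: Hasse graph with axes op and cstep, s (ticks 1 to 4 and 1 to 8) and labels α1, α2, α3.]
• The constraint class 14, $\beta_s \alpha_1 \alpha_2 \beta_{(s-2)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} + \sum_{p=\mathrm{ASAP}(op_l)}^{s-2} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \quad (op_j, op_l) \in \Im_{2S}, \quad (op_i, op_l) \in \Im_{3S}$$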
92. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Maximal Clique Constraints are stronger than the Extended Wheel Constraints
[Figure: two Hasse graphs with axes op and cstep, s (ticks 1 to 4 and 1 to 5 or 6) and labels α1, α2, α3, one for each constraint.]
• The Extended Wheel Constraint:
$$X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2$$
• The combined maximal cliques constraint:
$$X_{1,3} + X_{1,4} + X_{1,5} + X_{3,1} + X_{3,2} + X_{3,3} + X_{3,4} + X_{3,5} + X_{3,6} + X_{4,2} + X_{4,3} \le 2$$
93. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the AR filter
• To reach a first integer solution, the maximal clique formulation takes more time.
Metric | Logical formulation | Maximal cliques formulation
Number of iterations (primal) | 1,200 | 1,480
Number of iterations (integer) | 1,420 | 1,706
Number of nodes of branch and bound | 54 | 103
CPU time in sec (primal) | 12 (Ultra Sparc 2) | 24 (Ultra Sparc 2)
CPU time in sec (integer) | 19 (Ultra Sparc 2) | 38 (Ultra Sparc 2)
Total CPU time in sec | 31 | 62
Optimality condition | first integer | first integer
Number of discrete variables in the formulation | 536 | 536
Number of single inequalities | 7,363 | 9,256 (25.7% increase)
Termination condition | first integer solution | first integer solution
94. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the AR filter
• The maximal clique formulation achieves an optimal solution within tolerance long before the logical formulation.
Metric | Logical formulation | Maximal cliques formulation
Number of iterations (primal) | 1,200 | 1,480
Number of iterations (integer) | 8.45E5 | 14,577
Number of nodes of branch and bound | 42,025 | 596
CPU time in sec (primal) | 12 (Ultra Sparc 2) | 24 (Ultra Sparc 2)
CPU time in sec (integer) | 14,491 (Ultra Sparc 2) | 221 (Ultra Sparc 2)
Total CPU time in sec | 14,503 | 245
Optimality condition | 0.07 (not achieved) | 0.07 (achieved)
Number of discrete variables in the formulation | 536 | 536
Number of single inequalities | 7,363 | 9,256
Termination condition | after 5 integer solutions | achieved optimal result within tolerance
95. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the EWF benchmark
• The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.
Metric | Logical formulation | Maximal cliques formulation
Number of iterations (primal) | 3,192 | 3,659
Number of iterations (integer) | 56,697 | 4,668
Number of nodes in branch and bound | 1,827 | 190
CPU time in sec (primal) | 86 (Ultra Sparc 2) | 150 (Ultra Sparc 2)
CPU time in sec (integer) | 5.4E3 (Ultra Sparc 2) | 518 (Ultra Sparc 2)
Total CPU time in sec | 5.48E3 | 668
Optimality condition | 0.1 (not achieved) | 0.1 (achieved)
Number of discrete variables in the formulation | 940 | 940
Number of single inequalities | 11,154 | 16,195 (45.2% increase)
Termination condition | after 5 integer solutions | achieved optimal result within tolerance
96. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Comparing the logical formulation vs. the maximal clique formulation for the DCT benchmark
• The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.
Metric | Logical formulation | Maximal cliques formulation
Number of iterations (primal) | 3,288 (Ultra Sparc 2) | 4,623 (Ultra Sparc 2)
Number of iterations (integer) | 23 (Ultra Sparc 2) | (Ultra Sparc 2)
Number of nodes in branch and bound | 1E4 | 168
CPU time in sec (primal) | 83 | 312
CPU time in sec (integer) | 2E5 | 2,575
Total CPU time in sec | 2E5 | 2,887
Optimality condition | 0.15 (not achieved) | 0.15 (achieved)
Number of discrete variables in the formulation | 1,066 | 1,066
Number of single inequalities | 13,623 | 18,979 (39.3%)
Termination condition | after 5 integer solutions | achieved optimal result within tolerance
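The total CPU times in the three "within tolerance" comparisons can be turned into ratios with a few lines of Python. The figures are copied from the tables; the dictionary layout and variable names are just for illustration. Since the logical formulation was stopped after 5 integer solutions without reaching the tolerance, each ratio should be read as a lower bound on the actual speedup.

```python
# Total CPU time in seconds: (logical formulation, maximal clique formulation).
total_cpu = {
    "AR filter": (14_503, 245),
    "EWF": (5.48e3, 668),
    "DCT": (2e5, 2_887),
}

# Ratio of logical-formulation time to maximal-clique time per benchmark.
speedup = {
    name: logical / clique for name, (logical, clique) in total_cpu.items()
}

for name, ratio in speedup.items():
    print(f"{name}: at least {ratio:.1f}x faster with maximal cliques")
```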
101. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The fifth-order Elliptic Wave Filter benchmark
• Consists of 34 operations (8 multiplications and 26 additions).
[Figure: signal flow graph of the filter from input to output, built from adders (+), multipliers (*), and delay elements (Z).]
102. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The DFG of the EWF benchmark
[Figure: data flow graph of the EWF benchmark laid out over control steps 1 to 17, from IN to OUT, with signals labeled a to i and numbered nodes and edges.]
103. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Effect of Chaining and Pipelining FUs on Datapath Performance.
[Figure: Exploration of the design space for the EWF benchmark. Axes: cost (number of CLBs) versus Totexec, Λ (ns). Design points are numbered 1 to 11, with labels including "1- 1+,1*", "2-a-Bus-ours", "2-b-Best-others", "2-pipe", "3-Non-Pipe", and "4-pipe".]
105. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Final FPGA Implementation on Xilinx 4000 series. †
† Using XACT 5.0 tools, the best area architecture would fit into an x4006 chip and require about 200 CLBs.
Component | Our Best Area | Our Best Performance | Best in Literature (Simulated Evolution)
Controller | 33 | 27 | 30
Register_File | 10 | Not used | Not used
ROM | 4 | 4 | 4
Multiplier | 110 | 110 | 110
Adder | 10 | 10 | 10
4/3/2 to 1 mux | 16/8 | 16/16/8 | 16/16/8
Register / Tristate | 8/1 | 8/1 | 8/1
7/6/5 to 1 mux | Not used | 36/26/25 | 36/26/25
Total # CLBs | 323 | 391 | 361
Total execution time (nsec) | 1275 | 731 | 1275
108. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized architecture for the AR filter
• Resources: 2 multipliers (2-stage pipelined) and 2 adders; uses 3 registers and 12 multiplexer inputs.
[Figure: datapath with registers R1, R2, R3, adders A1, A2, and multipliers M1, M2.]
109. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The scheduling and binding for the AR filter, using 4-stage pipelined multipliers
[Figure: schedule over control steps 1 to 13 showing operations 1 to 28.]
111. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Fast Discrete Cosine Transform.
Metric | Ours | SODAS-DSP | MARS
Resources | 2*, 2+, 2- | 2*, 2+, 2- | 2*, 2+, 2-
# mux inputs | 37 | 66 | NA
# registers | 13 | 47 | NA
Clock (ns) | 60 | 100 | NA
# csteps | 10 | 12, dii=8 (a) | 8 (b)
Totexec (ns) | 600 | 1200 | NA
Throughput (c) (MHz) | 1.67 | 1.25 | NA
a. dii is the data initiation rate for the pipelined architecture used in SODAS-DSP.
b. MARS reports 8 cycles. No other details of the scheduling are available.
c. Throughput indicates the highest input-sampling rate of the architecture.
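The Totexec and throughput rows follow from the clock period and the cycle counts. A minimal sketch, assuming that a non-pipelined design accepts a new sample once per full schedule and a pipelined one once per data initiation interval; the function names are hypothetical:

```python
def totexec_ns(csteps, clock_ns):
    """Total execution time: number of control steps times the clock period."""
    return csteps * clock_ns


def throughput_mhz(initiation_csteps, clock_ns):
    """Highest input-sampling rate when a new sample is accepted every
    initiation_csteps cycles (1 / ns scaled to MHz)."""
    return 1e3 / (initiation_csteps * clock_ns)


# Figures from the table: "Ours" (10 csteps at 60 ns) and
# SODAS-DSP (12 csteps at 100 ns, data initiation interval of 8).
ours = (totexec_ns(10, 60), throughput_mhz(10, 60))
sodas = (totexec_ns(12, 100), throughput_mhz(8, 100))
```

Under that assumption the computed values agree with the table: 600 ns and about 1.67 MHz for the first column, 1200 ns and 1.25 MHz for SODAS-DSP.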
112. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Architecture for the Fast Discrete Cosine Transform benchmark.
[Figure: datapath with functional units A1, A2, M1, M2, S1, S2.]
• Resources: 2 multipliers, 2 adders, and 2 subtracters. Uses 13 registers and 37 mux inputs.
114. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Architecture for the Discrete Cosine Transform.
• Resources: 2 multipliers (4-stage pipelined) and 4 adders. Uses 11 registers and 28 mux inputs.
[Figure: datapath with functional units M1, M2, A1, A2, A3, A4.]
115. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chaining paths for the Discrete Cosine Transform
[Figure: chaining paths among the functional units M1, M2, A1, A2, A3, A4.]
116. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Chaining interconnections modeled for false paths detection
[Figure: three copies of the functional units M1, M2, A1, A2, A3, A4, labeled V1, V2, V3.]
117. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Bus architecture of the DCT benchmark
• Resources: 1 multiplier (4-stage pipelined) and 3 adders/subtracters. Uses 9 registers, 18 mux inputs, and 1 bus.
[Figure: datapath with Bus1, adders A1, A2, A3, multiplier M, ROM, a register file, and registers R1 to R9.]
118. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Random topology architecture for the DCT benchmark
• Resources: 1 multiplier (4 pipeline stages) and 3 adders/subtracters. Uses 10 registers and 24 mux inputs.
[Figure: datapath with adders A1, A2, A3, ROM, and registers R1 to R10.]
119. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Synthesized Random topology architecture for the DCT benchmark
• Resources: 1 multiplier (4 pipeline stages) and 3 adders/subtracters. Uses 12 registers and 20 mux inputs.
[Figure: datapath with adders A1, A2, A3, multiplier M, ROM, and registers R1 to R12.]
120. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Discrete Cosine Transform Benchmark
Metric | Ours | PSGA_Syn [69] | Tool [23] Chaudhuri/Walker | SALSA [34] (Chain) | SALSA [34]
Resources | 2*, 4+ | 3*, 3+ | 3*, 4+ | 2*, 4+ | 2*, 4+
# mux inputs | 28 | NA | NA | NA | 30
# registers | 11 | 14 | NA | 15 | 13
Clock (ns) | 45 | 120 (a) | 65 (b) | 135 (c) | 65 (d)
# csteps | 11 | 18 | 9 | 8 | 11
Totexec (ns) | 495 | 2160 | 585 | 1080 | 715
a. This tool uses neither chaining nor pipelining for the DCT.
b. The tool described in [23] does not use chaining.
c. The level of chaining is not reported in [34].
d. SALSA [34] does not determine the clock duration of the total execution. However, we have used the same library for comparison.
121. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
The Discrete Cosine Transform Benchmark
Metric | Ours | PSGA_Syn, tool in [69] | SALSA (Chain) [34] | OSTA no-Chain [70]
Resources | 1*, 3+ | 3*, 3+ | 2*, 4+ | 3*, 6+
# mux inputs | 24 | NA | NA | 38
# registers | 10 | 14 | 15 | 24
Clock (ns) | 45 | 120 | 130 | 120
# csteps, T | 19 | 18 | 8 | 9
Totexec (ns) | 855 | 2160 | 1080 | 1080
123. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
CONCLUSIONS
• Our architectural model is suitable for a broad base of technology implementations, specifically FPGAs, including bus/SRAM-based ones.
• Introduced optimization criteria for ILP solvers for datapath synthesis:
  - Our model and criteria can be used with other solvers (e.g. stochastic).
• The approach:
  - Scheduling with chaining and deep pipelining of FUs while minimizing “Structural Complexity”.
  - Optimization of the total execution time of the architecture, with clock cycle determination.
  - Followed by bus assignment if it is supported by the FPGA.
• This approach has demonstrated that a discriminating search of a larger architectural space can produce:
  - Regular architectures with minimum interconnections, low resources, and fast throughput.
124. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Contribution of this research
• Several interconnect minimization measures were incorporated in the formulation, which significantly improve the quality of the resulting synthesized architectures.
• This was demonstrated for different benchmarks, where the numbers of registers and multiplexer inputs were consistently smaller in architectures synthesized with this methodology than in previously published results. This is an important issue in developing a tool geared toward technologies with scarce interconnect resources, such as FPGAs.
• For the first time, an Integer Linear Programming (ILP) formulation that includes a non-tabular, non-restricted model of the system clock duration was developed. This has proved to be a significant step in modeling the total execution time of the architecture and, as a result, in successful performance minimization.
• The architectural synthesis scheduling and binding problem was formulated as a performance optimization problem rather than the mere minimization of the number of control steps. A theoretical linearization technique for the objective function of this formulation was presented. It was demonstrated that this linearization technique has negligible impact on the size of the problem.
125. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Contribution of this research
• Verification of the validity of the overall methodology by integrating this tool with logic synthesis and back-end tools.
• The development of the set of valid inequalities for the scheduling and binding problem. The identification and derivation of both the extended wheel graph inequalities and the maximal clique inequalities. For the first time, this guarantees the tightest formulation for schedules with n levels of chaining and multicycled/pipelined resources.
• An algorithmic approach was developed for generating the minimum set of inequality classes necessary for the general scheduling and binding problem. This algorithm explores a Hasse graph representing the scheduling problem and classifies all the maximal paths into maximal path classes. These classes can be incorporated into the automatic generation of the maximal clique constraints, which represent the tightest description of the scheduling and binding problem with n-level chaining.
131. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Inputs: CDFG, DFG, Tech
STEP-1: Scheduling Bounds
- Heuristics to determine the lower bound on the number of cycles.
- Heuristics to tighten the ASAP/ALAP values under the given resource constraints.
STEP-2: C++: Constraint Generation for ILP
- DFG exploration.
- Dynamic set generation for chaining.
- ILP constraint generation.
STEP-3: ILP: Random Topology
- Scheduling and binding.
- Chaining of operations.
- Interconnect minimization.
- Clock cycle minimization + FU pipelining choice.
- Minimization of the number of cycles, OR minimization of the total execution time (i.e. throughput maximization).
STEP-4: ILP: Bus Insertion
- Bus transfer scheduling.
- Bus allocation.
- Interconnect minimization.
- Bus loading minimization.
STEP-LAST: Register Allocation
- Data storage assignment.
- Storage minimization.
- Bus loading minimization.
Output:
- VHDL generation of the datapath and the controller.
133. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Flow of the Back-End Tools
• Stage 2 uses Synopsys tools (logic synthesis and FPGA mapping), and stage 3 uses Xilinx XACT tools for PPR.
138. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
ASAP Scheduling
Input: Data Flow Graph G
Output: node array of int, schedule_I, representing the as-soon-as-possible scheduling of the nodes of the DFG for a maximum chaining level “Max_Chain_Length”.
ASAP {
  1- G.for_all_nodes(v) {
       // Source nodes start in step 1; all other nodes wait in the set S.
       if (input_degree(v) = 0)
         { schedule_I(v) = 1; }
       else
         { schedule_I(v) = 0; insert v into the node set S; }
     }
  2- While (node set S ≠ Φ)
     {
       G.for_all_nodes(v) {
         if ((v ∈ S) and (all_pred_scheduled(G, v, schedule_I)))
         {
           G.all_input_edges(e, v) {
             w = G.source(e);
             // Both operations are single-cycle: chain if the level allows it.
             if ((G.type(w) ≠ “multicycle”) and (G.type(v) ≠ “multicycle”))
               if (Ch_Level_ASAP(w) ≤ Max_Chain_Length)
                 { temp_schedule = schedule_I(w); }
               else
                 { temp_schedule = schedule_I(w) + delay(v); }
139. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
             // Multicycle predecessor: a single-cycle successor may share its last cycle.
             if (G.type(w) = “multicycle”)
             {
               if (G.type(v) ≠ “multicycle”)
                 { temp_schedule = schedule_I(w) + delay(w) - 1; }
               else
                 { temp_schedule = schedule_I(w) + delay(w); }
             }
             // Keep the latest start step required by any predecessor.
             if (temp_schedule > schedule_I(v))
               { schedule_I(v) = temp_schedule; }
           }
  3-       Adj_Ch_Level_ASAP(G, v, schedule_I, Ch_Level_ASAP);
  4-       delete node v from the node set S;
} } } }
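A runnable Python sketch of ASAP scheduling with a chaining limit is given below. It is a simplified illustration rather than a transcription of the pseudocode: the function name and the `preds`/`delay` dictionaries are hypothetical, an operation with a delay of 1 is treated as single-cycle and chainable, and, unlike the pseudocode, a single-cycle successor is never overlapped with the last cycle of a multicycle predecessor.

```python
from graphlib import TopologicalSorter


def asap_with_chaining(preds, delay, max_chain):
    """Return (start, level): ASAP start step and chaining level per operation.

    preds maps an operation to its predecessors; delay gives csteps per op.
    Simplifying assumption: only single-cycle ops (delay == 1) may chain,
    and at most max_chain ops may be chained inside one control step.
    """
    start, level = {}, {}
    for v in TopologicalSorter(preds).static_order():
        s, lvl = 1, 1  # an op with no predecessors starts in step 1
        for w in preds.get(v, ()):
            can_chain = delay[w] == 1 and delay[v] == 1 and level[w] < max_chain
            if can_chain:
                cand, cand_lvl = start[w], level[w] + 1   # same step as w
            else:
                cand, cand_lvl = start[w] + delay[w], 1   # step after w ends
            # Keep the latest step; on ties keep the deeper chain level.
            if cand > s or (cand == s and cand_lvl > lvl):
                s, lvl = cand, cand_lvl
        start[v], level[v] = s, lvl
    return start, level


# A chain of four single-cycle additions: a -> b -> c -> d.
preds = {"b": ["a"], "c": ["b"], "d": ["c"]}
delay = {"a": 1, "b": 1, "c": 1, "d": 1}
start2, level2 = asap_with_chaining(preds, delay, max_chain=2)
start1, _ = asap_with_chaining(preds, delay, max_chain=1)
```

With a chaining limit of 2, the four additions fit into two control steps (a and b in step 1, c and d in step 2); with a limit of 1 they need four steps.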
140. Architectural Synthesis of DSP Structured datapaths: Shereef B. M. Shehata
Adjust Chaining level of a node
Input: Data Flow Graph G, node v, the node array schedule_I representing the current schedule, and the node array Ch_Level_ASAP representing the current chaining level.
Output: Adjusted version of Ch_Level_ASAP for node v, according to the current schedule schedule_I.
Adj_Ch_Level_ASAP {
  G.all_input_edges(e, v) {
    w = G.source(e);
    // v is chained to a single-cycle predecessor scheduled in the same step.
    if ((G.type(w) ≠ “multicycle”) and (schedule_I(v) = schedule_I(w))
        and (Ch_Level_ASAP(w) < Max_Chain_Length)
        and (Ch_Level_ASAP(v) < Ch_Level_ASAP(w) + 1))
      { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
    // v is chained to the last cycle of a multicycle predecessor.
    if ((G.type(w) = “multicycle”) and (G.type(v) ≠ “multicycle”)
        and (schedule_I(v) = schedule_I(w) + mul_delay - 1)
        and (Ch_Level_ASAP(v) ≤ 2))
      { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
  }
}