This document describes a Malleable Hardware Accelerator (MAHA) for data-intensive applications. MAHA implements a reconfigurable computing fabric within off-chip memory, enabling computation to take place inside the memory itself. It presents the modifications made to a NAND flash memory architecture to realize the MAHA framework, including the addition of control logic and custom datapaths to memory blocks, and describes the software architecture for mapping applications to MAHA, including decomposing and fusing operations. Evaluation results quantify MAHA's area and energy overhead and the performance/energy improvements obtained when mapping applications to it instead of a general-purpose processor.
2. MAHA: An Energy-Efficient Malleable Hardware Accelerator for Data-Intensive Applications
Grace Abraham
Roll No: 01
VLSI & ES
3. CONTENTS
Dept. of ECE 3
MAHA : Malleable Hardware Accelerator
29/07/2015
• INTRODUCTION
• BACKGROUND AND MOTIVATION
• MAHA - OVERALL APPROACH
• NAND FLASH – A CASE STUDY
• SOFTWARE ARCHITECTURE
• RESULTS
• CONCLUSION
4. INTRODUCTION
• In nanometer technologies, power has emerged as the primary design constraint
• Ever-increasing demand for low power and high performance
• The von Neumann bottleneck (back-and-forth data transfer) is a barrier to performance & energy scaling
• Explicit parallelism is used to improve efficiency
• Energy overhead due to data transfer from off-chip to on-chip memory
Low bandwidth
High latency
High energy
5. • To overcome this, a Malleable Hardware Accelerator is introduced
• MAHA:
Implements a reconfigurable computing fabric in last-level memory
Enables computing within off-chip memory
Fig 1: Von Neumann bottleneck and the proposed MAHA framework
6. • Choice of NAND flash technology for demonstration
• Previous investigations on Processing in memory (PIM)
• MAHA differs from PIM architecture
Achieves on-demand computation through design modifications to the off-chip nonvolatile memory organization
High energy efficiency through parallelism & dynamic customization
• MAHA for data intensive applications
• Area and energy overheads are accurately estimated
• An efficient software flow for mapping applications to MAHA is
presented
7. • The following sections cover:
The von Neumann bottleneck barrier
MAHA and its hardware architecture
Realization with a CMOS-compatible NAND flash memory
Evaluation results for MAHA
8. BACKGROUND & MOTIVATION
• PERFORMANCE BARRIER DUE TO THE VON NEUMANN BOTTLENECK
Off-chip bandwidth scales poorly in comparison to on-chip transistor density
On-chip density is likely to improve by 16× from 2011 to 2022
Off-chip bandwidth is expected to improve by only 40%
The bandwidth available inside the flash array is 4.2×10^5 GB/s; in contrast, the 16-bit flash interface provides only 100 MB/s
• ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS
Managing memory latency and energy is key to achieving energy efficiency
To identify the major hurdles to energy scaling:
o The performance of ten common kernels was simulated
o System-level performance metrics, such as cache hit/miss frequency, were recorded
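The bandwidth gap quoted above can be checked with quick arithmetic (a hedged sketch; the two figures are from this slide, and the unit conversion plus division are the only steps added here):

```python
# Illustrative arithmetic for the internal-vs-interface bandwidth gap.
internal_bw_gb_s = 4.2e5        # 4.2 x 10^5 GB/s inside the flash array
interface_bw_gb_s = 0.1         # 100 MB/s = 0.1 GB/s at the 16-bit interface

ratio = internal_bw_gb_s / interface_bw_gb_s
print(f"internal bandwidth is {ratio:.1e}x the interface bandwidth")
```

The gap is over six orders of magnitude, which is the motivation for moving computation inside the memory rather than widening the interface.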
9. From Table 1:
o 73% of the total energy expended is contributed by accesses to the on-chip instruction & data caches
o 26% is invested in useful computation, including fetch and decode operations
Table 1: Energy breakdown for a conventional processor executing common computational kernels
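As a back-of-the-envelope reading of Table 1: if in-memory computing could eliminate most of the cache-access energy, the residual energy is bounded as sketched below (the 90% elimination fraction is a hypothetical assumption for illustration, not a figure from the paper):

```python
# Hedged bound on energy reduction implied by Table 1's breakdown.
cache_frac = 0.73        # share of energy spent on on-chip I/D cache accesses
eliminated = 0.9         # hypothetical share of cache energy removed (assumption)

new_energy = 1.0 - cache_frac * eliminated
print(f"remaining energy: {new_energy:.3f} of baseline")
```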
10. • MITIGATING THE VON NEUMANN BOTTLENECK THROUGH IN-MEMORY COMPUTING
75% of the energy in a processor is dissipated in data transport
Optimizing the compute model for data-intensive tasks can yield large improvements in energy efficiency
Two implications for the compute model:
o Relocate compute resources closer to the last level of nonvolatile storage
o This minimizes the overhead of data transfer to on-chip execution units
o Replace the conventional software pipeline & caches with a distributed memory infrastructure
o This minimizes memory & interconnect power dissipation
11. MAHA - OVERALL APPROACH
HARDWARE ARCHITECTURE
• MAHA is a hardware reconfigurable framework
• Consists of an array of processing elements (PEs)
• Communication using a hierarchical interconnect architecture
• The target application to be mapped is represented as a control and data flow graph (CDFG)
• The software flow partitions the CDFG into smaller multiple-input, multiple-output tasks
• Tasks are mapped to individual PEs
12. 1) COMPUTE LOGIC
2) INTERCONNECT FABRIC
Each compute block or PE is referred to as a memory logic block (MLB)
A single MLB includes a dense 2D memory array that stores lookup tables and data
A custom datapath with arithmetic units
A local register file for storing temporary outputs from memory
The sequence of operations inside an MLB is controlled by a μ-code controller referred to as a schedule table
Tasks mapped to different MLBs communicate via a programmable, hierarchical interconnect
The interconnect is time-multiplexed & shared among multiple MLBs
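The schedule-table-driven operation of an MLB can be sketched as a toy model (all names, fields, and the two operations below are illustrative assumptions, not the paper's actual μ-code format):

```python
# Minimal, hypothetical model of an MLB driven by a schedule table:
# each entry names a cycle, an operation (LUT read or datapath op),
# register-file source operands, and a register-file destination.
from dataclasses import dataclass

@dataclass
class ScheduleEntry:
    cycle: int
    op: str      # "lut_read" or "add" (illustrative op set)
    src: tuple   # register-file operand indices
    dst: int     # register-file destination index

schedule = [
    ScheduleEntry(0, "lut_read", (0, 1), 2),   # cycle 0: look up (r0, r1) -> r2
    ScheduleEntry(1, "add",      (2, 3), 4),   # cycle 1: r2 + r3 -> r4
]

def run(schedule, regs, lut):
    """Step through the schedule table cycle by cycle."""
    for e in sorted(schedule, key=lambda s: s.cycle):
        a, b = (regs[i] for i in e.src)
        if e.op == "lut_read":
            regs[e.dst] = lut[(a, b)]       # dense memory array as a LUT
        elif e.op == "add":
            regs[e.dst] = a + b             # custom-datapath arithmetic
    return regs

regs = {0: 1, 1: 2, 3: 10}
lut = {(1, 2): 7}                           # a tiny multi-input LUT
print(run(schedule, regs, lut)[4])          # 17
```

The point of the sketch is the division of labor: the memory array serves LUT operands, the datapath handles arithmetic, and the schedule table sequences both against the local register file.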
13. Fig 2: (a) Application mapping flow for MAHA
(b) μ-arch details of a single computing block (MLB)
(c) Synchronization among multiple MLBs over shared interconnect
14. Sig1 & Sig2 are the outputs of MLBs A & B at the end of cycle 1
Sig3 & Sig4 are the outputs at the end of cycle 2
The signals at the end of each cycle are transmitted over the same local/global interconnect to MLB C
Significant gains in energy efficiency can be obtained by computing inside the NVM
MAHA is an attractive low-overhead, energy-efficient candidate for in-memory computing
In the NVM-based MAHA model:
o Multiple NVM arrays are grouped to form a single MLB
o Each MLB processes its local data and communicates with other MLBs
o Data is distributed to multiple MLBs through a flash translation layer that maps logical addresses to physical locations in the NVM
o Static CMOS logic is integrated with the NVM to realize an MLB
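The cycle-by-cycle sharing of the interconnect described above can be sketched as follows (a minimal illustration; the schedule encoding is an assumption, while the signal names and cycles follow Fig. 2(c)):

```python
# One shared bus, reused each cycle: MLB A and B's outputs reach MLB C
# in successive cycles rather than over dedicated wires.
bus_schedule = {1: ["Sig1", "Sig2"], 2: ["Sig3", "Sig4"]}

received = []
for cycle in sorted(bus_schedule):
    received.extend(bus_schedule[cycle])   # same physical bus every cycle
print(received)
```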
15. COMPARISON WITH ALTERNATE ACCELERATORS
• Computing Model
Frameworks without inherent hardware support for spatio-temporal computing - FPGA, Chimaera, PipeRench & RaPiD
Frameworks that support spatio-temporal execution - MATRIX, MorphoSys
MAHA is also a spatio-temporal computing framework
• Granularity of computations
Defined as the width of the smallest PE
Based on granularity, frameworks are classified as
o Fine-grained
o Coarse-grained
o Mixed-granular
MAHA is a mixed-granular computing framework
16. • Computing Fabric
Hardware accelerators proposed earlier used fine-grained 1-D lookup tables
MAHA uses memory for storing and mapping one or more multiple-input, multiple-output LUTs
• Target Application Domain
Hardware accelerators proposed earlier target a wide application space: bit-level computations, signal processing, image processing
MAHA improves system energy for a variety of data-intensive applications
17. NAND FLASH - A CASE STUDY
• Hardware architecture for an off-chip MAHA framework based on a CMOS-compatible single-level cell (SLC) NAND flash memory array
• CMOS compatibility allows
Integration of MLB controllers, registers, datapath and PI
Realization using CMOS logic
• SLC is considered due to the availability of open-source area, power & delay models
18. • OVERVIEW OF CURRENT FLASH ORGANISATION
Organisation of NAND flash memory with the flash array & a number of logic structures
For a normal flash read:
o 8-b or 16-b I/O bandwidth
o Organized in units of pages & blocks
o Page size - 2 KB
o Each block has 64-128 pages
o The block decoder first selects one of the blocks
o The page decoder selects one of the pages
o The content of the entire page is first read into the page register
o Then transferred to the flash external interface
Table 2: Flash organization and performance
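The block-decoder/page-decoder read flow above amounts to a flat address decode, sketched here as a hypothetical helper (the 2 KB page is from this slide; the 64 pages/block figure is one end of the slide's 64-128 range, chosen as an assumption):

```python
# Hypothetical address decode mirroring the flash read flow:
# block decoder -> page decoder -> byte offset within the page.
PAGE_SIZE = 2048        # 2 KB page, as stated on the slide
PAGES_PER_BLOCK = 64    # lower end of the slide's 64-128 range (assumption)

def decode(byte_addr):
    """Split a flat byte address into (block, page, offset)."""
    page_index, offset = divmod(byte_addr, PAGE_SIZE)
    block, page = divmod(page_index, PAGES_PER_BLOCK)
    return block, page, offset

print(decode(2048 * 64 + 5))   # (1, 0, 5): second block, first page, byte 5
```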
19. Figure 3: Modifications to conventional flash memory to realize the MAHA framework.
A small control engine outside the memory array is added to initiate & synchronize parallel operations inside the memory array
20. • MODIFICATIONS TO FLASH ARRAY ORGANIZATION
The modifications achieve on-demand computation
Without affecting normal read/write operation
1) Compute Logic Modifications
o A group of N flash blocks is clustered to form a single MLB
o Within an MLB, blocks are logically divided into LUT blocks & data blocks
o MLB control logic & custom datapath are implemented using static CMOS logic
o A custom dual-ported asynchronous-read register file stores intermediate outputs
o Pass-gate multiplexors & keeper transistors are used for selecting operands for the LUT
o For a normal NAND flash read, an entire page is read at once (2 KB)
21. o For LUT operations, a wide read is avoided due to the smaller operand sizes
o We propose a narrow-read scheme for LUT blocks in which a fraction of a page is read at a time
o Word-line segmentation incurs a hardware overhead
o To minimize this overhead, we read only 64-b words from each block at a time
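The narrow-read saving can be quantified roughly (hedged: this assumes read cost scales with the number of bits sensed, which the slide's word-line-capacitance argument suggests but does not state exactly):

```python
# Rough ratio of bits sensed: full 2 KB page read vs. one 64-bit narrow read.
page_bits = 2048 * 8    # 2 KB page
narrow_bits = 64        # one narrow-read word

print(page_bits // narrow_bits)   # 256x fewer bits per LUT access
```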
22. o Advantage - it improves energy efficiency by lowering the word-line capacitance
o Combinational logic is used to switch between the narrow read for MAHA operation & the full-page read for normal flash operation
o It is used with the narrow-read decoder to control the AND gates for segmentation
o Segmentation for data blocks is coarse, with 4096 bits being read out from each page and stored in buffers
o A group of such LUT and data blocks constitutes one MLB
o The two planes of the flash array are logically divided into 8 banks, each consisting of 2 MLBs
o Each MLB contains
a. 256 blocks of flash memory
b. 1 LUT block
c. 255 data blocks
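The organization on this slide works out as follows (all figures are from the slide; only the variable names are illustrative):

```python
# Totals implied by the slide's flash-array organization.
BANKS, MLBS_PER_BANK = 8, 2          # two planes -> 8 banks of 2 MLBs
BLOCKS_PER_MLB, LUT_BLOCKS = 256, 1  # per MLB: 256 flash blocks, 1 LUT block

mlbs = BANKS * MLBS_PER_BANK
data_blocks = BLOCKS_PER_MLB - LUT_BLOCKS
print(mlbs, data_blocks)   # 16 MLBs in total, 255 data blocks each
```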
23. Figure 4: Modified flash memory array for on-demand reconfigurable computing.
The memory blocks are augmented with local control and compute logic to act as a hardware reconfigurable unit
24. 2) Routing logic modifications
o Each block communicates with the page register over a shared bus
o To minimize the inter-MLB PI overhead, a set of hierarchical buses is used, with a multiplexer at each level to select the source of incoming data
o 4 levels - banks, sub-banks, subarrays
Figure 5: Hierarchical interconnect architecture connecting a group of MLBs
25. SOFTWARE ARCHITECTURE
• The figure shows application mapping for the proposed acceleration platform
• Mapper (the application mapping tool) was developed in C
• Key features of the software flow:
1) Description of the input application using an ISA
An instruction set is defined for the proposed MAHA framework that supports common control as well as data flow operations
Operation types supported by the software architecture:
o bitswC
o bits
o mult
o shift and rotate
o sel
o complex
o load & store
26.
Figure 6: Application mapping flow for the proposed MAHA framework
27.
2) Application mapping to a mixed-granular, time-multiplexed
computing fabric
The mapping process includes two key contributions:
1) Decomposition of fine- and coarse-grained operations
o When a load/store operation is decomposed, memory is allocated in one
or more MLBs depending on the address size used for the load/store and
the number of data blocks present inside each MLB
2) Fusing multiple LUT as well as custom-datapath operations
o There are 3 fusion routines:
1) Fusion of random LUT-based operations
2) Fusion of bit-sliceable operations
3) Fusion of custom-datapath operations
o In all of these, the decomposed CDFG is first partitioned into one or more
groups of vertices
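As an illustrative sketch of decomposing a wide operation into LUT-sized slices (the 4-bit slice width, 16-bit operand width, and the choice of XOR are assumptions for the example, not parameters from the paper):

```python
# Illustrative sketch: decompose a 16-bit bitwise op into 4-bit LUT slices.
# Slice width (4), operand width (16), and the XOR op are placeholders.
SLICE = 4

def decompose_xor(a, b, width=16):
    """Evaluate a XOR b slice by slice, as LUT-mapped hardware would."""
    mask = (1 << SLICE) - 1
    result = 0
    for i in range(0, width, SLICE):
        sa, sb = (a >> i) & mask, (b >> i) & mask
        result |= (sa ^ sb) << i      # each slice is one LUT lookup
    return result

# The sliced evaluation matches the monolithic operation:
assert decompose_xor(0xABCD, 0x1234) == 0xABCD ^ 0x1234
```

Bit-sliceable operations like this fuse cleanly because slices are independent; operations with carry chains (e.g. addition) need extra inter-slice signals and are handled differently.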
28.
3) Placement & routing for the hierarchical interconnect model:
The software tool places the MLBs in a hierarchical fashion such that
the number of inputs and outputs crossing each module is minimized
In the bi-partitioning approach, MLBs are first allocated to the first-
level modules, then distributed among the second-level modules
This continues until each MLB has been mapped to a
lowermost memory module
Signals in the CDFG are routed in the following order:
1) Signals which cross each level of the memory hierarchy
2) Primary outputs from each MLB, for all levels of the cyclic
schedule
3) Primary inputs to each MLB, for all levels of the cyclic
schedule
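The recursive bi-partitioning idea can be sketched minimally (the naive even split below is a placeholder for the real heuristic, which minimizes the I/O crossing each module):

```python
# Illustrative sketch of recursive bi-partitioning: the MLB list is split
# in half at each hierarchy level until each MLB maps to a lowermost
# module. The naive even split stands in for the cut-minimizing heuristic.
def bipartition(mlbs, levels):
    if levels == 0 or len(mlbs) <= 1:
        return mlbs
    mid = len(mlbs) // 2
    return [bipartition(mlbs[:mid], levels - 1),
            bipartition(mlbs[mid:], levels - 1)]

tree = bipartition(list(range(8)), 3)
print(tree)  # -> [[[[0], [1]], [[2], [3]]], [[[4], [5]], [[6], [7]]]]
```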
29.
4) Functional validation of the proposed framework:
The bit file generation routine accepts the placed and routed netlist and
produces the control or select bits for the following:
1) Configuration of the programmable switches
2) Schedule table entries which control the sequence of
operations inside each MLB
3) LUT entries to be loaded into the function table
The bit file generated by the tool can be directly loaded into the
function table
30.
RESULTS
A. Design space exploration for MAHA
B. Energy, performance, and overhead estimation
Estimate the design overhead for the entire MLB as well as for the inter-MLB PI
Map the benchmark applications to the MAHA framework
Calculate the area overhead, performance, and energy
requirements for each configuration, and select the best configuration
Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) +
intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
Area of a single block of the flash array = 5 × F² × (Npages) × (page size)
Since the LUT block is separate from the data blocks, its area overhead is different
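The timing and area figures above can be checked with a quick sketch (the delay components and the area formula are from the slides; the feature size and page counts in the example are hypothetical placeholders):

```python
# Sketch: MAHA cycle time and flash-block area from the slides' figures.
precharge_ns, intra_mlb_ns, inter_mlb_ns = 12, 3, 5
cycle_ns = precharge_ns + intra_mlb_ns + inter_mlb_ns
assert cycle_ns == 20   # matches the 20 ns MAHA cycle time on the slide

def flash_block_area(F_nm, n_pages, page_size_bits):
    """Area = 5 * F^2 * Npages * pagesize (F in nm, area in nm^2)."""
    return 5 * F_nm**2 * n_pages * page_size_bits

# Hypothetical example: 45 nm feature size, 64 pages, 2 KB (16384-bit) pages
print(flash_block_area(45, 64, 2048 * 8))
```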
31.
C. Selection of optimal MAHA configuration
The parameters noted are:
o Area overhead
o Latency
o Number of MLBs required to map the application
o Total energy dissipation in the MLBs
o Area & energy for the inter-MLB PI
o Size of the reconfiguration data
o Final configuration
Figure 7: (a) Relative contribution of different components to the total area of the
modified flash; (b) relative contribution of memory & logic components
32.
D. Energy & performance for mapped applications
Mapping results are shown for a single CDFG instantiation of each selected
benchmark mapped to the final MAHA hardware configuration
For MAHA, the average PI energy is low compared with the average MLB
logic energy
33.
E. Comparison with a conventional GPP
1) Reduction in On-chip & off-chip communication
2) Improvement in execution latency
3) Improvement in energy
4) Improvement in EDP
34.
F. Comparison with FPGA & GPU
G. Hardware emulation based validation
On average, MAHA improves the energy requirement by 74% & 84%
over the FPGA & GPU frameworks, respectively
MAHA eliminates the high energy overhead of transferring data from off-chip
memory to the FPGA or GPU
We developed an FPGA-based emulation framework, which validates:
1) Functionality & synchronization of multiple MLBs for several
application kernels
2) Interfacing of the MAHA framework with the host processor
The emulation framework consists of 2 FPGA boards: a DE0 running the host
CPU, & a DE4 consisting of 3 main components:
35.
o the MAHA framework
o a flash controller
o on-board flash memory
The two boards communicate over a 3-wire SPI in a simple master/slave
configuration
The slave queries the flash for all available kernels, and, upon finding a
match, begins transferring the configuration bits & data for processing to
the MAHA framework
If no match is found, the slave immediately responds with an error code
Otherwise, the slave interrupts the host CPU only once processing is complete
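The slave-side handshake described above can be sketched as follows (the kernel names, error code string, and callback signatures are illustrative placeholders, not from the paper):

```python
# Illustrative sketch of the slave side of the master/slave handshake.
# Kernel names, the error code, and the callbacks are placeholders.
def offload(kernel, available_kernels, transfer, interrupt_host):
    """Query the flash for the requested kernel, then act on the result."""
    if kernel not in available_kernels:
        return "ERR_NO_KERNEL"        # immediate error response to the host
    transfer(kernel)                  # send config bits + data to MAHA
    interrupt_host()                  # interrupt host only on completion
    return "OK"

log = []
status = offload("fir", {"fir", "aes"},
                 transfer=lambda k: log.append(f"xfer:{k}"),
                 interrupt_host=lambda: log.append("irq"))
print(status, log)  # -> OK ['xfer:fir', 'irq']
```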
36.
Figure 8: (a) Overview of off-chip acceleration with the MAHA framework;
(b) system architecture for the FPGA-based hardware emulation framework;
(c) improvement in latency & energy with MAHA-based off-chip acceleration
37. DISCUSSION
Before mapping a kernel to an in-memory accelerator, key application &
system primitives can be used to determine whether it will benefit from in-
memory acceleration. These are listed below:
1) g – fraction of total instructions with memory references (loads and stores);
2) f – fraction of total instructions transferred to the off-chip compute engine;
3) c – fraction of instructions translated from the host's ISA to the ISA of the
off-chip compute framework;
4) o – fraction of original instructions which result in an output. A fraction
f × c × o thus produces outputs, which need to be transferred to the host
processor;
5) e_offchip – average energy per instruction in the off-chip compute engine;
6) e_txfer – energy expended in the transfer of an output from the off-chip
framework to the host processor;
7) t_offchip – ratio of the cycle time of the off-chip compute framework to that
of the host processor;
8) n – fraction of speedup due to parallelism in the framework;
9) t_txfer – time taken, in processor clock cycles, to transfer an output from
the off-chip compute framework to the host processor.
38. T_sys = T_offchip + T_proc + T_txfer
E_sys = E_offchip + E_proc + E_txfer
Figure 9: Energy & performance for a hybrid system with a host processor &
an off-chip memory based hardware accelerator
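Using the parameters from the Discussion, the additive model can be sketched numerically. How each term combines the fractions below (e.g. scaling the off-chip time by 1 − n for parallelism, and the transfer terms by f × c × o) is my assumption for illustration, not a formula from the paper:

```python
# Sketch of Tsys = Toffchip + Tproc + Ttxfer and
# Esys = Eoffchip + Eproc + Etxfer.
# The per-term expressions are illustrative assumptions: f*c of the N
# instructions run off-chip, (1 - f) stay on the host, and a fraction
# f*c*o produce outputs transferred back to the host.
def system_cost(N, f, c, o, e_offchip, e_proc, e_txfer,
                t_offchip, t_proc, t_txfer, n):
    T_off = f * c * N * t_offchip * (1 - n)   # parallelism speedup n
    T_proc = (1 - f) * N * t_proc
    T_tx = f * c * o * N * t_txfer
    E_off = f * c * N * e_offchip
    E_proc = (1 - f) * N * e_proc
    E_tx = f * c * o * N * e_txfer
    return T_off + T_proc + T_tx, E_off + E_proc + E_tx

# Hypothetical parameter values, purely for illustration:
T, E = system_cost(N=1000, f=0.6, c=0.9, o=0.2, e_offchip=0.3, e_proc=1.0,
                   e_txfer=0.5, t_offchip=2.0, t_proc=1.0, t_txfer=4.0, n=0.5)
```

The trade-off Figure 9 explores follows directly: offloading pays off only when the saved on-host time and energy exceed the added transfer terms.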
39.
CONCLUSION
• MAHA, a hardware acceleration framework, was presented
• It greatly improves energy efficiency for data-intensive applications by
transferring the computing kernel to the last level of memory
• Design considerations for modifying an SLC NAND flash memory for on-demand
reconfigurable computing are presented
• Improvement in energy efficiency
• Better efficiency compared to FPGA & GPU
• Future research efforts can be directed at optimizing the MLB
architecture, interconnect topology & mapper software
40.
REFERENCES
S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, "MAHA: An Energy-Efficient
Malleable Hardware Accelerator for Data-Intensive Applications"
V. Govindaraju, C.-H. Ko, and K. Sankaralingam, "Dynamically specialized
datapaths for energy efficient computing," in Proc. IEEE 17th Int. Symp. High
Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503–514
and more...
41.
THANK YOU