WELCOME
MAHA : An Energy Efficient Malleable
Hardware Accelerator For Data
Intensive Applications
Grace Abraham
Roll No: 01
VLSI & ES
CONTENTS
Dept. of ECE 3
MAHA : Malleable Hardware Accelerator
29/07/2015
• INTRODUCTION
• BACKGROUND AND MOTIVATION
• MAHA - OVERALL APPROACH
• NAND FLASH – A CASE STUDY
• SOFTWARE ARCHITECTURE
• RESULTS
• CONCLUSION
INTRODUCTION
• In nanometer technologies, power has emerged as the primary
design constraint
• Ever-increasing demand for low power and high performance
• Von-Neumann bottleneck (back & forth data transfer) is a barrier to
performance & energy scaling
• To improve efficiency, use explicit parallelism
• Energy overhead due to data transfer from off-chip to on-chip
memory
 Low Bandwidth
 High latency
 High energy
• To overcome this, a Malleable Hardware Accelerator (MAHA) is
introduced
• MAHA :
 Implements a reconfigurable computing fabric in last level memory
 Enables computing within off-chip memory
Fig 1 : Von-Neumann bottleneck and proposed MAHA framework
• Choice of NAND flash technology for demonstration
• Previous investigations on Processing in memory (PIM)
• MAHA differs from PIM architecture
 Achieves on-demand computation by design modifications to the
off-chip nonvolatile memory organization
 High energy efficiency through parallelism & dynamic customization
• MAHA for data intensive applications
• Area and energy overheads are accurately estimated
• An efficient software flow for mapping applications to MAHA is
presented
• The following sections cover
 Von-Neumann bottleneck barrier
 MAHA & its hardware architecture
 Realization with a CMOS compatible NAND flash memory
 Evaluation results for MAHA
BACKGROUND & MOTIVATION
• PERFORMANCE BARRIER DUE TO VON-NEUMANN BOTTLENECK
• ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS
 Off-chip BW scales poorly in comparison to on-chip transistor density
 On-chip density is likely to improve by 16X from 2011 to 2022
 Off-chip BW is expected to improve only by 40%
 BW available inside the flash array is 4.2 × 10^5 GB/s; in contrast, BW at a 16-bit
flash interface is only 100 MB/s
 Managing latency and energy for memory to achieve energy efficiency
 To identify major hurdles to energy scaling
o Performance of ten common kernels was simulated
o System-level performance metrics, such as cache hit/miss frequency, were noted
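The size of the bandwidth gap quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that reads the internal figure as 4.2 × 10^5 GB/s (an assumption about the flattened exponent on the slide) and assumes 1 GB = 1000 MB; the function names are illustrative only.

```python
# Figures quoted on the slide; the exponent reading (4.2e5) is an assumption.
INTERNAL_BW_GB_S = 4.2e5     # bandwidth available inside the flash array
INTERFACE_BW_MB_S = 100.0    # bandwidth at a 16-bit flash interface


def bandwidth_gap(internal_gb_s: float, interface_mb_s: float) -> float:
    """Return how many times larger the internal bandwidth is (1 GB = 1000 MB)."""
    return (internal_gb_s * 1000.0) / interface_mb_s


def scaling_gap(density_gain: float, bandwidth_gain: float) -> float:
    """Ratio by which on-chip density growth outpaces off-chip bandwidth growth."""
    return density_gain / bandwidth_gain


gap = bandwidth_gap(INTERNAL_BW_GB_S, INTERFACE_BW_MB_S)
# 16X density improvement versus a 40% (i.e. 1.4x) bandwidth improvement
scaling = scaling_gap(16.0, 1.4)
print(f"internal/interface bandwidth ratio: {gap:.2e}")
print(f"density growth / bandwidth growth: {scaling:.1f}")
```

Under these readings the internal bandwidth is several orders of magnitude above what the external interface can deliver, which is the motivation for computing inside the memory.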
 From Table 1,
o 73% of total energy expended is contributed by access to on-chip instruction & data
cache
o 26% invested in useful computations, including fetch and decode operations
Table 1 : Energy breakdown for a conventional processor executing common computational kernels
• MITIGATING VON-NEUMANN BOTTLENECK THROUGH IN-MEMORY COMPUTING
 75% of energy in a processor is dissipated in data transport
 Optimizing the compute model for data-intensive tasks can cause
large improvements in energy efficiency
 Two implications for compute model
o Relocate compute resources closer to last level of nonvolatile storage, which
minimizes overhead for data transfer to on-chip execution units
o Replace conventional software pipeline & caches with distributed memory
infrastructure, which minimizes memory & interconnect power dissipation
MAHA-OVERALL APPROACH
 HARDWARE ARCHITECTURE
• MAHA is a hardware reconfigurable framework
• Consists of an array of processing elements (PEs)
• Communication using a hierarchical interconnect architecture
• Target application to be mapped is represented as Control &
data flow graph (CDFG)
• Software flow partitions CDFG into smaller multiple-input
multiple output tasks
• Tasks are mapped to individual PEs
1) COMPUTE LOGIC
 Each compute block or PE is referred to as memory logic block (MLB)
 A single MLB includes a dense 2D memory array which stores lookup
tables & data
 A custom datapath with arithmetic units
 A local register file for storing temporary outputs from memory
 Sequence of operations inside an MLB is controlled by a μ-code
controller referred to as a schedule table
2) INTERCONNECT FABRIC
 Tasks mapped to different MLBs communicate via a programmable &
hierarchical interconnect
 Interconnect is time-multiplexed & shared among multiple MLBs
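The MLB description above (memory array holding lookup tables, a datapath, a local register file, and a schedule table sequencing the operations) can be pictured with a toy software model. This is only a sketch under stated assumptions: the `ScheduleEntry` encoding, the two operation kinds and the 8-entry register file are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ScheduleEntry:
    """One step of the schedule table (hypothetical encoding)."""
    kind: str    # "lut" = lookup in the memory array, "add" = custom datapath
    src: tuple   # register indices used as operands
    dst: int     # register index that receives the result


@dataclass
class MLB:
    lut: dict        # lookup table stored in the memory array
    schedule: list   # ordered schedule table entries
    regs: list = field(default_factory=lambda: [0] * 8)  # local register file

    def run(self) -> list:
        """Execute the schedule table once, one entry per cycle."""
        for entry in self.schedule:
            operands = tuple(self.regs[i] for i in entry.src)
            if entry.kind == "lut":
                result = self.lut[operands]
            elif entry.kind == "add":
                result = sum(operands)
            else:
                raise ValueError(f"unknown operation kind: {entry.kind}")
            self.regs[entry.dst] = result
        return self.regs


# A 2-input AND stored as a lookup table, followed by an add in the datapath.
and_lut = {(a, b): a & b for a in (0, 1) for b in (0, 1)}
mlb = MLB(lut=and_lut, schedule=[
    ScheduleEntry("lut", (0, 1), 2),   # r2 = LUT[r0, r1]
    ScheduleEntry("add", (2, 3), 4),   # r4 = r2 + r3
])
mlb.regs[0], mlb.regs[1], mlb.regs[3] = 1, 1, 5
final_regs = mlb.run()
```

The point of the sketch is the control structure: the schedule table, not a fetched instruction stream, decides in each cycle whether the memory array or the datapath produces the next value.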
Fig 2 : (a) Application mapping flow for MAHA
(b) μ-arch details of a single computing block (MLB)
(c) Synchronization among multiple MLBs over shared interconnect
 Sig1 & Sig2 are outputs of MLB A & B at end of cycle 1
 Sig3 & Sig4 are outputs at end of cycle 2
 Signals at end of each cycle are transmitted over the same local/global
interconnect to MLB C
 Significant gains in energy efficiency can be obtained by computing
inside the NVM
 MAHA is an attractive low-overhead & energy efficient candidate for
in-memory computing
 In NVM-based MAHA model,
o Multiple NVM arrays are grouped to form a single MLB
o Each MLB processes its local data & communicates with other MLBs
o Distribution of data to multiple MLBs through flash translation layer for mapping
logical address to a physical location in NVM
o Static CMOS logic integrated with NVM to realize MLB
 COMPARISON WITH ALTERNATE ACCELERATORS
• Computing Model
 Frameworks that do not have inherent hardware support for spatio-temporal
computing - FPGA, Chimaera, PipeRench & RaPiD
 Frameworks that support spatio-temporal execution - MATRIX,
MorphoSys
 MAHA is also a spatio-temporal computing framework
• Granularity of computations
 Defined as width of smallest PE
 Based on granularity, frameworks are classified as
o Fine-grained
o Coarse-grained
o Mixed granular
 MAHA is a mixed granular computing framework
• Computing Fabric
 Hardware accelerators proposed earlier used fine grained 1-D lookup
tables
 MAHA uses memory for storage & mapping 1 or more multiple-input
multiple-output LUTs
• Target Application Domain
 Hardware accelerators proposed earlier target a wide application
space: bit-level computations, signal processing, image processing
 MAHA improves system energy for a variety of data-intensive
applications
NAND FLASH – A CASE STUDY
• Hardware architecture for an off-chip MAHA framework based
on CMOS-compatible single level cell (SLC) NAND flash memory
array
• CMOS compatibility allows
 Integration of MLB controllers, registers, datapath and PI
 Realization using CMOS logic
• Due to availability of open-source area, power & delay models
SLC is considered
• OVERVIEW OF CURRENT FLASH ORGANISATION
 Organisation of NAND flash memory with flash array & a number of logic
structures
o 8-b or 16-b I/O bandwidth
o Organized in units of pages & blocks
o Page size – 2KB
o Each block has 64-128 pages
 For normal flash read,
o Block decoder first selects one of the blocks
o Page decoder selects one of the pages
o Content of entire page is first read into page register
o Then transferred to flash external interface
Table 2 : Flash organization and performance
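The block/page selection performed for a normal flash read can be sketched as plain address arithmetic. The sketch assumes a flat physical byte address, the 2KB page size quoted above, and 64 pages per block (the lower end of the 64-128 range); it ignores the flash translation layer, and `decode_address` is a hypothetical helper.

```python
PAGE_SIZE_BYTES = 2048   # 2KB page, as quoted on the slide
PAGES_PER_BLOCK = 64     # assumption: lower end of the 64-128 range


def decode_address(byte_addr: int) -> tuple:
    """Split a flat byte address into (block, page, offset within page)."""
    if byte_addr < 0:
        raise ValueError("address must be non-negative")
    page_number, offset = divmod(byte_addr, PAGE_SIZE_BYTES)
    block, page = divmod(page_number, PAGES_PER_BLOCK)   # block decoder, page decoder
    return block, page, offset


# First byte of the device, first byte of block 1, and a byte inside block 1 / page 1.
first = decode_address(0)
block_one = decode_address(PAGE_SIZE_BYTES * PAGES_PER_BLOCK)
inside = decode_address(PAGE_SIZE_BYTES * (PAGES_PER_BLOCK + 1) + 10)
```

In a normal read the whole selected page would then be copied to the page register, even if only the bytes at `offset` are needed.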
Figure 3: Modifications to conventional flash memory to realize MAHA framework.
A small control engine outside the memory array is added to initiate & synchronize parallel operations
inside the memory array
• MODIFICATIONS TO FLASH ARRAY ORGANIZATION
 Modifications to achieve on-demand computation
 Without affecting normal read/write operation
1) Compute Logic Modifications
o Group of N flash blocks are clustered to form a single MLB
o In MLB, blocks are logically divided into LUT blocks & data blocks
o MLB control logic & custom datapath implemented using static CMOS logic
o A custom dual ported asynchronous read register file for storing intermediate
outputs
o Pass-gate multiplexers & a keeper transistor are used for selecting operands
for LUT
o For Normal NAND flash read, entire page is read at once (2KB)
o For LUT operations, due to smaller operand sizes a wide read is avoided
o We propose a narrow-read scheme for LUT blocks in which a fraction of a
page is read at a time
o Hardware overhead due to word line segmentation
o To minimize overhead, we read only 64-b words from each block at a time
o Advantage – It improves energy efficiency by lowering word line capacitance
o Combinational logic is used to switch between narrow read for MAHA
operation & full page read for normal flash operation
o They are used with narrow read decoder to control the AND gate for segmentation
o Segmentation for data blocks is coarse with data sizes of 4096 bits being read out
from each page and stored inside buffers
o A group of such LUT and data blocks constitute 1 MLB
o Two planes of the flash array are logically divided into 8 banks, each consisting of
2 MLBs
o Each MLB contains
a. 256 blocks of flash memory
b. 1 LUT block
c. 255 data blocks
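The organization figures above can be tallied in a short calculation. This sketch assumes the 8 banks are the total across both planes and that the 2KB page size from the conventional flash organization also applies here; the constant names are illustrative.

```python
BANKS = 8                 # total banks across the two planes (assumed reading)
MLBS_PER_BANK = 2
BLOCKS_PER_MLB = 256
LUT_BLOCKS_PER_MLB = 1
PAGE_BITS = 2048 * 8      # assumption: 2KB page
LUT_READ_BITS = 64        # narrow read for LUT blocks
DATA_READ_BITS = 4096     # coarse segmentation for data blocks

total_mlbs = BANKS * MLBS_PER_BANK
total_blocks = total_mlbs * BLOCKS_PER_MLB
data_blocks_per_mlb = BLOCKS_PER_MLB - LUT_BLOCKS_PER_MLB

# Fraction of a full page touched by each kind of read.
lut_read_fraction = LUT_READ_BITS / PAGE_BITS
data_read_fraction = DATA_READ_BITS / PAGE_BITS

print(f"{total_mlbs} MLBs, {total_blocks} flash blocks in total")
print(f"LUT read touches 1/{PAGE_BITS // LUT_READ_BITS} of a page, "
      f"data read touches 1/{PAGE_BITS // DATA_READ_BITS} of a page")
```

The narrow read therefore activates only a small slice of the word line per LUT access, which is where the reduction in word line capacitance comes from.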
Figure 4: Modified flash memory array for on-demand reconfigurable computing.
The memory blocks are augmented with local control and compute logic to act as a
hardware reconfigurable unit
2) Routing logic modifications
o Each block communicates with the page register over a shared bus
o To minimize the inter MLB PI overhead, a set of hierarchical buses with a
at each level to select the source of incoming data
o 4 levels – banks, sub banks, subarrays
Figure 5 : Hierarchical interconnect architecture to connect a group of MLB’s
SOFTWARE ARCHITECTURE
• Figure shows application mapping for the proposed
acceleration platform.
• Mapper (application mapping tool ) was developed in C
• Key features of software flow are
1) Description of input application using an ISA
 Define an instruction set for the proposed MAHA framework that
supports common control as well as data flow operations
 Operation types that are supported by software architecture :
o bitswC
o bits
o mult
o shift and rotate
o sel
o complex
o load & store
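One way to picture an input application described with these operation types is as a list of CDFG nodes tagged with an operation type. The sketch below is illustrative only: the slide names the operation types but not their encoding or semantics, so the `OpType` member names, the `Node` structure and both helper functions are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class OpType(Enum):
    """Operation types named on the slide; the member names are hypothetical."""
    BITSWC = "bitswC"
    BITS = "bits"
    MULT = "mult"
    SHIFT_ROTATE = "shift and rotate"
    SEL = "sel"
    COMPLEX = "complex"
    LOAD_STORE = "load & store"


@dataclass(frozen=True)
class Node:
    name: str
    op: OpType
    inputs: tuple = ()   # names of the nodes that produce this node's operands


def check_dataflow(nodes) -> bool:
    """Verify every input refers to a node defined earlier in the list."""
    seen = set()
    for node in nodes:
        missing = [i for i in node.inputs if i not in seen]
        if missing:
            raise ValueError(f"{node.name} uses undefined inputs: {missing}")
        seen.add(node.name)
    return True


def op_histogram(nodes) -> Counter:
    """Count how many nodes use each operation type."""
    return Counter(node.op for node in nodes)


cdfg = [
    Node("a", OpType.LOAD_STORE),
    Node("b", OpType.LOAD_STORE),
    Node("p", OpType.MULT, ("a", "b")),
    Node("s", OpType.SEL, ("p", "a")),
]
histogram = op_histogram(cdfg)
```

A histogram like this is the kind of summary a mapper could use to decide how many operations go to LUTs versus the custom datapath, though the slides do not say how the actual tool does it.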
Figure 6 : Application mapping flow for proposed MAHA framework
2) Application Mapping to a mixed-granular time-multiplexed
computing fabric
 The mapping process includes 2 key contributions
1) Decomposition of fine & coarse grained operations
o During decomposition of load/store operation, memory is allocated in 1
or more MLBs depending on the address size used for load/store & no.
of data blocks present inside each MLB
2) Fusing multiple LUT as well as custom datapath operations
o 3 fusion routines
1) Fusion of random LUT based operations
2) Fusion of bit-sliceable operations
3) Fusion of custom-datapath operations
o In all these, decomposed CDFG is first partitioned into 1 or more vertices
3) Placement & routing for hierarchical interconnect model :
 Software tool places the MLBs in hierarchical fashion such that
no. of inputs & outputs crossing each module is minimized
 In bi-partitioning approach, MLBs are first allocated to the first
level modules, then distributed among second-level modules
 This continues until each MLB has been mapped to the
lowermost memory module
 Routing of signals in the CDFG is performed in the following order
1) Routing of signals which cross each level of the memory hierarchy
2) Routing of primary outputs from each MLB for all levels of the cyclic
schedule
3) Routing of primary inputs to each MLB for all levels of the cyclic
schedule
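The hierarchical placement idea (split the MLBs in two so that few signals cross the split, then repeat inside each half) can be sketched with a tiny recursive bi-partitioner. The slides do not describe the heuristic the tool actually uses; the exhaustive search below is a stand-in that is only practical for a handful of MLBs, and all function names are hypothetical.

```python
from itertools import combinations


def cut_size(part_a, part_b, edges) -> int:
    """Count signals (edges) that cross between the two partitions."""
    a, b = set(part_a), set(part_b)
    return sum(1 for u, v in edges
               if (u in a and v in b) or (u in b and v in a))


def bipartition(mlbs, edges):
    """Exhaustively pick the balanced split with the fewest crossing signals."""
    mlbs = sorted(mlbs)
    half = len(mlbs) // 2
    best = None
    for left in combinations(mlbs, half):
        right = [m for m in mlbs if m not in left]
        cost = cut_size(left, right, edges)
        if best is None or cost < best[0]:
            best = (cost, list(left), right)
    return best[1], best[2]


def place(mlbs, edges, levels, path=()):
    """Recursively bi-partition for `levels` levels; returns {mlb: path in hierarchy}."""
    if levels == 0 or len(mlbs) <= 1:
        return {m: path for m in mlbs}
    left, right = bipartition(mlbs, edges)
    placement = place(left, edges, levels - 1, path + (0,))
    placement.update(place(right, edges, levels - 1, path + (1,)))
    return placement


# Four MLBs: A-B and C-D are tightly connected, B-C share a single signal.
signals = [("A", "B"), ("A", "B"), ("C", "D"), ("B", "C")]
placement = place(["A", "B", "C", "D"], signals, levels=2)
```

Each tuple in `placement` is the path through the hierarchy, so A and B share a first-level module while C and D share the other, and only the single B-C signal has to use the top-level interconnect.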
4) Functional validation of the proposed framework :
 Bit file generation routine accepts the placed & routed netlist &
generates the control or select bits for the following
1) Configuration for programmable switches
2) Schedule table entries which control the sequence of
operations inside each MLB
3) LUT entries to be loaded into the function table
 Bit file generated by the tool can be directly loaded into the
function table
RESULTS
A. Design space exploration for MAHA
B. Energy, Performance, and Overhead estimation
 Estimate design overhead for entire MLB as well as for inter-MLB PI
 Map the benchmark applications to the MAHA framework
 Calculate the area overhead, performance, and energy
requirements for each configuration & select the best configuration
 Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) +
intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
 Area of a single block of flash array = 5 × F^2 × (Npages) × (pagesize)
Since LUT block is separate from data blocks, area overhead is different
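The cycle-time budget and the block-area expression above are simple enough to evaluate directly. In this sketch F is assumed to be the technology feature size and the page size is assumed to be counted in bits (one cell per bit for SLC); neither assumption is stated on the slide, and the function names are illustrative.

```python
def cycle_time_ns(precharge_ns: float = 12.0,
                  intra_mlb_ns: float = 3.0,
                  inter_mlb_ns: float = 5.0) -> float:
    """Sum the three delay components quoted for one MAHA cycle."""
    return precharge_ns + intra_mlb_ns + inter_mlb_ns


def block_area(feature_size: float, n_pages: int, page_size_bits: int) -> float:
    """Area = 5 * F^2 * Npages * pagesize, in the squared unit of feature_size."""
    return 5 * feature_size ** 2 * n_pages * page_size_bits


cycle = cycle_time_ns()
max_clock_mhz = 1e3 / cycle          # 1 / (20 ns) expressed in MHz
# Area of one 64-page block with 2KB pages, in units of F^2 (feature_size = 1).
area_in_f2 = block_area(1, 64, 2048 * 8)
```

Expressing the area in units of F^2 keeps the estimate independent of the process node until a concrete feature size is plugged in.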
C. Selection of optimal MAHA configuration
 The parameters noted are :
o Area overhead
o Latency
o Number of MLBs required to map application
o Total energy dissipation in the MLBs
o Area & energy for inter-MLB PI
o Size of reconfiguration data
o Final configuration
Figure 7: (a) Relative contribution of different components to total area of modified
flash (b) Relative contribution of memory & logic components
D. Energy & performance for mapped applications
 Mapping results for a single CDFG instantiation for each of the selected
benchmarks mapped to final MAHA hardware configuration
 For MAHA, average PI energy is less compared with the average MLB
logic energy
E. Comparison with a conventional GPP
1) Reduction in On-chip & off-chip communication
2) Improvement in execution latency
3) Improvement in energy
4) Improvement in EDP
F. Comparison with FPGA & GPU
 On an average MAHA improves the energy requirement by 74% & 84%
over FPGA & GPU frameworks, respectively
 MAHA eliminates the high energy overhead for transferring data from
off-chip memory to FPGA or GPU
G. Hardware emulation based validation
 We developed an FPGA-based emulation framework, which validates
1) Functionality & synchronization of multiple MLBs for several
application kernels
2) Interfacing the MAHA framework with the host processor
 Emulation framework consists of 2 FPGA boards: a DE0 running a host
CPU, & a DE4 consisting of 3 main components
o MAHA framework
o Flash controller
o on board flash memory
 The 2 boards communicate over 3-wire SPI in a simple master/slave
configuration
 The slave queries the flash for all available kernels, & upon finding a match,
begins a transfer of the configuration bits & data for processing to the MAHA
framework
 If no match is found, the slave immediately responds with an error code
 Otherwise slave will only interrupt the host CPU
Figure 8 : (a) Overview for off-chip acceleration with MAHA framework
(b)System architecture for FPGA- based hardware emulation framework
(c) Improvement in latency & energy with MAHA –based off-chip acceleration
DISCUSSION
 Before mapping a kernel to an in-memory accelerator, key application &
system primitives can be used to determine whether it will benefit from
in-memory acceleration. These are listed below :
1) g : fraction of total instructions with memory reference (loads and stores);
2) f : fraction of total instructions transferred to a compute engine;
3) c : fraction of instructions translated from the host’s ISA to the ISA for the off-chip
compute framework
4) o : fraction of original instructions, which result in an output. A fraction f × c × o
thus produces outputs, which need to be transferred to the host processor;
5) e_offchip : average energy per instruction in the off-chip compute engine;
6) e_txfer : energy expended in the transfer of an output from the off-chip framework
to the host processor;
7) t_offchip : ratio of cycle time of the off-chip compute framework to that of the host
processor;
8) n : fraction of speedup due to parallelism in the framework
9) t_txfer : time taken in terms of processor clock cycles to transfer an output from the
off-chip compute framework to the host processor.
 T_sys = T_offchip + T_proc + T_txfer
 E_sys = E_offchip + E_proc + E_txfer
Figure 9 : Energy & performance for a hybrid system with a host processor &
off-chip memory based hardware accelerator
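The two system-level sums can be turned into a small calculator. The slides give only the top-level sums, so the way each term is expanded here is an assumption: off-chip energy is taken as (instructions × f × c) times the per-instruction off-chip energy, and the transfer terms as (instructions × f × c × o) times the per-output transfer energy or time. The processor terms and the off-chip execution time are passed in as totals, and every name in the code is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    n_instr: float   # total instructions in the original kernel (hypothetical input)
    f: float         # fraction of instructions transferred to the compute engine
    c: float         # translation ratio from host ISA to off-chip ISA
    o: float         # fraction of instructions that result in an output


def offchip_instructions(k: Kernel) -> float:
    return k.n_instr * k.f * k.c


def outputs_to_host(k: Kernel) -> float:
    """Outputs that must be sent back to the host: N * f * c * o."""
    return k.n_instr * k.f * k.c * k.o


def system_energy(k: Kernel, e_offchip: float, e_txfer: float, e_proc: float) -> float:
    """E_sys = E_offchip + E_proc + E_txfer (term expansions are assumptions)."""
    big_e_offchip = offchip_instructions(k) * e_offchip
    big_e_txfer = outputs_to_host(k) * e_txfer
    return big_e_offchip + e_proc + big_e_txfer


def system_time(k: Kernel, t_offchip_total: float, t_proc: float, t_txfer: float) -> float:
    """T_sys = T_offchip + T_proc + T_txfer, all in host processor cycles."""
    return t_offchip_total + t_proc + outputs_to_host(k) * t_txfer


kernel = Kernel(n_instr=1000, f=0.5, c=1.0, o=0.1)
energy = system_energy(kernel, e_offchip=2.0, e_txfer=10.0, e_proc=400.0)
latency = system_time(kernel, t_offchip_total=3000.0, t_proc=500.0, t_txfer=20.0)
```

With made-up inputs like these, the sketch makes it easy to see how a kernel with a large output fraction o can lose its advantage to transfer cost, which is the kind of screening the discussion slide describes.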
CONCLUSION
• MAHA, a hardware acceleration framework
• Greatly improves energy efficiency for data-intensive applications by
transferring computing kernels to last level of memory
• Design considerations to modify an SLC NAND flash memory for on-demand
reconfigurable computing are presented
• Improvement in energy efficiency
• Better efficiency compared to FPGA & GPU
• Future research efforts can be directed for optimizing the MLB
architecture, interconnect topology & mapper software
REFERENCES
 MAHA: An Energy-Efficient Malleable Hardware Accelerator for Data-
Intensive Applications Somnath Paul, Member, IEEE, Aswin Krishna,
Student Member, IEEE, Wenchao Qian, Student Member, IEEE, Robert
Karam, Student Member, IEEE, and Swarup Bhunia, Senior Member, IEEE
 V. Govindaraju, C.-H. Ko, and K. Sankaralingam, “Dynamically specialized
datapaths for energy efficient computing,” in Proc. IEEE 17th Int.
Symp. High Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503–514
and more....
THANK YOU
QUERIES ????.....

More Related Content

What's hot

White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian... White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
EMC
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
MDC_UNICA
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
Michael Gschwind
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Dhiraj Chaudhary
 
SPE effiency on modern hardware paper presentation
SPE effiency on modern hardware   paper presentationSPE effiency on modern hardware   paper presentation
SPE effiency on modern hardware paper presentation
PanagiotisSavvaidis
 
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESPERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
BUKYABALAJI
 
Greenplum: Driving the future of Data Warehousing and Analytics
Greenplum: Driving the future of Data Warehousing and AnalyticsGreenplum: Driving the future of Data Warehousing and Analytics
Greenplum: Driving the future of Data Warehousing and Analyticseaiti
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
EMC
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
journalBEEI
 
Greenplum Database on HDFS
Greenplum Database on HDFSGreenplum Database on HDFS
Greenplum Database on HDFSDataWorks Summit
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
Prabhat gangwar
 

What's hot (13)

White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian... White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
White Paper: Backup and Recovery of the EMC Greenplum Data Computing Applian...
 
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC DomainReconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
 
SPE effiency on modern hardware paper presentation
SPE effiency on modern hardware   paper presentationSPE effiency on modern hardware   paper presentation
SPE effiency on modern hardware paper presentation
 
Accelerix ISSCC 1998 Paper
Accelerix ISSCC 1998 PaperAccelerix ISSCC 1998 Paper
Accelerix ISSCC 1998 Paper
 
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESPERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
 
Greenplum: Driving the future of Data Warehousing and Analytics
Greenplum: Driving the future of Data Warehousing and AnalyticsGreenplum: Driving the future of Data Warehousing and Analytics
Greenplum: Driving the future of Data Warehousing and Analytics
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
 
Architectures for parallel
Architectures for parallelArchitectures for parallel
Architectures for parallel
 
Greenplum Database on HDFS
Greenplum Database on HDFSGreenplum Database on HDFS
Greenplum Database on HDFS
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
 

Viewers also liked

Liquor detection through Automatic Motor locking system ppt
Liquor detection through Automatic Motor locking system pptLiquor detection through Automatic Motor locking system ppt
Liquor detection through Automatic Motor locking system ppt
Pankaj Singh
 
Automatic room light controller with bidirectional visitor counter
Automatic room light controller with bidirectional visitor counterAutomatic room light controller with bidirectional visitor counter
Automatic room light controller with bidirectional visitor counter
Niladri Dutta
 
Latest ECE Projects Ideas In Various Electronics Technologies
Latest ECE Projects Ideas In Various Electronics TechnologiesLatest ECE Projects Ideas In Various Electronics Technologies
Latest ECE Projects Ideas In Various Electronics Technologies
elprocus
 
Project report on self compacting concrete
Project report on self compacting concreteProject report on self compacting concrete
Project report on self compacting concreterajhoney
 
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Arunkumar Gowdru
 
Schindler case study
Schindler case studySchindler case study
Schindler case study
Rajesh Srivastava
 
wireless charging of mobile phones using microwave full seminar report
wireless charging of mobile phones using microwave full seminar reportwireless charging of mobile phones using microwave full seminar report
wireless charging of mobile phones using microwave full seminar reportHarish N Nayak
 
OLED 2014 PPT
OLED 2014 PPTOLED 2014 PPT
OLED 2014 PPT
Ananthkrishn
 
Automatic irrigation 1st review(ieee project ece dept)
Automatic irrigation 1st review(ieee project ece dept)Automatic irrigation 1st review(ieee project ece dept)
Automatic irrigation 1st review(ieee project ece dept)
Siddappa Dollin
 
Artificial eye
Artificial eyeArtificial eye
Artificial eye
Rakesh Mairembam
 
Wireless charging of mobilephones
Wireless charging of  mobilephonesWireless charging of  mobilephones
Wireless charging of mobilephonesPRADEEP Cheekatla
 

Viewers also liked (11)

Liquor detection through Automatic Motor locking system ppt
Liquor detection through Automatic Motor locking system pptLiquor detection through Automatic Motor locking system ppt
Liquor detection through Automatic Motor locking system ppt
 
Automatic room light controller with bidirectional visitor counter
Automatic room light controller with bidirectional visitor counterAutomatic room light controller with bidirectional visitor counter
Automatic room light controller with bidirectional visitor counter
 
Latest ECE Projects Ideas In Various Electronics Technologies
Latest ECE Projects Ideas In Various Electronics TechnologiesLatest ECE Projects Ideas In Various Electronics Technologies
Latest ECE Projects Ideas In Various Electronics Technologies
 
Project report on self compacting concrete
Project report on self compacting concreteProject report on self compacting concrete
Project report on self compacting concrete
 
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
Embedded System Design Notes written by Arun Kumar G, Associate Professor, De...
 
Schindler case study
Schindler case studySchindler case study
Schindler case study
 
wireless charging of mobile phones using microwave full seminar report
wireless charging of mobile phones using microwave full seminar reportwireless charging of mobile phones using microwave full seminar report
wireless charging of mobile phones using microwave full seminar report
 
OLED 2014 PPT
OLED 2014 PPTOLED 2014 PPT
OLED 2014 PPT
 
Automatic irrigation 1st review(ieee project ece dept)
Automatic irrigation 1st review(ieee project ece dept)Automatic irrigation 1st review(ieee project ece dept)
Automatic irrigation 1st review(ieee project ece dept)
 
Artificial eye
Artificial eyeArtificial eye
Artificial eye
 
Wireless charging of mobilephones
Wireless charging of  mobilephonesWireless charging of  mobilephones
Wireless charging of mobilephones
 

Similar to Maha an energy efficient malleable hardware accelerator for data intensive applications

Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Ahsan Javed Awan
 
Morph : a novel accelerator
Morph : a novel acceleratorMorph : a novel accelerator
Morph : a novel accelerator
BaharJV
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Redis Labs
 
FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...
FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...
FPGA Implementation of High Speed AMBA Bus Architecture for Image Transmissio...
IRJET Journal
 
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...
Maha an energy efficient malleable hardware accelerator for data intensive applications

  • 2. MAHA : An Energy Efficient Malleable Hardware Accelerator For Data Intensive Applications Grace Abraham Roll No: 01 VLSI & ES
  • 3. CONTENTS
      • INTRODUCTION
      • BACKGROUND AND MOTIVATION
      • MAHA - OVERALL APPROACH
      • NAND FLASH – A CASE STUDY
      • SOFTWARE ARCHITECTURE
      • RESULTS
      • CONCLUSION
  • 4. INTRODUCTION
      • In nanometer technologies, power has emerged as the primary design constraint
      • Ever-increasing demand for low power and high performance
      • The Von-Neumann bottleneck (back-and-forth data transfer) is a barrier to performance & energy scaling
      • Explicit parallelism is used to improve efficiency
      • Data transfer from off-chip to on-chip memory incurs overhead:
          - Low bandwidth
          - High latency
          - High energy
  • 5. • To overcome this, a Malleable Hardware Accelerator (MAHA) is introduced
      • MAHA:
          - Implements a reconfigurable computing fabric in the last level of memory
          - Enables computing within off-chip memory
      Fig 1: Von-Neumann bottleneck and proposed MAHA framework
  • 6. • NAND flash technology is chosen for demonstration
      • Previous investigations on processing in memory (PIM)
      • MAHA differs from PIM architectures
          - Achieves on-demand computation through design modifications to the off-chip nonvolatile memory organization
          - High energy efficiency through parallelism & dynamic customization
      • MAHA targets data-intensive applications
      • Area and energy overheads are accurately estimated
      • An efficient software flow for mapping applications to MAHA is presented
  • 7. • The following sections cover
          - The Von-Neumann bottleneck barrier
          - MAHA & its hardware architecture
          - Realization with a CMOS-compatible NAND flash memory
          - Evaluation results for MAHA
  • 8. BACKGROUND & MOTIVATION
      • PERFORMANCE BARRIER DUE TO VON-NEUMANN BOTTLENECK
          - Off-chip bandwidth (BW) scales poorly in comparison to on-chip transistor density
          - On-chip density is likely to improve by 16X from 2011 to 2022
          - Off-chip BW is expected to improve by only 40%
          - BW available inside the flash array is 4.2 × 10^5 GB/s; in contrast, BW at the 16-bit flash interface is only 100 MB/s
      • ENERGY BARRIER FOR DATA-INTENSIVE APPLICATIONS
          - Latency and energy of memory must be managed to achieve energy efficiency
          - To identify the major hurdles to energy scaling:
              o Performance of ten common kernels was simulated
              o System-level performance metrics, such as cache hit/miss frequency, were noted
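The bandwidth gap quoted on this slide can be put into a single number. The sketch below is only arithmetic on the two quoted figures; the function and constant names are illustrative, the internal figure is read as 4.2 × 10^5 GB/s, and 1 GB = 1000 MB is an assumption.

```python
# Figures quoted on the slide; names are illustrative, not from the source.
INTERNAL_BW_GB_S = 4.2e5     # bandwidth available inside the flash array
INTERFACE_BW_MB_S = 100.0    # bandwidth at the 16-bit external flash interface


def bandwidth_gap(internal_gb_s, interface_mb_s):
    """Return the internal-to-external bandwidth ratio (assumes 1 GB = 1000 MB)."""
    return internal_gb_s * 1000.0 / interface_mb_s


ratio = bandwidth_gap(INTERNAL_BW_GB_S, INTERFACE_BW_MB_S)
print(f"internal bandwidth is about {ratio:.1e}x the interface bandwidth")
```

Under these assumptions the array offers several million times more bandwidth internally than the external interface exposes, which is the motivation for computing inside the memory.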
  • 9. • From Table 1:
          o 73% of the total energy expended goes to accesses to the on-chip instruction & data caches
          o 26% is spent on useful computation, including fetch and decode operations
      Table 1: Energy breakdown for a conventional processor executing common computational kernels
  • 10. • MITIGATING VON-NEUMANN BOTTLENECK THROUGH IN-MEMORY COMPUTING
          - 75% of the energy in a processor is dissipated in data transport
          - Optimizing the compute model for data-intensive tasks can yield large improvements in energy efficiency
          - Two implications for the compute model:
              o Relocate compute resources closer to the last level of nonvolatile storage; this minimizes the overhead of data transfer to on-chip execution units
              o Replace the conventional software pipeline & caches with a distributed memory infrastructure; this minimizes memory & interconnect power dissipation
  • 11. MAHA - OVERALL APPROACH
      HARDWARE ARCHITECTURE
      • MAHA is a hardware reconfigurable framework
      • Consists of an array of processing elements (PEs)
      • PEs communicate using a hierarchical interconnect architecture
      • The target application to be mapped is represented as a control & data flow graph (CDFG)
      • The software flow partitions the CDFG into smaller multiple-input multiple-output tasks
      • Tasks are mapped to individual PEs
  • 12. 1) COMPUTE LOGIC
          - Each compute block or PE is referred to as a memory logic block (MLB)
          - A single MLB includes a dense 2D memory array which stores lookup tables and data
          - A custom datapath with arithmetic units
          - A local register file for storing temporary outputs from memory
          - The sequence of operations inside an MLB is controlled by a μ-code controller referred to as a schedule table
      2) INTERCONNECT FABRIC
          - Tasks mapped to different MLBs communicate via a programmable & hierarchical interconnect
          - The interconnect is time-multiplexed & shared among multiple MLBs
  • 13. Fig 2: (a) Application mapping flow for MAHA (b) μ-arch details of a single computing block (MLB) (c) Synchronization among multiple MLBs over shared interconnect
  • 14. • Sig1 & Sig2 are the outputs of MLB A & B at the end of cycle 1
      • Sig3 & Sig4 are the outputs at the end of cycle 2
      • Signals at the end of each cycle are transmitted over the same local/global interconnect to MLB C
      • Significant gains in energy efficiency can be obtained by computing inside the nonvolatile memory (NVM)
      • MAHA is an attractive low-overhead & energy-efficient candidate for in-memory computing
      • In the NVM-based MAHA model:
          o Multiple NVM arrays are grouped to form a single MLB
          o Each MLB processes its local data and communicates with other MLBs
          o Data is distributed to multiple MLBs through the flash translation layer, which maps a logical address to a physical location in the NVM
          o Static CMOS logic is integrated with the NVM to realize an MLB
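The time-multiplexed interconnect described on this slide can be illustrated with a small simulation. This is a minimal sketch, not the MAHA implementation: the function name, the two-lane bus width, and the assumption that Sig3 and Sig4 also come from MLB A and MLB B are all hypothetical.

```python
def time_multiplex(signals_per_cycle, bus_width):
    """Carry a different set of signals over the same bus lanes in each cycle.

    signals_per_cycle: one list per cycle of (source_mlb, signal_name) pairs.
    Returns a trace of (cycle, lane, source_mlb, signal_name) tuples as seen
    by the destination MLB.
    """
    trace = []
    for cycle, signals in enumerate(signals_per_cycle, start=1):
        if len(signals) > bus_width:
            raise ValueError(f"cycle {cycle} needs more lanes than the bus has")
        for lane, (source, name) in enumerate(signals):
            trace.append((cycle, lane, source, name))
    return trace


# Assumed schedule: MLB A and MLB B each drive one signal per cycle to MLB C.
schedule = [
    [("A", "Sig1"), ("B", "Sig2")],  # end of cycle 1
    [("A", "Sig3"), ("B", "Sig4")],  # end of cycle 2
]
trace = time_multiplex(schedule, bus_width=2)
for entry in trace:
    print(entry)
```

Four signals reach MLB C over only two physical lanes, because each lane is reused in the second cycle.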
  • 15. COMPARISON WITH ALTERNATE ACCELERATORS
      • Computing Model
          - Frameworks without inherent hardware support for spatio-temporal computing: FPGA, Chimaera, PipeRench & RaPiD
          - Frameworks that support spatio-temporal execution: MATRIX, MorphoSys
          - MAHA is also a spatio-temporal computing framework
      • Granularity of computations
          - Defined as the width of the smallest PE
          - Based on granularity, frameworks are classified as
              o Fine-grained
              o Coarse-grained
              o Mixed granular
          - MAHA is a mixed granular computing framework
  • 16. • Computing Fabric
          - Hardware accelerators proposed earlier used fine-grained 1-D lookup tables (LUTs)
          - MAHA uses memory for storage & for mapping one or more multiple-input multiple-output LUTs
      • Target Application Domain
          - Hardware accelerators proposed earlier target a wide application space: bit-level computations, signal processing, image processing
          - MAHA improves system energy for a variety of data-intensive applications
  • 17. NAND FLASH – A CASE STUDY
      • Hardware architecture for an off-chip MAHA framework based on a CMOS-compatible single level cell (SLC) NAND flash memory array
      • CMOS compatibility allows
          - Integration of MLB controllers, registers, datapath and programmable interconnect (PI)
          - Realization using CMOS logic
      • SLC is considered because open-source area, power & delay models are available
  • 18. • OVERVIEW OF CURRENT FLASH ORGANISATION
          - NAND flash memory is organised as a flash array plus a number of logic structures
          - For a normal flash read:
              o 8-b or 16-b I/O bandwidth
              o Organized in units of pages & blocks
              o Page size: 2 KB
              o Each block has 64-128 pages
              o The block decoder first selects one of the blocks
              o The page decoder selects one of the pages
              o The content of the entire page is first read into the page register
              o It is then transferred to the flash external interface
      Table 2: Flash organization and performance
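The block-then-page selection described for a normal flash read can be sketched as an address split. This is an illustration only: the linear byte addressing, the function name, and the choice of 64 pages per block (the slide gives a 64-128 range) and 2 KB = 2048 bytes are assumptions.

```python
PAGE_SIZE_BYTES = 2048   # 2 KB page, taken as 2048 bytes (assumption)
PAGES_PER_BLOCK = 64     # slide gives 64-128 pages per block; 64 assumed here


def decode_address(byte_address):
    """Split a linear byte address into (block, page, offset).

    Mirrors the order on the slide: the block decoder picks a block,
    the page decoder picks a page inside it, and the whole page is then
    read into the page register, where `offset` locates the byte.
    """
    page_index, offset = divmod(byte_address, PAGE_SIZE_BYTES)
    block, page = divmod(page_index, PAGES_PER_BLOCK)
    return block, page, offset


# One full block, plus three pages, plus five bytes.
example = PAGE_SIZE_BYTES * PAGES_PER_BLOCK + 3 * PAGE_SIZE_BYTES + 5
print(decode_address(example))
```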
  • 19. Figure 3: Modifications to conventional flash memory to realize MAHA framework. A small control engine outside the memory array is added to initiate & synchronize parallel operations inside the memory array
  • 20. • MODIFICATIONS TO FLASH ARRAY ORGANIZATION
          - Modifications achieve on-demand computation
          - Normal read/write operation is not affected
      1) Compute Logic Modifications
          o A group of N flash blocks is clustered to form a single MLB
          o Within an MLB, blocks are logically divided into LUT blocks & data blocks
          o MLB control logic & custom datapath are implemented using static CMOS logic
          o A custom dual-ported asynchronous-read register file stores intermediate outputs
          o Pass-gate multiplexers & a keeper transistor are used for selecting operands for the LUT
          o For a normal NAND flash read, the entire page (2 KB) is read at once
  • 21.     o For LUT operations, operand sizes are smaller, so a wide read is avoided
          o A narrow-read scheme is proposed for LUT blocks, in which only a fraction of a page is read at a time
          o Word line segmentation adds hardware overhead
          o To minimize this overhead, only 64-b words are read from each block at a time
  • 22.     o Advantage: narrow read improves energy efficiency by lowering word line capacitance
          o Combinational logic is used to switch between narrow read for MAHA operation & full-page read for normal flash operation
          o It is used together with the narrow-read decoder to control the AND gates for segmentation
          o Segmentation for data blocks is coarse: 4096 bits are read out from each page and stored inside buffers
          o A group of such LUT and data blocks constitutes 1 MLB
          o Two planes of the flash array are logically divided into 8 banks, each consisting of 2 MLBs
          o Each MLB contains
              a. 256 blocks of flash memory
              b. 1 LUT block
              c. 255 data blocks
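The organisation figures above can be tallied with a few lines of arithmetic. This is a sketch under stated assumptions: the 8 banks are read as the total across both planes, a page is taken as 2048 bytes, and all function names are illustrative.

```python
def mlb_organisation(banks=8, mlbs_per_bank=2, blocks_per_mlb=256, lut_blocks_per_mlb=1):
    """Count MLBs and flash blocks for the organisation quoted on the slide."""
    mlbs = banks * mlbs_per_bank
    return {
        "mlbs": mlbs,
        "blocks": mlbs * blocks_per_mlb,
        "lut_blocks": mlbs * lut_blocks_per_mlb,
        "data_blocks": mlbs * (blocks_per_mlb - lut_blocks_per_mlb),
    }


def read_fraction(read_bits, page_bytes=2048):
    """Fraction of a full page touched by one segmented read."""
    return read_bits / (page_bytes * 8)


org = mlb_organisation()
print(org)
print("LUT narrow read:", read_fraction(64))      # 64-b word from a LUT block
print("data block read:", read_fraction(4096))    # 4096 bits from a data block
```

With these readings, a 64-b narrow read touches 1/256 of a page and a 4096-bit data read touches a quarter of it, which is where the word line capacitance saving comes from.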
  • 23. Figure 4: Modified flash memory array for on-demand reconfigurable computing. The memory blocks are augmented with local control and compute logic to act as a hardware reconfigurable unit
  • 24. 2) Routing logic modifications
          o Each block communicates with the page register over a shared bus
          o To minimize the inter MLB PI overhead, a set of hierarchical buses with a at each level to select the source of incoming data
          o 4 levels – banks, sub banks, subarrays
      Figure 5: Hierarchical interconnect architecture to connect a group of MLBs
  • 25. SOFTWARE ARCHITECTURE
      • The figure shows application mapping for the proposed acceleration platform
      • The mapper (application mapping tool) was developed in C
      • Key features of the software flow are
      1) Description of the input application using an ISA
          - An instruction set is defined for the proposed MAHA framework that supports common control as well as data flow operations
          - Operation types supported by the software architecture:
              o bitswC
              o bits
              o mult
              o shift and rotate
              o sel
              o complex
              o load & store
  • 26. Figure 6: Application mapping flow for proposed MAHA framework
  • 27. 2) Application mapping to a mixed-granular time-multiplexed computing fabric
          - The mapping process includes 2 key contributions
          1) Decomposition of fine & coarse grained operations
              o During decomposition of a load/store operation, memory is allocated in 1 or more MLBs depending on the address size used for the load/store & the number of data blocks present inside each MLB
          2) Fusing multiple LUT as well as custom datapath operations
              o 3 fusion routines
                  1) Fusion of random LUT-based operations
                  2) Fusion of bit-sliceable operations
                  3) Fusion of custom-datapath operations
              o In all of these, the decomposed CDFG is first partitioned into 1 or more vertices
  • 28. 3) Placement & routing for the hierarchical interconnect model
          - The software tool places the MLBs in a hierarchical fashion such that the number of inputs & outputs crossing each module is minimized
          - In the bi-partitioning approach, MLBs are first allocated to the first-level modules, then distributed among the second-level modules
          - This continues until each MLB has been mapped to the lowermost memory module
          - Routing of signals in the CDFG is performed in the following order
              1) Routing of signals which cross each level of the memory hierarchy
              2) Routing of primary outputs from each MLB for all levels of the cyclic schedule
              3) Routing of primary inputs to each MLB for all levels of the cyclic schedule
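The hierarchical bi-partitioning placement described above can be illustrated with a toy example. This is a brute-force stand-in that is only practical for a handful of MLBs, not the algorithm used by the MAHA mapping tool; the function names, the balanced-split rule and the example graph are all assumptions.

```python
from itertools import combinations


def cut_size(part, edges):
    """Number of edges with exactly one endpoint inside `part`."""
    part = set(part)
    return sum(1 for u, v in edges if (u in part) != (v in part))


def bipartition(nodes, edges):
    """Balanced split of `nodes` that minimises the signals crossing the cut."""
    nodes = sorted(nodes)
    node_set = set(nodes)
    internal = [(u, v) for u, v in edges if u in node_set and v in node_set]
    half = len(nodes) // 2
    best_left, best_cost = None, None
    # Pin the first node to the left half so mirrored splits are not repeated.
    for rest in combinations(nodes[1:], half - 1):
        left = (nodes[0],) + rest
        cost = cut_size(left, internal)
        if best_cost is None or cost < best_cost:
            best_left, best_cost = left, cost
    right = tuple(n for n in nodes if n not in best_left)
    return best_left, right


def place(nodes, edges, levels):
    """Recursively assign MLBs to modules, one hierarchy level per split."""
    if levels == 0 or len(nodes) <= 1:
        return sorted(nodes)
    left, right = bipartition(nodes, edges)
    return [place(left, edges, levels - 1), place(right, edges, levels - 1)]


# Hypothetical MLB connectivity taken from a partitioned CDFG.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
placement = place(["A", "B", "C", "D"], edges, levels=1)
print(placement)
```

Each extra level repeats the split inside every module, so closely connected MLBs end up in the same lower-level module and fewer signals need the upper-level buses.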
  • 29. 4) Functional validation of the proposed framework
          - The bit file generation routine accepts the placed & routed netlist & generates the control or select bits for the following
              1) Configuration for programmable switches
              2) Schedule table entries which control the sequence of operations inside each MLB
              3) LUT entries to be loaded into the function table
          - The bit file generated by the tool can be directly loaded into the function table
  • 30. RESULTS
      A. Design space exploration for MAHA
          - Estimate the design overhead for the entire MLB as well as for the inter-MLB PI
          - Map the benchmark applications to the MAHA framework
          - Calculate the area overhead, performance, and energy requirements for each configuration & select the best configuration
      B. Energy, performance, and overhead estimation
          - Cycle time of 20 ns for MAHA operation = bitline precharge time (12 ns) + intra-MLB delay (3 ns) + inter-MLB signal propagation time (5 ns)
          - Area of a single block of the flash array = 5 × F^2 × (Npages) × (pagesize)
          - Since the LUT block is separate from the data blocks, its area overhead is different
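The two estimates on this slide are simple enough to check numerically. In the sketch below the cycle-time components are the slide's figures, while the example feature size (45 nm), the 64 pages per block and the reading of pagesize as bits per page (one SLC cell per bit) are assumptions made only to have numbers to plug in.

```python
def maha_cycle_time_ns(precharge_ns=12, intra_mlb_ns=3, inter_mlb_ns=5):
    """Cycle time = bitline precharge + intra-MLB delay + inter-MLB propagation."""
    return precharge_ns + intra_mlb_ns + inter_mlb_ns


def flash_block_area_nm2(feature_nm, pages_per_block, page_bits):
    """Block area = 5 * F^2 * Npages * pagesize, with pagesize taken in bits."""
    return 5 * feature_nm ** 2 * pages_per_block * page_bits


cycle_ns = maha_cycle_time_ns()
# Hypothetical example values: F = 45 nm, 64 pages, 2 KB (16384-bit) pages.
area_nm2 = flash_block_area_nm2(feature_nm=45, pages_per_block=64, page_bits=2048 * 8)
print(cycle_ns, "ns per MAHA cycle")
print(area_nm2 / 1e12, "mm^2 per block")
```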
  • 31. The parameters noted are:
      o Area overhead
      o Latency
      o Number of MLBs required to map the application
      o Total energy dissipation in the MLBs
      o Area and energy for the inter-MLB PI
      o Size of the reconfiguration data
      C. Selection of optimal MAHA configuration
      o Final configuration
      Figure 7: (a) Relative contribution of different components to the total area of the modified flash (b) Relative contribution of memory and logic components
  • 32. D. Energy and performance for mapped applications
      • Mapping results are reported for a single CDFG instantiation of each selected benchmark, mapped to the final MAHA hardware configuration
      • For MAHA, the average PI energy is lower than the average MLB logic energy
  • 33. E. Comparison with a conventional GPP
      1) Reduction in on-chip and off-chip communication
      2) Improvement in execution latency
      3) Improvement in energy
      4) Improvement in EDP
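EDP is the energy-delay product, so an improvement in both energy and latency multiplies in this metric. A minimal sketch follows; the function names are illustrative and the numbers in the usage lines are hypothetical, not results from the presentation.

```python
def edp(energy, latency):
    """Energy-delay product: energy multiplied by execution latency."""
    return energy * latency

def improvement(baseline, candidate):
    """How many times smaller the candidate metric is than the baseline."""
    return baseline / candidate

# Hypothetical figures in arbitrary units, for illustration only.
gpp_energy, gpp_latency = 10.0, 4.0
maha_energy, maha_latency = 2.0, 2.0

print("energy improvement :", improvement(gpp_energy, maha_energy))
print("latency improvement:", improvement(gpp_latency, maha_latency))
print("EDP improvement    :", improvement(edp(gpp_energy, gpp_latency),
                                          edp(maha_energy, maha_latency)))
```

With these made-up inputs a 5x energy gain and a 2x latency gain combine into a 10x EDP gain.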
  • 34. F. Comparison with FPGA and GPU
      • On average, MAHA improves the energy requirement by 74% and 84% over the FPGA and GPU frameworks, respectively
      • MAHA eliminates the high energy overhead of transferring data from off-chip memory to the FPGA or GPU
      G. Hardware-emulation-based validation
      • An FPGA-based emulation framework was developed, which validates:
        1) Functionality and synchronization of multiple MLBs for several application kernels
        2) Interfacing of the MAHA framework with the host processor
      • The emulation framework consists of two FPGA boards: a DE0, running a host CPU, and a DE4, consisting of three main components
  • 35. o MAHA framework
      o Flash controller
      o On-board flash memory
      • The two boards communicate over a 3-wire SPI in a simple master/slave configuration
      • The slave queries the flash for all available kernels and, upon finding a match, begins transferring the configuration bits and the data for processing to the MAHA framework
      • If no match is found, the slave immediately responds with an error code
      • Otherwise, the slave will only interrupt the host CPU
  • 36. Figure 8: (a) Overview of off-chip acceleration with the MAHA framework (b) System architecture of the FPGA-based hardware emulation framework (c) Improvement in latency and energy with MAHA-based off-chip acceleration
  • 37. DISCUSSION
      • Before mapping a kernel to an in-memory accelerator, key application and system primitives can be used to determine whether it will benefit from in-memory acceleration. These are listed below:
        1) g: fraction of total instructions with a memory reference (loads and stores)
        2) f: fraction of total instructions transferred to a compute engine
        3) c: fraction of instructions translated from the host's ISA to the ISA of the off-chip compute framework
        4) o: fraction of original instructions that result in an output. A fraction f × c × o thus produces outputs, which need to be transferred to the host processor
        5) eoffchip: average energy per instruction in the off-chip compute engine
        6) etxfer: energy expended in the transfer of an output from the off-chip framework to the host processor
        7) toffchip: ratio of the cycle time of the off-chip compute framework to that of the host processor
        8) n: fraction of speedup due to parallelism in the framework
        9) ttxfer: time taken, in processor clock cycles, to transfer an output from the off-chip compute framework to the host processor
  • 38. • Tsys = Toffchip + Tproc + Ttxfer
      • Esys = Eoffchip + Eproc + Etxfer
      Figure 9: Energy and performance for a hybrid system with a host processor and an off-chip memory-based hardware accelerator
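The two sums above can be turned into a rough back-of-the-envelope model using the parameters from the discussion slide. The slides give only the top-level sums, so the way each term is expanded below is an assumption made for illustration: one host cycle per host instruction, n treated as a speedup factor, a hypothetical per-instruction host energy `e_proc` that is not among the listed parameters, and g left out entirely.

```python
from dataclasses import dataclass

@dataclass
class KernelProfile:
    instructions: float  # total host instructions in the kernel (hypothetical input)
    f: float             # fraction of instructions moved to the off-chip engine
    c: float             # host-ISA to off-chip-ISA translation ratio
    o: float             # fraction of off-chip instructions producing an output
    t_offchip: float     # off-chip cycle time relative to the host cycle time
    n: float             # speedup from parallelism (treated as a factor here)
    t_txfer: float       # host cycles needed to transfer one output
    e_offchip: float     # energy per off-chip instruction
    e_txfer: float       # energy per transferred output
    e_proc: float        # ASSUMED host energy per instruction (not in the slides)

def system_time(p):
    """Tsys = Toffchip + Tproc + Ttxfer, in host cycles (assumed expansion)."""
    offchip_instr = p.instructions * p.f * p.c
    t_proc = p.instructions * (1 - p.f)           # assumes 1 cycle per instruction
    t_off = offchip_instr * p.t_offchip / p.n
    t_tx = offchip_instr * p.o * p.t_txfer
    return t_off + t_proc + t_tx

def system_energy(p):
    """Esys = Eoffchip + Eproc + Etxfer (assumed expansion)."""
    offchip_instr = p.instructions * p.f * p.c
    e_proc = p.instructions * (1 - p.f) * p.e_proc
    e_off = offchip_instr * p.e_offchip
    e_tx = offchip_instr * p.o * p.e_txfer
    return e_off + e_proc + e_tx

# Hypothetical profile, for illustration only.
profile = KernelProfile(instructions=1000, f=0.5, c=1.0, o=0.25,
                        t_offchip=4, n=8, t_txfer=4,
                        e_offchip=0.25, e_txfer=2, e_proc=1.0)
print(system_time(profile), system_energy(profile))
```

Under these assumptions, offloading pays off when the parallel speedup n outweighs the slower off-chip cycle (toffchip) and the output fraction o stays small enough that transfer time and energy do not dominate.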
  • 39. CONCLUSION
      • MAHA is a hardware acceleration framework
      • It greatly improves energy efficiency for data-intensive applications by transferring the computing kernel to the last level of memory
      • Design considerations for modifying an SLC NAND flash memory for on-chip reconfigurable computing are presented
      • Improvement in energy efficiency
      • Better efficiency compared to FPGA and GPU
      • Future research efforts can be directed toward optimizing the MLB architecture, interconnect topology, and mapper software
  • 40. REFERENCES
      • S. Paul, A. Krishna, W. Qian, R. Karam, and S. Bhunia, “MAHA: An Energy-Efficient Malleable Hardware Accelerator for Data-Intensive Applications”
      • V. Govindaraju, C.-H. Ko, and K. Sankaralingam, “Dynamically specialized datapaths for energy efficient computing,” in Proc. IEEE 17th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2011, pp. 503–514
      • and more
  • 41. THANK YOU
  • 42. QUERIES?