Fast and energy-efficient eNVM based memory organisation at L3-L1 layers for IoT Computing/Routing platforms: main challenges

Lecture given by Prof. Francky Catthoor on 19 November 2015 as part of the postgraduate lecture series.

  1. © IMEC 2015 F.CATTHOOR. FAST AND ENERGY-EFFICIENT ENVM-BASED MEMORY ORGANISATION AT L3-L1 LAYERS FOR IOT COMPUTING/ROUTING PLATFORMS: MAIN CHALLENGES. Francky Catthoor, Nov. 2015. With input from esp. Praveen Raghavan, Jan Van Houdt, Stefan Cosemans, Matthias Hartmann and other IMEC colleagues. With use of MSc and PhD thesis results in cooperation with memory-organisation teams at IMEC, NTUA and UC Madrid. Also based on ULP-DSIP PhD team work.
  2. Secure, trustworthy computing and communication embedded in every-thing and every-body. A pervasive, context-aware ambient IoT environment, sensitive and responsive to the presence of people.
  3. SYSTEM CLASSES FOR IOT ENVIRONMENT. [Figure: system classes ranging from stationary cloud/fog servers and offices (1 Top/s to 100 Gop/s, mains "Watt" power, Gb/s links over Internet/IPv6) through nomadic and home-gateway devices (10 Gop/s, battery "Milliwatt" power, Mb/s links: UMTS, WLAN, WPAN, MIMO) down to sensor-network and body/ambient nodes (10 Mop/s, ambient "Microwatt" energy, kb/s links: WBAN); example devices include PDA, DSC, DVC, MP3, GPS and hearing/vision aids ("Hear, See, Feel, Show"), spanning "More Moore" and "More-than-Moore" technologies.] Courtesy: Hugo De Man, ISSCC 2005.
  4. CURRENT PLATFORM ARCHITECTURES: THE ENERGY-FLEXIBILITY CONFLICT. Courtesy: Engel Roza (Philips). Goal: programmable DSIP and configurable CGA as good as ASIC. Note: more than 1000 MOPS/mW is reachable in the 45 nm node due to subword lengths smaller than 32 bit and non-standard-cell-based layout schemes for critical components.
  5. FOCUS HERE ON THE IOT GATEWAY-MICROSERVER, a growing part of applications and of the market.
  6. RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY. Some of the research published at Google is available online: http://research.google.com/pubs/HardwareandArchitecture.html; the same holds for Facebook: https://research.facebook.com/publications/systems/. There are also many academic papers in this direction, but most of them focus on performance rather than energy efficiency. And the ones which do focus on energy are too disruptive for industry in the short/mid term, because they change the entire application software stack.
  7. RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY. Google: Proc. ISCA, 2014, Towards Energy Proportionality for Large-Scale Latency-Critical Workloads. Abstract: Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity render existing power management techniques ineffective at reducing WSC energy use. We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings. => Interesting, but they only appear to gain 20% this way, so some bottlenecks are clearly not addressed yet.
  8. RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY. Google: ISCA 2011, The Impact of Memory Subsystem Resource Sharing on Datacenter Applications. Abstract: In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-to-core mappings for applications in the datacenter, as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal. => Interesting!
  9. RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY. Facebook: Characterizing Load Imbalance in Real-World Networked Caches, Qi Huang e.a., HotNets 2014: Thirteenth ACM Workshop on Hot Topics in Networks, October 27, 2014. Abstract: Modern Web services rely extensively upon a tier of in-memory caches to reduce request latencies and alleviate load on backend servers. Within a given cache, items are typically partitioned across cache servers via consistent hashing, with the goal of balancing the number of items maintained by each cache server. Effects of consistent hashing vary by associated hashing function and partitioning ratio. Most real-world workloads are also skewed, with some items significantly more popular than others. Inefficiency in addressing both issues can create an imbalance in cache-server loads. We analyze the degree of observed load imbalance, focusing on read-only traffic against Facebook's graph cache tier in Tao. We investigate the principal causes of load imbalance, including data co-location, non-ideal hashing scenarios, and hot-spot temporal effects. We also employ trace-driven analytics to study the benefits and limitations of current load-balancing methods, suggesting areas for future research. => this analysis looks very interesting
  10. RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY. Facebook: Fastpass: A Centralized "Zero-Queue" Datacenter Network, Jonathan Perry e.a., ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), August 18, 2014. Abstract: Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control to a centralized arbiter of when each packet should be transmitted and what path it should follow. => Interesting analysis!
  11. RELATED (INDUSTRIAL) R&D ON MICROSERVER MEMORY HIERARCHY. Luiz André Barroso, Jimmy Clidaras, Urs Hölzle, Morgan & Claypool Publishers (2013): The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition. Abstract: As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board. Notes for the Second Edition: After nearly four years of substantial academic and industrial developments in warehouse-scale computing, we are delighted to present our first major update to this lecture. The increased popularity of public clouds has made WSC software techniques relevant to a larger pool of programmers since our first edition. Therefore, we expanded Chapter 2 to reflect our better understanding of WSC software systems and the toolbox of software techniques for WSC programming. In Chapter 3, we added to our coverage of the evolving landscape of wimpy vs. brawny server trade-offs, and we now present an overview of WSC interconnects and storage systems that was promised but lacking in the original edition. Thanks largely to the help of our new co-author, Google Distinguished Engineer Jimmy Clidaras, the material on facility mechanical and power distribution design has been updated and greatly extended (see Chapters 4 and 5). Chapters 6 and 7 have also been revamped significantly. We hope this revised edition continues to meet the needs of educators and professionals in this area. => Very interesting overview of SotA
  12. ENERGY BREAKDOWN FOR A TYPICAL SERVER. [Chart: per-component energy breakdown, with the uProc highlighted.] Source: Barroso e.a., The Datacenter as a Computer, Google, 2013.
  13. TYPICAL APPLICATION: ENERGY BREAKDOWN FOR EMBEDDED PLATFORMS. [Charts: non-optim.; non-optim. (no adv. DTSE); optim. ITSE mapping (no ITSE trafo yet).] Examples: MPEG2 decoding on TI C6x VLIW-DSP [Lambrechts, ASAP05]; WLAN activity on the IMEC BOADRES coarse-grain array; audio processing on ARM A19 [Carroll'10]: ~65% CPU, ~25% DM, ~10% IM; FFT processing on IMEC-BLOX [CSI'13]: ~50% proc, ~50% DM.
  14. DECODER + WORDLINE CONTRIBUTE SIGNIFICANTLY TO MEMORY DELAY AND ENERGY: WIDE WORD ACCESS. For a typical small SRAM (<64 kb) sized in the conventional way, the breakdown of delay and energy is: ▸ decoder + wordline contribute nearly 60% of SRAM delay; ▸ decoder + wordline contribute about 40-50% of SRAM energy. Source: ▸ energy: Evans et al., J. Solid-State Circ., 1995; ▸ delay: Horowitz et al., Trans. Solid-State Circ., 2002. [Pie charts: SRAM delay breakdown (decoder+WL vs. the rest); SRAM energy breakdown (decoder+WL vs. the rest).]
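To see why wide-word access attacks this bottleneck, here is a back-of-the-envelope model (my own illustrative arithmetic, not from the slide). If one decoder + wordline activation delivers W datapath words instead of one, its cost is amortized over those W words:

$$ E_{\mathrm{word}} \;\approx\; \frac{E_{\mathrm{dec+WL}}}{W} + E_{\mathrm{rest}} \qquad\Rightarrow\qquad \frac{0.45}{10} + 0.55 \;\approx\; 0.60 $$

i.e. with the slide's ~45% decoder + wordline share and W = 10 (the VWR ratio of slide 16), per-word dynamic energy drops to roughly 60% of a conventional access, before any further circuit optimization.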
  15. PROPOSED DSIP ARCHITECTURE TEMPLATE EXPLOITING WIDE L1D/L1I MEMORY ACCESS. [Block diagram of one tile/node in a platform: complex FUs (Complx FU1/FU2) with an SWP shifter and VWRs, a wide scratchpad (Level-1 DM), AGU and LD/ST units, a programmable DMA to external memory (SDRAM), a Level-1 I-cache and Level-0 instruction memory (Level-0 DM), and distributed loop buffers (VWR LB, DP LB, MMU LB, DMA LB); three access widths: Width-1, Width-2 (very wide word), Width-3 (datapath word).] F. Catthoor e.a., ULD DSIPs, Springer book 2010.
  16. REGISTER FILE VS VERY WIDE REGISTER (VWR). [Figure: a register file with Nwords entries, Bit-out output width and Nports ports, next to a row of VWRs.] VWR width = Nwords * Bit-out / Nports = very wide word = 960 bits; Bit-out = datapath word = 96 bits. The number of bits stored in the register file and in all the VWRs combined is equal.
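Plugging the slide's numbers into its own formula (a quick sanity check, no new data):

$$ \text{VWR width} \;=\; \frac{N_{\mathrm{words}} \cdot \mathrm{Bit}_{\mathrm{out}}}{N_{\mathrm{ports}}} \;=\; 960 \text{ bits}, \qquad \mathrm{Bit}_{\mathrm{out}} = 96 \text{ bits} \;\Rightarrow\; \frac{N_{\mathrm{words}}}{N_{\mathrm{ports}}} \;=\; 10 $$

so each very wide word holds ten datapath words per port, which is exactly the amortization factor W used in the sketch after slide 14.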
  17. MOTIVATION FOR VWR VS RF. Assumptions: ▸ 8-bit out, 64-bit-wide VWR; ▸ same storage for (multiple) VWRs and (single) RF; ▸ same total number of ports. Conclusions: ▸ the VWR reduces the complexity of the decoder and the net capacitive load at the RF drivers, and is hence always better than the RF; ▸ but it adds complexity in the compiler. [Figure 1: same memory footprint. Figure 2: 8-bit activity.]
  18. NONVOLATILE OFF-CHIP MEMORY ROADMAP. [Roadmap chart: cell size (1x-8x in F²) versus (eq.) technology node F (180-65 nm and beyond) for NAND-Flash (FG), NOR-Flash (FG/NROM), TANOS, PCM, RRAM and 3D-NAND Flash, split into code vs. data memories and evolutionary vs. disruptive vs. production status.] Copyright: Jan Van Houdt, IMEC, 2010. Conclusion: the main focus is now on stand-alone applications.
  19. VARIOUS POSSIBLE CACHE MEMORIES TO REPLACE IN HIGH-SPEED, LOW-EDYN EMBEDDED SOC PLATFORMS; EACH MEMORY NEEDS A UNIQUE POLICY FOR SRAM REPLACEMENT. [SoC diagram: two ARM A-15 cores with L1D/L1I, an MPEG4 accelerator with local memories, a shared L2 memory, an LTE receiver with L1D/L1I, and a Turbo decoder with HARQ memory.] Instruction memory can exploit non-volatility; high-speed data needs efficient latency masking; slower memories need just enough masking and could handle replacement with some mitigation.
  20. DISTRIBUTED INSTRUCTION MEMORY HIERARCHY.
      Original code:
        for (i=0; i<n; i++)
          for (j=0; j<Ntaps; j++)
            C[i] += A[j]*B[i+j];
      LB code for DP (the tile loop runs Ntaps/WS times; the Rem = Ntaps%WS leftover taps are handled separately):
        for (i=0; i<n; i++)
          for (j=0; j<Ntaps/WS; j++)
            for (k=0; k<WS; k++)
              C[i] += A[j*WS+k]*B[i+j*WS+k];
          /* remainder: C[i] += A[p]*B[i+p] for the last Rem taps */
      LB code for MMU:
        for (i=0; i<n; i++)
          for (j=0; j<Ntaps/WS; j++)
            Load A[j*WS .. j*WS+WS-1]; Load B[i+j*WS .. i+j*WS+WS-1];
          Store C[i];
      LB code for VWR MUX:
        for (i=0; i<n; i++)
          for (j=0; j<Ntaps/WS; j++)
            for (k=0; k<WS; k++)
              Select Mux of A, B = k;
          /* remainder: Select Mux of A, B for the Rem taps */
          Select Mux for C = i;
      [Datapath figure: complex FU with SWP shifter, VWRs and wide scratchpad, AGU and LD/ST, each unit with its own loop buffer (VWR LB, DP LB, AGU LB); Width-2 = very wide word, Width-3 = datapath word.]
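The loop-buffer fragments above are per-unit views of one and the same FIR loop nest. A self-contained C version (WS and the array sizes are illustrative values of mine; the real code targets the DSIP datapath, not a host CPU) showing that the tiling plus remainder handling reproduces the original computation:

```c
#include <stdio.h>

#define WS    4          /* datapath words per very wide word (illustrative) */
#define NTAPS 10         /* deliberately not a multiple of WS */
#define N     8          /* number of FIR outputs */

int main(void) {
    int A[NTAPS], B[N + NTAPS], C[N] = {0}, Cref[N] = {0};
    for (int p = 0; p < NTAPS; p++)     A[p] = p + 1;
    for (int p = 0; p < N + NTAPS; p++) B[p] = p;

    /* Reference: the original code from the slide. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < NTAPS; j++)
            Cref[i] += A[j] * B[i + j];

    /* Tiled version matching the DP loop-buffer code: full WS-wide
       tiles first, then the Rem = NTAPS % WS leftover taps. */
    int rem = NTAPS % WS;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < NTAPS / WS; j++)
            for (int k = 0; k < WS; k++)
                C[i] += A[j * WS + k] * B[i + j * WS + k];
        for (int p = NTAPS - rem; p < NTAPS; p++)   /* remainder taps */
            C[i] += A[p] * B[i + p];
    }

    for (int i = 0; i < N; i++)
        printf("C[%d]=%d ref=%d\n", i, C[i], Cref[i]);
    return 0;
}
```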
  21. INST MEMORY HIERARCHY: EXECUTING LOOPS IN PARALLEL. State-of-the-art loop controllers are centralized; we have a distributed approach: ▸ loop cache, loop counters, etc. [Bajwa et al.]; ▸ some commercial (embedded) processors. Simultaneous multi-threaded architectures and multi-processor approaches have high hardware overhead. Software control and control-flow extraction in the compiler are not handled by any known state-of-the-art. [Bar chart: energy (J, 0 to 2.5e-8) for register file, instruction memory, data memory and data path, VLIW vs. FEENECS; 23x gain = much lower instruction count + a distributed LB for each unit.] F. Catthoor e.a., ULD DSIPs, Springer book 2010.
  22. MODIFIED I-CACHE ORGANIZATION. [Diagram: processor, selector, NVM IL1 plus EMSHR, L2 cache.] • Novel I-cache configuration to address the write latency and write energy issues of eNVM options like STT-MRAM. • EMSHR as a fully-associative buffer with few entries. • Block promotion to the IL1 based on a threshold; a minimal sketch of this policy follows below. M. Komalan e.a., DATE'2014.
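To make the threshold idea concrete, a minimal sketch, assuming a per-block reuse counter and treating the EMSHR as a small staging buffer (the entry count, the counter policy and all names are my assumptions; the DATE'14 paper defines the real mechanism):

```c
#define EMSHR_ENTRIES 8      /* "few entries"; size assumed */
#define LINE_BYTES    64

typedef struct {
    unsigned long tag;
    int           valid;
    int           refs;                  /* hits seen while parked here */
    unsigned char line[LINE_BYTES];
} emshr_entry_t;

static emshr_entry_t emshr[EMSHR_ENTRIES];
static int threshold = 4;                /* the slides sweep 4 / 8 / 12 */

/* On an IL1 miss: serve from the EMSHR if present and promote the block
   into the NVM IL1 only once it proves hot enough; cold blocks never pay
   the expensive NVM write. il1_install/l2_fill are platform hooks. */
const unsigned char *fetch_block(unsigned long tag,
        void (*il1_install)(unsigned long, const unsigned char *),
        void (*l2_fill)(unsigned long, unsigned char *))
{
    for (int i = 0; i < EMSHR_ENTRIES; i++) {
        if (emshr[i].valid && emshr[i].tag == tag) {
            if (++emshr[i].refs >= threshold) {
                il1_install(tag, emshr[i].line);   /* one amortized NVM write */
                emshr[i].valid = 0;
            }
            return emshr[i].line;
        }
    }
    int v = 0;                                     /* free (or victim) slot */
    for (int i = EMSHR_ENTRIES - 1; i >= 0; i--)
        if (!emshr[i].valid) v = i;
    l2_fill(tag, emshr[v].line);                   /* stage the block here */
    emshr[v].tag = tag; emshr[v].valid = 1; emshr[v].refs = 1;
    return emshr[v].line;
}
```

A lower threshold promotes sooner (better hit latency, more NVM writes); a higher one filters harder. That is the trade-off swept in the next two slides.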
  23. PERFORMANCE RESULTS. [Bar chart: performance penalty (%) on a scale of 84-104 for thresholds 12, 8 and 4, for the modified NVM I-cache with 64 KB capacity.] Performance penalty eliminated; better performance than an SRAM-based instruction memory.
  24. ENERGY IMPROVEMENTS. [Bar chart: relative energy (%) on a scale of 50-85 for thresholds 12, 8 and 4, for the modified NVM I-cache with 64 KB capacity.] Factor 1.5 in energy for STT-MRAM based instruction memories. NVM model still under investigation! Reloading and scenarios not taken into account (currently only static mode). M. Komalan e.a., DATE'2014.
  25. VARIOUS POSSIBLE CACHE MEMORIES TO REPLACE IN HIGH-SPEED, LOW-EDYN EMBEDDED SOC PLATFORMS; EACH MEMORY NEEDS A UNIQUE POLICY FOR SRAM REPLACEMENT. (Recap of slide 19.)
  26. [Chart: performance penalty (%).]
  27. [Chart: performance penalty (%).]
  28. ARCH AND ACCESS PATTERN OPTIMIZATIONS. Modified VWR (2 Kbit). ▸ Two VWRs: ping-pong of data, which helps mitigate the performance issues due to the read delay of STT; see the double-buffering sketch below. ▸ Code transformations (vectorization, prefetching, instruction rescheduling) and some built-in compiler optimizations. M. Komalan e.a., DATE'2015.
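An illustrative host-side sketch of the ping-pong schedule (the VWR "interface" below is a stand-in of mine; on the real platform the two VWRs are hardware registers filled by wide STT-MRAM reads, and the compiler produces this overlap):

```c
#include <stdio.h>
#include <string.h>

#define VWR_BITS  2048                      /* 2 Kbit VWR, per the slide */
#define VWR_WORDS (VWR_BITS / 32)           /* 64 datapath words of 32 bit */

/* Host-side stand-in for the two VWR banks. */
static int vwr[2][VWR_WORDS];

static void vwr_load(int bank, const int *src)   /* models one wide STT read */
{
    memcpy(vwr[bank], src, sizeof(vwr[bank]));
}

/* Ping-pong schedule: while the datapath consumes VWR[cur], the next wide
   read targets the other bank, hiding the STT read delay behind compute. */
static int sum_stream(const int *data, int nblocks)
{
    int sum = 0, cur = 0;
    vwr_load(cur, data);                          /* prefetch first block */
    for (int b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)                      /* issue next read early */
            vwr_load(cur ^ 1, data + (b + 1) * VWR_WORDS);
        for (int k = 0; k < VWR_WORDS; k++)       /* compute on current bank */
            sum += vwr[cur][k];
        cur ^= 1;                                 /* swap roles of the VWRs */
    }
    return sum;
}

int main(void)
{
    int data[4 * VWR_WORDS];
    for (int i = 0; i < 4 * VWR_WORDS; i++) data[i] = 1;
    printf("sum = %d (expect %d)\n", sum_stream(data, 4), 4 * VWR_WORDS);
    return 0;
}
```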
  29. [Chart: performance penalty (%).]
  30. MEMORY RESOURCE MANAGEMENT: OVERALL VIEW. [Diagram: applications (scalable 3D graphics, wireless network, video codec) on top of embedded software; a middleware memory resource manager/TDTSE embedded in the memory IP module; embedded hardware: PE1-PE8 (RISC, VLIW, SIMD, ASIC; all PEs have local L1 SRAM) on an interconnect to SDRAM (L2 + main); design-time vs. run-time split.] Given: ▸ set of active tasks and their data requirements; ▸ tasks' metadata; ▸ constraints like deadline & throughput; ▸ objective (e.g. minimize energy consumption). Decides: ▸ where to access (data-to-resource assignment); ▸ when to access (scheduling); ▸ how to access (with what memory "configuration"). Also referred to as "task-level data access scheduling/mapping". Proposal: use a scenario-based run-time scheme embedded inside the memory organization (no application code or processor changes); a data-structure sketch of this interface follows below. D. Atienza e.a., Springer book'14.
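A hypothetical encoding of the manager's inputs ("Given") and outputs ("Decides"), with a toy greedy "where" decision; every name, field and number here is illustrative, not from the slide or the book:

```c
#include <stdio.h>

typedef struct { int id; long bytes; long deadline_us; } task_req_t;
typedef struct { int id; int where; long when_us; int how_cfg; } plan_t;

#define N_MEM 3
static long mem_free[N_MEM]   = { 32768, 32768, 1 << 20 };  /* L1 SRAMs + SDRAM */
static int  mem_energy[N_MEM] = { 1, 1, 8 };                /* relative E/access */

/* Greedy "where" decision: cheapest memory that still fits the task's data.
   A real manager would co-optimize "when" (scheduling against deadline_us)
   and "how" (memory configuration), which this sketch omits. */
static plan_t decide(task_req_t t, long now_us)
{
    plan_t p = { t.id, N_MEM - 1, now_us, 0 };    /* default: SDRAM */
    for (int m = 0; m < N_MEM; m++)
        if (mem_free[m] >= t.bytes && mem_energy[m] < mem_energy[p.where])
            p.where = m;
    mem_free[p.where] -= t.bytes;
    return p;
}

int main(void)
{
    task_req_t tasks[] = { {1, 16384, 1000}, {2, 30000, 2000}, {3, 500000, 5000} };
    for (int i = 0; i < 3; i++) {
        plan_t p = decide(tasks[i], 0);
        printf("task %d -> mem %d\n", p.id, p.where);
    }
    return 0;
}
```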
  31. DATA MANAGEMENT FLOW. [Flow: dynamic data type exploration (DDT transformation & refinement) → concrete data types → dynamic memory management refinement → virtual memory segments → physical memory management refinement → physical memories.]
  32. WHY IS IT IMPORTANT? DYNAMIC DATA SET IN AN ATM SWITCH. [Figure: ATM cells arriving on port 1 at 155 Mb/s through ATM MUXes feed a table of active connections; each entry is keyed by Key1 (VPI[8]), Key2 (VCI[16]) and Key3 (port[8]), combined into one {VPI, VCI, port}[32] record, illustrated with sample bit patterns.] Table size without optimization: 16,284 Mbytes.
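The slide's connection record packs VPI (8 bit), VCI (16 bit) and port (8 bit) into one 32-bit key; a small sketch of that packing (the field layout within the word is my assumption):

```c
#include <stdint.h>
#include <stdio.h>

typedef uint32_t conn_key_t;   /* the {VPI, VCI, port}[32] record */

static conn_key_t make_key(uint8_t vpi, uint16_t vci, uint8_t port)
{
    /* VPI in the top byte, VCI in the middle 16 bits, port in the low byte */
    return ((conn_key_t)vpi << 24) | ((conn_key_t)vci << 8) | port;
}

int main(void)
{
    /* e.g. VPI=2, VCI=5, port=1 -> one compact key for the lookup table */
    conn_key_t k = make_key(2, 5, 1);
    printf("key = 0x%08x\n", (unsigned)k);   /* prints 0x02000501 */
    return 0;
}
```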
  33. DATA MANAGEMENT RESULTS FOR ATM PROTOCOL MODULE IN ADAPTATION LAYER. [Same flow as slide 31: DDT refinement on concrete data types → dynamic memory management refinement on virtual memory segments & pools → physical memory management refinement on physical memories.] Factor 5 fewer accesses (and energy); factor 3 less energy; factor 2 fewer memory ports for the same cycle budget (throughput). D. Atienza e.a., Springer book'14.
  34. Global data management design flow for dynamic concurrent tasks with data-dominated behaviour. [Flow: concurrent OO spec → data type exploration (dynamic data types: key/data binary tree (BT), sub-pool per size, free blocks) → virtual memory management → task concurrency management → physical memory management (memory allocation & assignment) → address optimization → SW/HW co-design, with a SW design flow (processor, memories) and a HW design flow (management unit, memory controller, ASUs).] D. Atienza e.a., Springer book'14.
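The "sub-pool per size / free blocks" box is the classic fixed-size pool with a free list; a minimal sketch (block size, pool size and names are illustrative, not the book's code):

```c
#include <stdio.h>
#include <stddef.h>

#define BLOCK_SIZE 32
#define NBLOCKS    128

/* Each free block doubles as a free-list link; in use, it holds data. */
typedef union block { union block *next; unsigned char data[BLOCK_SIZE]; } block_t;

static block_t  pool[NBLOCKS];
static block_t *free_list;

static void pool_init(void) {
    for (int i = 0; i < NBLOCKS - 1; i++) pool[i].next = &pool[i + 1];
    pool[NBLOCKS - 1].next = NULL;
    free_list = &pool[0];
}

static void *pool_alloc(void) {            /* O(1): pop the free-list head */
    block_t *b = free_list;
    if (b) free_list = b->next;
    return b;
}

static void pool_free(void *p) {           /* O(1): push back on the list */
    block_t *b = (block_t *)p;
    b->next = free_list;
    free_list = b;
}

int main(void) {
    pool_init();
    void *a = pool_alloc(), *b = pool_alloc();
    pool_free(a);
    void *c = pool_alloc();                /* reuses a's block */
    printf("a=%p b=%p c=%p (c == a: %d)\n", a, b, c, c == a);
    return 0;
}
```

One such sub-pool per dominant allocation size gives predictable O(1) allocation and no external fragmentation, which is what the dynamic memory management refinement step exploits.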
  35. CONCLUSIONS. IoT infrastructure platforms currently still have energy bottlenecks. Disruptive approaches which change the application software stack are very challenging to introduce in industry. So instead, go for changes inside the memory organization, combined with the introduction of new NVM technologies, which are not visible to the application code stack. Tuning of selected parameters can reduce the performance penalty due to the NVM to very tolerable levels (≈1%), and significant energy reductions are potentially available (though this depends on the read energy per access). Pareto-optimal values for the different parameters are application- and platform-dependent. This has to be combined with a matched and optimized dynamic memory management approach in the middleware layer (hypervisor). Applying this to IoT infrastructure platforms is a promising domain.
