1. NoC: MPSoC Communication Fabric. Interconnection Networks (ELE 580). Shougata Ghosh, 18th Apr 2006
2. Outline
   - MPSoC
   - Network-on-Chip
   - Cases:
     - IBM CoreConnect
     - CrossBow IPs
     - Sonics Silicon Backplane
3. What are MPSoCs?
   - MPSoC: Multiprocessor System-on-Chip
   - Most SoCs today use multiple processing cores
   - MPSoCs are characterised by heterogeneous multiprocessors
   - CPUs, IP (intellectual-property) blocks, DSP cores, memory, communication handlers (USB, UART, etc.)
4. Where are MPSoCs used?
   - Cell phones
   - Network processors (used in telecom and networking to handle high data rates)
   - Digital television and set-top boxes
   - High-definition television
   - Video games (e.g. the PlayStation 2 Emotion Engine)
5. Challenges
   - All MPSoC designs have the following requirements:
     - Speed
     - Power
     - Area
     - Application performance
     - Time to market
6. Why reinvent the wheel?
   - Why not use a uniprocessor (3.4 GHz!)?
     - PDAs are usually uniprocessor
   - A uniprocessor cannot keep up with real-time processing requirements
     - Too slow for real-time data
   - Real-time processing requires "real" concurrency
   - Uniprocessors provide only "apparent" concurrency through OS multitasking
   - Multiprocessors can provide the concurrency required to handle real-time events
7. Need for multiple processors
   - Why not SMPs?
     - + SMPs are cheaper (reuse)
     - + Easier to program
     - - Unpredictable delays (e.g. snoopy cache coherence)
     - - Need buffering to handle the unpredictability
8. Area concerns
   - Configured SMPs would have unused resources
   - Special-purpose PEs:
     - Do not need to support unwanted processes
       - Faster
       - Area efficient
       - Power efficient
     - Can exploit known memory-access patterns
       - Smaller caches (area savings)
9. MPSoC Architecture
10. Components
   - Hardware
     - Multiple processors
     - Non-programmable IPs
     - Memory
     - Communication interface
       - Interfaces heterogeneous components to the communication network
     - Communication network
       - Hierarchical (busses)
       - NoC
11. Design Flow
   - System-level synthesis
     - Top-down approach
     - A synthesis algorithm derives the SoC architecture and SW model from system-level specs
   - Platform-based design
     - Starts with a functional system spec and a predesigned platform
     - Functions are mapped and scheduled onto HW/SW
   - Component-based design
     - Bottom-up approach
12. Platform-Based Design
   - Start with a functional spec: task graphs
   - Task graph
     - Nodes: tasks to complete
     - Edges: communication and dependence between tasks
   - Execution time is annotated on the nodes
   - Data communicated is annotated on the edges (see the sketch after this slide)
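A minimal sketch of such an annotated task graph in code; the task names, cycle counts, and data volumes are purely illustrative, not taken from the slides:

```python
# Hypothetical task graph for a small streaming application.
# Nodes carry execution time (cycles); edges carry the data volume (bytes)
# and the dependence "dst cannot start before src has finished and its
# data has arrived".
tasks = {
    "capture": {"exec_cycles": 400},
    "filter":  {"exec_cycles": 900},
    "encode":  {"exec_cycles": 1500},
}

edges = {
    ("capture", "filter"): 64 * 1024,   # bytes sent from capture to filter
    ("filter", "encode"):  64 * 1024,
}
```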
13.
   - Map tasks onto predesigned HW
   - Use an extended task graph for SW and communication
14.
   - Mapping onto HW
   - Gantt chart: scheduling of task execution and timing analysis
   - Extended task graph
     - Communication nodes (reads and writes)
   - ILP and heuristic algorithms schedule tasks and communication onto HW and SW (a heuristic sketch follows)
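The slides do not spell out the heuristic; the following is a generic list-scheduling sketch over the hypothetical task graph above, in which each task is greedily placed on the processing element (PE) that finishes it earliest, charging a transfer delay when producer and consumer sit on different PEs:

```python
# Generic list-scheduling heuristic (an illustration, not the ILP from the
# slides). Tasks are visited in dependence (topological) order and placed on
# the PE that lets them finish earliest.
def list_schedule(tasks, edges, pes, bytes_per_cycle=8):
    pe_free = {pe: 0 for pe in pes}            # cycle at which each PE is idle
    finish, placed_on = {}, {}
    for t in tasks:                            # tasks assumed topologically ordered
        best = None
        for pe in pes:
            ready = pe_free[pe]
            for (src, dst), nbytes in edges.items():
                if dst != t:
                    continue
                arrival = finish[src]
                if placed_on[src] != pe:       # remote producer: pay a transfer cost
                    arrival += nbytes // bytes_per_cycle
                ready = max(ready, arrival)
            done = ready + tasks[t]["exec_cycles"]
            if best is None or done < best[0]:
                best = (done, pe)
        finish[t], placed_on[t] = best
        pe_free[best[1]] = best[0]
    return finish, placed_on

# e.g. list_schedule(tasks, edges, pes=["cpu0", "dsp0"]) using the graph above
```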
15. Component-Based Design
   - Conceptual MPSoC platform
   - SW, processor, IP, communication fabric
   - Parallel development
     - Use APIs
   - Quicker time to market
16. Design Flow Schematic
17. Communication Fabric
   - Has mostly been bus based
     - IBM CoreConnect, Sonics Silicon Backplane, etc.
   - Busses are not scalable!
     - Usually around 5 processors, rarely more than 10
   - The number of cores keeps increasing
     - Push towards NoC
18. NoC NoC NoC-ing on Heaven's Door!!
   - Typical Network-on-Chip (regular)
19. Regular NoC
   - An array of tiles
   - Each tile has an input port (inject into the network) and an output port (receive from the network)
   - Input port: 256-bit data plus 38-bit control
   - The network handles both static and dynamic traffic
     - Static: flow of data from a camera to an MPEG encoder
     - Dynamic: memory requests from a PE (or CPU)
   - Dedicated VCs are used for static traffic
   - Dynamic traffic goes through arbitration
20. Control Bits
   - Control bit fields (38 bits in total; a packing sketch follows):
     - Type (2 bits): head, body, tail, idle
     - Size (4 bits): data size from 0 (1 bit) to 8 (256 bits)
     - VC mask (8 bits): mask selecting among the 8 VCs; can be used to prioritise
     - Route (16 bits): source routing
     - Ready (8 bits): signal from the network indicating it is ready to accept the next flit (why 8 bits is unclear)
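A small sketch of how those field widths could be packed into one control word; only the widths come from the slide, while the field order and the example values are assumptions:

```python
# Pack/unpack the 38-bit control word. Only the widths (2+4+8+16+8 = 38 bits)
# come from the slide; the field order here is an assumption.
FIELDS = [("type", 2), ("size", 4), ("vc_mask", 8), ("route", 16), ("ready", 8)]

def pack_control(**values):
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} does not fit in {width} bits"
        word |= v << shift
        shift += width
    return word                      # shift ends at 38

def unpack_control(word):
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields

# Example: a head flit (type code 0, assumed) carrying 256 bits (size code 8) on VC 3.
head = pack_control(type=0, size=8, vc_mask=1 << 3, route=0x2E1, ready=0)
```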
21. Flow Control
   - Virtual-channel flow control
   - Router with input and output controllers
   - The input controller keeps a buffer and state for each VC
   - The input controller strips the routing info from the head flit
   - The flit then arbitrates for an output VC
   - Each output VC has a buffer for a single flit
     - Used to hold a flit waiting for an input buffer at the next hop
   - (A per-VC state sketch follows.)
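As an illustration only (the class name, buffer depth, and state names below are assumptions), the per-input-VC state described above could be modelled like this:

```python
# Illustrative per-VC input state: a small buffer plus packet state; the
# routing field is stripped from the head flit, after which the packet
# competes for an output VC.
class InputVC:
    def __init__(self, depth=4):
        self.buffer, self.depth = [], depth
        self.state = "idle"                  # idle -> routing -> active
        self.route = None

    def push(self, flit):
        if len(self.buffer) >= self.depth:
            return False                     # no space: back-pressure upstream
        if flit["type"] == "head":
            self.route = flit.pop("route")   # strip routing info from the head
            self.state = "routing"           # next: arbitrate for an output VC
        self.buffer.append(flit)
        return True
```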
22. Input and Output Controllers
23. NoC Issues
   - Basic differences between NoC and inter-chip or inter-board networks:
     - Wires and pins are ABUNDANT on chip
     - Buffer space is limited on chip
   - The "pins" available to each on-chip tile could number around 24,000, compared to roughly 1,000 for inter-chip designs
   - Designers can trade wiring resources for network performance!
   - Channel widths:
     - On-chip: ~300 bits
     - Inter-chip: 8-16 bits
24. Topology
   - The previous design used a folded torus
   - A folded torus has twice the wire demand and twice the bisection bandwidth of a mesh
   - It converts plentiful wires into bandwidth (performance); see the rough comparison below
   - Not hard to implement on chip
   - However, it can be more power hungry
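A back-of-the-envelope comparison, assuming the usual textbook channel counts across the bisection of a k x k mesh versus a torus and the ~300-bit on-chip channel width mentioned earlier; the concrete numbers are illustrative:

```python
# Rough bisection-bandwidth comparison (bits crossing the cut per cycle).
# Channel counts (k for a k x k mesh, 2k for a torus) are the usual textbook
# figures and match the slide's "twice the bisection BW" claim.
def bisection_bits(k, channel_bits=300, topology="mesh"):
    channels = k if topology == "mesh" else 2 * k
    return channels * channel_bits

print(bisection_bits(4, topology="mesh"))    # 1200
print(bisection_bits(4, topology="torus"))   # 2400, twice the mesh figure
```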
25. Flow Control Decision
   - Area is scarce in on-chip designs
   - Buffers use up a LOT of area
   - Flow-control schemes that need fewer buffers are favourable
   - However, this must be balanced against performance
     - Packet-dropping flow control needs the least buffering, but at the expense of performance
     - Misrouting is an option when there is enough path diversity
26. High-Performance Circuits
   - Wiring is regular and known at design time
   - It can be accurately modeled (R, L, C)
   - This enables:
     - Low-swing signalling: ~100 mV compared to 1 V
       - HUGE power saving
     - Overdrive gives about 3x the signal velocity of full-swing drivers
     - Overdrive increases repeater spacing
       - Again, significant power savings
27. Heterogeneous NoC
   - Regular topologies facilitate modular design and scale easily by replication
   - However, for heterogeneous systems regular topologies lead to overdesign!
   - Heterogeneous NoCs can optimise local bottlenecks
   - Solution?
     - A complete application-specific NoC synthesis flow
     - Customised topology and NoC building blocks
28. xPipes Lite
   - An application-specific NoC library
   - Creates application-specific NoCs
     - Uses a library of NIs, switches and links
     - Parameterised library modules optimised for frequency and low latency
   - Packet-switched communication
   - Source routing
   - Wormhole flow control
   - Topologies: torus, mesh, B-tree, butterfly
29. NoC Architecture Block Diagram
30. xPipes Lite
   - Uses OCP to communicate with cores
   - OCP advantages:
     - Industry-wide standard protocol for communication between cores and the NoC
     - Allows parallel development of cores and NoC
     - Smoother development of modules
     - Faster time to market
31. xPipes Lite: Network Interface
   - Bridges the OCP interface and the NoC switching fabric
   - Functions:
     - Synchronisation between OCP and xPipes timing
     - Packeting OCP transactions into flits
     - Route calculation
     - Flit buffering to improve performance
32. NI
   - Uses two registers to interface with OCP
     - A header register to store the address (sent once)
     - A payload register to store data (sent multiple times for burst transfers)
   - Flits are generated from the registers
     - Header flit from the header register
     - Body/payload flits from the payload register
   - Routing info is carried in the header flit
     - The route is looked up in a LUT using the destination address (see the sketch below)
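A sketch of that packetization step; the address map, route values, and flit layout are all hypothetical:

```python
# Hypothetical packetization: the header register (address) becomes the head
# flit, whose source route is looked up by destination address; the payload
# register is replayed once per burst beat as body flits.
route_lut = {0x1000_0000: 0x2E1}               # dest region -> source route bits

def packetize(addr, payload_words):
    route = route_lut[addr & 0xF000_0000]      # route chosen by destination
    flits = [{"type": "head", "route": route, "addr": addr}]
    flits += [{"type": "body", "data": w} for w in payload_words]
    if len(flits) > 1:
        flits[-1]["type"] = "tail"             # last payload flit closes the packet
    return flits

# e.g. packetize(0x1000_0040, [0xDEAD, 0xBEEF]) -> head, body, tail
```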
33. Network Interface
   - Bidirectional NI
   - The output stage is identical to that of the xPipes switches
   - The input stage uses dual-flit buffers
   - Uses the same flow control as the switches
34. Switch Architecture
   - The xPipes switch is the basic building block of the switching fabric
   - 2-cycle latency
   - Output-queued router
   - Fixed and round-robin priority arbitration on the input lines
   - Flow control:
     - ACK/NACK
     - Go-Back-N semantics
   - CRC for error detection (a generic sketch follows)
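The slides do not say which CRC xPipes uses; a generic CRC-8 (polynomial 0x07) over a flit's bytes would look like this:

```python
# Generic MSB-first CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07).
# The actual polynomial and width used by xPipes are not given in the slides.
def crc8(data, poly=0x07):
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# e.g. crc8(bytes([0x12, 0x34, 0x56])) appended to (or checked against) a flit
```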
35. Switch
   - The allocator module performs arbitration for the head flit
   - The path is held until the tail flit
   - The routing info requests the output port
   - The switch is parameterisable in:
     - Number of inputs/outputs, arbitration policy, output buffer sizes
36. Switch Flow Control
   - An input flit is dropped if:
     - The requested output port is held by a previous packet
     - The output buffer is full
     - It lost the arbitration
   - A NACK is sent back
   - All subsequent flits of that packet are dropped until the header flit reappears
     - (Go-Back-N flow control; a sketch of the drop rule follows)
   - The switch updates the routing info for the next switch
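A minimal sketch of that drop rule, assuming a per-input `port_state` dict that holds the three conditions as booleans (all names are illustrative):

```python
# Drop/accept decision per input flit, following the three conditions above.
# A NACK (return value False) puts the input into Go-Back-N mode: everything
# is discarded until the packet's head flit is retransmitted.
def accept_flit(flit, port_state, won_arbitration):
    if port_state["dropping"] and flit["type"] != "head":
        return False                           # still discarding the NACKed packet
    ok = (not port_state["output_held_by_other_packet"]
          and not port_state["output_buffer_full"]
          and won_arbitration)
    port_state["dropping"] = not ok            # a NACK starts Go-Back-N; an ACK clears it
    return ok

# e.g. state = {"dropping": False, "output_held_by_other_packet": False,
#               "output_buffer_full": False}
```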
37. xPipes Lite: Links
   - The links are pipelined to overcome the interconnect-delay problem
   - xPipes Lite uses shallow pipelines for all modules (NI, switch)
     - Low latency
     - Smaller buffer requirements
     - Area savings
     - Higher frequency
38. xPipes Lite Design Flow
39. IBM CoreConnect
40. CoreConnect Bus Architecture
   - An open 32-, 64-, and 128-bit on-chip bus standard for cores
   - Communication fabric for IBM Blue Logic and other, non-IBM devices
   - Provides high bandwidth through a hierarchical bus structure:
     - Processor Local Bus (PLB)
     - On-Chip Peripheral Bus (OPB)
     - Device Control Register bus (DCR)
41. Performance Features
42. CoreConnect Components
   - PLB
   - OPB
   - DCR
   - PLB arbiter
   - OPB arbiter
   - PLB-to-OPB bridge
   - OPB-to-PLB bridge
43. PLB
44. Processor Local Bus
   - Fully synchronous; supports up to 8 masters
   - 32-, 64-, and 128-bit architecture versions, extendable to 256 bits
   - Separate read and write data buses, enabling overlapped transfers and higher data rates
   - High-bandwidth capabilities:
     - Burst transfers, both variable and fixed length
     - Pipelining
     - Split transactions
     - DMA transfers
     - No on-chip tri-states required
     - Cache-line transfers
     - Overlapped arbitration, programmable priority fairness
45. Processor Local Bus (cont'd.)
