Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved
    MPSoC 18 Apr: Presentation Transcript

    • NoC: MPSoC Communication Fabric Interconnection Networks (ELE 580) Shougata Ghosh 18th Apr, 2006
    • Outline
      • MPSoC
      • Network-On-Chip
      • Cases:
        • IBM CoreConnect
        • CrossBow IPs
        • Sonics Silicon Backplane
    • What are MPSoCs?
      • MPSoC – Multiprocessor System-On-Chip
      • Most SoCs today use multiple processing cores
      • MPSoCs are characterised by heterogeneous multiprocessors
      • CPUs, IPs (Intellectual Property cores), DSP cores, memory, communication handlers (USB, UART, etc.)
    • Where are MPSoCs used?
      • Cell phones
      • Network Processors
      • (Used by Telecomm. and networking to handle high data rates)
      • Digital Television and set-top boxes
      • High Definition Television
      • Video games (PlayStation 2 Emotion Engine)
    • Challenges
      • All MPSoC designs have the following requirements:
        • Speed
        • Power
        • Area
        • Application Performance
        • Time to market
    • Why Reinvent the wheel?
      • Why not use a uniprocessor (3.4 GHz!!)?
        • PDAs are usually uniprocessor
      • Cannot keep up with real-time processing requirements
        • Too slow for real-time data
      • Real-time processing requires “real” concurrency
      • Uniprocessors provide only “apparent” concurrency through multitasking (OS)
      • Multiprocessors can provide concurrency required to handle real-time events
    • Need multiple Processors
      • Why not SMPs?
        • +SMPs are cheaper (reuse)
        • +Easier to program
        • -Unpredictable delays (e.g., snoopy caches)
        • -Need buffering to handle unpredictability
    • Area concerns
      • Configured SMPs would have unused resources
      • Special purpose PEs:
        • Don’t need to support unwanted processes
          • Faster
          • Area efficient
          • Power efficient
        • Can exploit known memory access patterns
          • Smaller Caches (Area savings)
    • MPSoC Architecture
    • Components
      • Hardware
        • Multiple processors
        • Non-programmable IPs
        • Memory
        • Communication Interface
          • Interface heterogeneous components to Comm. Network
        • Communication Network
          • Hierarchical (Busses)
          • NoC
    • Design Flow
      • System-level-synthesis
        • Top-down approach
        • Synthesis algorithm derives the SoC architecture + SW model from system-level specs.
      • Platform-based Design
        • Starts with Functional System Spec. + Predesigned Platform
        • Mapping & Scheduling of functions to HW/SW
      • Component-based Design
        • Bottom-up approach
    • Platform Based Design
      • Start with functional spec: task graphs
      • Task graph
        • Nodes: Tasks to complete
        • Edges: Communication and Dependence between tasks
      • Execution time on the nodes
      • Data communicated on the edges
      • Map tasks onto predesigned HW
      • Use Extended Task Graph for SW and Communication
      • Mapping on to HW
      • Gantt chart: Scheduling task execution & Timing analysis
      • Extended Task Graph
        • Comm. Nodes
          • (Reads and Writes)
      • ILP and Heuristic Algo. to schedule Task and Comm. to HW and SW
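The mapping-and-scheduling step above can be sketched with a greedy list scheduler over a task graph, yielding the Gantt-style start/finish times the slides mention. This is a toy stand-in for the ILP/heuristic algorithms, and the task names, execution times, and PE assignment below are invented for illustration:

```python
# Toy list scheduler: maps a task graph onto PEs and computes a Gantt-style
# schedule. Task names, times, and the PE assignment are illustrative only.
from collections import defaultdict

def list_schedule(tasks, deps, exec_time, assignment):
    """tasks: ids in topological order; deps: task -> predecessor list;
    exec_time: task -> cycles; assignment: task -> PE name.
    Returns task -> (start, finish)."""
    pe_free = defaultdict(int)   # earliest free cycle on each PE
    sched = {}
    for t in tasks:
        # a task is ready once all its graph predecessors have finished
        ready = max((sched[p][1] for p in deps.get(t, [])), default=0)
        start = max(ready, pe_free[assignment[t]])
        sched[t] = (start, start + exec_time[t])
        pe_free[assignment[t]] = sched[t][1]
    return sched

# Example graph: A feeds B and C; D joins B and C (edges = dependences).
sched = list_schedule(
    tasks=["A", "B", "C", "D"],
    deps={"B": ["A"], "C": ["A"], "D": ["B", "C"]},
    exec_time={"A": 2, "B": 3, "C": 1, "D": 2},
    assignment={"A": "PE0", "B": "PE0", "C": "PE1", "D": "PE1"},
)
```

Reading off `sched` per PE gives the Gantt chart used for timing analysis.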
    • Component Based Design
      • Conceptual MPSoC Platform
      • SW, Processor, IP, Comm. Fabric
      • Parallel Development
        • Use APIs
      • Quicker time to market
    • Design Flow Schematic
    • Communication Fabric
      • Has been mostly Bus based
        • IBM CoreConnect, Sonics Silicon Backplane, etc.
      • Busses not scalable!!
        • Usually 5 Processors – rarely more than 10!
      • Number of cores has been increasing
        • Push towards NoC
    • NoC NoC NoC-ing on Heaven’s Door!!
      • Typical Network-On-Chip (Regular)
    • Regular NoC
      • Bunch of tiles
      • Each tile has input (inject into network) and output (receive from network) ports
      • Input port => 256-bit data + 38-bit control
      • Network handles both static and dynamic traffic
        • Static: Flow of data from camera to MPEG encoder
        • Dynamic: Memory request from PE (or CPU)
      • Uses dedicated VC for static traffic
      • Dynamic traffic goes through arbitration
    • Control Bits
      • Control bit fields
        • Type (2 bits): Head, Body, Tail, Idle
        • Size (4 bits): Data size 0 (1-bit) to 8 (256-bit)
        • VC Mask (8 bits): Mask to determine VC (out of 8)
          • Can be used to prioritise
        • Route (16 bits): Source routing
        • Ready (8 bits): Signal from the network indicating it is ready to accept the next flit (presumably one bit per VC, matching the 8 VCs)
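The 38-bit control word (2 + 4 + 8 + 16 + 8 bits) can be sketched as a pack/unpack pair. The field ordering below is an assumption; the slides only give the widths:

```python
# Pack/unpack the 38-bit control word as: type(2) | size(4) | VC mask(8) |
# route(16) | ready(8). Field order is an assumption; the deck gives widths only.
def pack_control(ftype, size, vc_mask, route, ready):
    assert 0 <= ftype < 4 and 0 <= size < 16
    assert 0 <= vc_mask < 256 and 0 <= route < 65536 and 0 <= ready < 256
    word = ftype
    word = (word << 4) | size
    word = (word << 8) | vc_mask
    word = (word << 16) | route
    word = (word << 8) | ready
    return word  # fits in 38 bits

def unpack_control(word):
    ready = word & 0xFF
    route = (word >> 8) & 0xFFFF
    vc_mask = (word >> 24) & 0xFF
    size = (word >> 32) & 0xF
    ftype = (word >> 36) & 0x3
    return ftype, size, vc_mask, route, ready
```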
    • Flow Control
      • Virtual Channel flow control
      • Router with input and output controller
      • Input controller has buffer and state for each VC
      • Input controller strips routing info from the head flit
      • Flit arbitrates for output VC
      • Output VC has buffer for single flit
        • Used to store flit trying to get inp. buffer in next hop
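The single-flit output-VC buffer described above can be modelled minimally as follows (a hypothetical sketch, not the actual router RTL):

```python
# Minimal model of an output VC with a one-flit buffer: it parks a flit that is
# still waiting for an input-buffer slot in the next hop. Hypothetical sketch.
class OutputVC:
    def __init__(self):
        self.buf = None  # holds at most one flit

    def send(self, flit):
        """Try to hand a flit to this output VC; fails if the buffer is full."""
        if self.buf is not None:
            return False  # occupied: flit must keep waiting upstream
        self.buf = flit
        return True

    def downstream_ready(self):
        """Next hop freed an input-buffer slot: forward the stored flit."""
        flit, self.buf = self.buf, None
        return flit
```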
    • Input and Output Controllers
    • NoC Issues
      • Basic difference between NoC and Inter-chip or Inter-board networks:
        • Wires and pins are ABUNDANT in NoC
        • Buffer space is limited in NoC
      • On-Chip pins for each tile could be 24,000 compared to 1000 for inter-chip designs
      • Designers can trade wiring resources for network performance!
      • Channels:
        • On-Chip => 300 bits
        • Inter-Chip => 8-16 bits
    • Topology
      • The previous design used folded torus
      • Folded torus has twice the wire demand and twice the bisection BW compared to mesh
      • Converts plentiful wires to bandwidth (performance)
      • Not hard to implement On-Chip
      • However, it could be more power hungry
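The mesh-versus-folded-torus bisection claim can be checked with simple arithmetic (counting unidirectional channel crossings, the usual textbook convention; exact constants depend on how links are counted):

```python
# Back-of-envelope bisection comparison for k-ary 2-D networks.
def bisection_channels(k, topology):
    if topology == "mesh":
        return k          # k channels cross the middle cut
    if topology == "folded_torus":
        return 2 * k      # wraparound links double the cut width
    raise ValueError(topology)

def bisection_bits_per_cycle(k, topology, channel_bits):
    # bisection bandwidth scales with channel width: plentiful on-chip wires
    # (e.g. ~300-bit channels) translate directly into bandwidth
    return bisection_channels(k, topology) * channel_bits
```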
    • Flow Control Decision
      • Area scarce in On-Chip designs
      • Buffers use up a LOT of area
      • Flow control with fewer buffers is favourable
      • However, this must be balanced against performance
        • Packet-dropping flow control requires the least buffering, but at the expense of performance
        • Misrouting works when there is enough path diversity
    • High Performance Circuits
      • Wiring regular and known at design time
      • Can be accurately modeled (R, L, C)
      • This enables:
        • Low swing circuit – 100mV compared to 1V
          • HUGE power saving
        • Overdrive produces 3 times signal velocity compared to full-swing drivers
        • Overdrive increases repeater spacing
          • Again significant power savings
    • Heterogeneous NoC
      • Regular topologies facilitate modular design and easily scaled up by replication
      • However, for heterogeneous systems, regular topologies lead to overdesign!!
      • Heterogeneous NoCs can optimise local bottlenecks
      • Solution?
        • Complete Application Specific NoC synthesis flow
        • Customised topology and NoC building blocks
    • xPipes Lite
      • Application Specific NoC library
      • Creates application specific NoC
        • Uses library of NI, switch and link
        • Parameterised library modules optimised for frequency and low latency
      • Packet switched communication
      • Source routing
      • Wormhole flow control
      • Topology: Torus, Mesh, B-Tree, Butterfly
    • NoC Architecture Block Diagram
    • xPipes Lite
      • Uses OCP to communicate with cores
      • OCP advantages:
        • Industry wide standard for comm. protocol between cores and NoC
        • Allows parallel development of cores and NoC
        • Smoother development of modules
        • Faster time to market
    • xPipes Lite – Network Interface
      • Bridges OCP interface and NoC switching fabric
      • Functions:
        • Synchronisation between OCP and xPipes timing
        • Packetising OCP transactions into flits
        • Route calculation
        • Flit buffering to improve performance
    • NI
      • Uses 2 registers to interface with OCP
        • Header reg. to store address (sent once)
        • Payload reg. to store data (sent multiple times for burst transfers)
      • Flits generated from the registers
        • Header flit from Header reg.
        • Body/payload flits from Payload reg.
      • Routing info. in header flit
        • Route determined from LUT using the dest. address
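The flit generation described above can be sketched as follows. `ROUTE_LUT` and the flit layout are invented for illustration; only the scheme (header flit from the header register and LUT, body/tail flits from the payload register) is from the deck:

```python
# Hypothetical flit generation from the NI's two registers. ROUTE_LUT and the
# flit layout are invented; only the packetisation scheme is from the slides.
ROUTE_LUT = {0x1000: 0b0110, 0x2000: 0b0001}  # dest address block -> source route

def packetize(addr, payload_words):
    route = ROUTE_LUT[addr & 0xF000]           # route looked up from dest address
    flits = [("HEAD", (route, addr))]          # header flit from the header reg.
    for w in payload_words[:-1]:               # body flits from the payload reg.
        flits.append(("BODY", w))              # (reg. reused for burst transfers)
    flits.append(("TAIL", payload_words[-1]))  # tail flit closes the packet
    return flits
```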
    • Network Interface
      • Bidirectional NI
      • Output stage identical to xPipes switches
      • Input stage uses dual-flit buffers
      • Uses the same flow control as the switches
    • Switch Architecture
      • xPipes switch is the basic building block of the switching fabric
      • 2-cycle latency
      • Output queued router
      • Fixed and round robin priority arbitration on input lines
      • Flow control
        • ACK/nACK
        • Go-Back-N semantics
      • CRC
    • Switch
      • Allocator module does the arbitration for head flit
      • Holds path until tail flit
      • Routing info requests the output port
      • The switch is parameterisable in:
        • Number of input/output, arbitration policy, output buffer sizes
    • Switch flow control
      • Input flit dropped if:
        • Requested output port held by previous packet
        • Output buffer full
        • Lost the arbitration
      • NACK sent back
      • All subsequent flits of that packet dropped until header flit reappears
        • (Go-Back-N flow control)
      • Updates routing info for next switch
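The drop-until-the-header-reappears behaviour can be modelled receiver-side as below (a simplified sketch of the ACK/nACK Go-Back-N semantics, not the real switch logic):

```python
# Receiver-side sketch of Go-Back-N: once a flit is refused (nACK), every
# following flit is dropped until the packet's header flit reappears.
def receive_stream(flits, accept):
    """accept(flit) -> bool models port/buffer/arbitration availability."""
    delivered, dropping = [], False
    for f in flits:
        kind = f[0]
        if dropping:
            if kind == "HEAD" and accept(f):
                dropping = False      # retransmitted header got through
                delivered.append(f)
            continue
        if accept(f):
            delivered.append(f)
        else:
            dropping = True           # nACK: drop until header flit reappears
    return delivered
```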
    • xPipes Lite - Links
      • The links are pipelined to overcome interconnect delay problem
      • xPipes Lite uses shallow pipelines for all modules (NI, Switch)
        • Low latency
        • Less buffer requirement
        • Area savings
        • Higher frequency
    • xPipes Lite Design Flow
    • IBM CoreConnect
    • CoreConnect Bus Architecture
      • An open 32-, 64-, and 128-bit on-chip bus standard for cores
      • Communication fabric for IBM Blue Logic and other non-IBM devices
      • Provides high bandwidth with hierarchical bus structure
        • Processor Local Bus (PLB)
        • On-Chip Peripheral Bus (OPB)
        • Device Control Register bus (DCR)
    • Performance Features
    • CoreConnect Components
      • PLB
      • OPB
      • DCR
      • PLB Arbiter
      • OPB Arbiter
      • PLB to OPB Bridge
      • OPB to PLB Bridge
    • PLB
    • Processor Local Bus
      • Fully synchronous, supports up to 8 masters
      • 32-, 64-, and 128-bit architecture versions; extendable to 256-bit
      • Separate read/write data buses enable overlapped transfers and higher data rates
      • High Bandwidth Capabilities
        • Burst transfers, variable and fixed-length supported
        • Pipelining
        • Split transactions
        • DMA transfers
        • No on-chip tri-states required
        • Cache Line transfers
        • Overlapped arbitration, programmable priority fairness
    • Processor Local Bus (cont’d.)