TRACK G: An Innovative multicore system architecture for wireless SoCs/ Alon Yaakov
Slide notes:
  • History data, i.e. channel estimates ("chest")
  • Task scheduling must account for the worst-case execution time per task
  • Cache coherency: when shared data resides in the cache of one core, that core is unaware of changes to the data made by other cores. This is best solved using the MESI / ACE protocols.
  • Tasks are pushed by the SW -> flexibility

    1. An Innovative Multicore System Architecture for Wireless SoCs (May 1, 2013). Alon Yaakov, DSP Architecture Manager, CEVA
    2. Multicore in Embedded Systems: Defining the Problem
       • Control-plane
         – Synchronization between cores
         – Semaphores
         – Message passing using a mailbox mechanism
         – Snooping mechanism
         – Interrupt handling
       • Data-plane: equalization, antenna processing, error correction (this will be the focus of today's presentation)
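The control-plane mailbox mechanism mentioned above can be sketched as a single-slot structure polled by the receiving core. This is a minimal illustration, not a CEVA API; the type and function names are invented for the example, and a real mailbox would typically raise an interrupt on post.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical single-slot mailbox: the sender posts one message word,
 * the receiver polls the full flag, reads the payload, then clears it. */
typedef struct {
    volatile uint32_t payload;
    volatile bool     full;   /* set by sender, cleared by receiver */
} mailbox_t;

/* Returns false if the mailbox is still occupied (receiver not done). */
static bool mailbox_post(mailbox_t *mb, uint32_t msg)
{
    if (mb->full)
        return false;
    mb->payload = msg;
    mb->full = true;   /* on real HW this would also raise an interrupt */
    return true;
}

/* Returns false if there is nothing to read. */
static bool mailbox_fetch(mailbox_t *mb, uint32_t *msg)
{
    if (!mb->full)
        return false;
    *msg = mb->payload;
    mb->full = false;
    return true;
}
```

The single flag gives natural back-pressure: a second post fails until the receiver has fetched, which is the simplest form of inter-core flow control.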
    3. Outline
       • Multicore Challenges
       • The CEVA-XC Solution
    4. Multicore Challenges
       1. Partitioning: task and data partitioning onto different chip resources
       2. Resource sharing: memories, buses, system I/Fs, peripherals, etc.
       3. Scheduling: allocating tasks/data
       4. Data sharing: transferring data between engines
       (Diagram: an application mapped onto DSP A/B/C and the CTC, MLD, and FFT accelerators)
    5. Challenge 1: Task Partitioning
       • Tasks
         – Parts of an algorithm running in sequential order
         – A task must have a defined input and output data structure (packets)
       (Diagram: antenna processing, equalization, and error correction split into tasks: FFT, channel estimation, MLD, reordering, interleavers, CTC decoders, concatenation & CRC checkers, with data flowing between them)
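The requirement that every task have a defined input and output packet can be captured in a small descriptor. This is an illustrative sketch, not CEVA's data structures: the `task_t` layout, the toy task bodies, and `pipeline_run` are all invented for the example.

```c
#include <stddef.h>

/* Illustrative task descriptor: every task names its input and output
 * packet buffers, so tasks can be chained in sequential order. */
typedef struct {
    const char *name;        /* e.g. "FFT", "CTC" */
    void       *in_packet;   /* defined input data structure */
    size_t      in_bytes;
    void       *out_packet;  /* defined output data structure */
    size_t      out_bytes;
    void      (*run)(void *in, void *out);  /* the task body */
} task_t;

/* Run a pipeline of tasks in sequential order, wiring each task's
 * output packet to the next task's input packet. */
static void pipeline_run(task_t *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            tasks[i].in_packet = tasks[i - 1].out_packet;
        tasks[i].run(tasks[i].in_packet, tasks[i].out_packet);
    }
}

/* Toy task bodies standing in for real DSP kernels. */
static void task_scale(void *in, void *out) { *(int *)out = *(int *)in * 2; }
static void task_bias (void *in, void *out) { *(int *)out = *(int *)in + 1; }
```

Because inputs and outputs are explicit packets, the same descriptor works whether a task stays on a DSP or is offloaded to a hardware accelerator.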
    6. Challenge 1: Task Partitioning, HW Offloading
       • Parts of the algorithm are better suited for HW acceleration
         – Well-known algorithms that require little programmability
         – Heavy computational effort
       (Diagram: FFT, channel estimation, MLD, interleaver, CTC, and CRC-checker tasks marked as offload candidates)
    7. Challenge 1: Data Partitioning
       • Several cores are used to process different input data packets
       • Suitable for homogeneous systems
       • Shared memory is used for storing history data
         – A core must wait for the data to update before using it -> latency
       • The entire program code is used by all cores
         – A core suffers stall cycles if its L1 memory is small
    8. Challenge 1: Partitioning. OK, Now What?
       • Efficient partitioning depends on the hardware platform
       • Building the optimal system depends on the partitioning
       • There is no single optimal solution; each approach has its merits
       • Partitioning can be eased by starting with a reference design that can be used as a basis
    9. Challenge 2: Resource Sharing
       • Resource types
         – DSP cores
         – HW accelerators
         – Memories
         – Buses
         – DMA
       • Resource sharing creates contention
       (Diagram: multiple cores contending for one shared memory)
    10. Challenge 2: Resource Sharing, Avoiding Contentions
       • If possible, avoid contentions by duplicating HW
         – Multiple DMAs
         – Duplicated HW accelerators
         – Multilayer bus
         – Partition memory into blocks, enabling concurrent access
       • Throughput and latency govern the minimum amount of hardware resources
       (Diagram: memory partitioned into four concurrently accessible blocks)
    11. Challenge 2: Resource Sharing, Arbitration
       • When a simple set of known rules can be defined, a resource can be shared using a HW arbiter
       • QoS
         – Priority
         – Bandwidth allocation (weight)
         – Well-known algorithms (e.g. round robin)
       • Arbitration is based on timesharing of resources -> scheduling
       (Diagram: cores accessing a shared memory through an arbiter)
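The round-robin policy named on the slide can be sketched in a few lines: grant the first requester after the previous winner, so no core is starved. This is a generic textbook arbiter, not CEVA hardware; the names and the four-requester count are assumptions for the example.

```c
#include <stdint.h>

#define N_REQUESTERS 4

/* Minimal round-robin arbiter: grant the first requester found when
 * scanning from just past the last winner, so every core eventually wins. */
typedef struct {
    unsigned last_grant;   /* index of the most recent winner */
} rr_arbiter_t;

/* req is a bitmask of pending requests; returns the granted index,
 * or -1 if nobody is requesting this cycle. */
static int rr_arbitrate(rr_arbiter_t *arb, uint32_t req)
{
    for (unsigned i = 1; i <= N_REQUESTERS; i++) {
        unsigned cand = (arb->last_grant + i) % N_REQUESTERS;
        if (req & (1u << cand)) {
            arb->last_grant = cand;
            return (int)cand;
        }
    }
    return -1;
}
```

A priority or weighted (bandwidth-allocating) arbiter replaces the scan order with a priority or credit comparison, but keeps the same grant-per-cycle shape.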
    12. Challenge 3: Scheduling
       • How do we assign and schedule tasks to cores?
       (Diagram: the application task graph mapped onto DSP A/B/C and the CTC, MLD, and FFT accelerators)
    13. Challenge 3: Scheduling, Static Scheduling
       • Tasks are statically assigned to DSP cores
       • The design phase includes task scheduling
         – Data flow is fixed
         – Suitable when the load on each task is fixed
       (Diagram: tasks pinned to DSP A/B/C and the FFT, MLD, and CTC HW cores)
    14. Challenge 3: Scheduling, Dynamic Scheduling
       > A scheduler dynamically assigns tasks to cores
       > The scheduler algorithm selects the best core to execute each task
         > Processing capabilities
         > Locality of data
         > Load balance
       > Suitable for complex designs with variable processing load and QoS
       (Diagram: a master scheduler dispatching tasks to DSP A/B/C and the FFT, MLD, and CTC accelerators)
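The three selection criteria on the slide (processing capabilities, locality of data, load balance) can be combined in a simple core-picking routine. This is a sketch of one possible policy, with invented types and weights; it is not the scheduler algorithm CEVA ships.

```c
#include <stdint.h>

#define N_CORES 3

/* Illustrative per-core state visible to a master scheduler. */
typedef struct {
    uint32_t caps;       /* bitmask of task types this core can run */
    unsigned load;       /* queued work, a proxy for load balance */
    int      data_local; /* nonzero if the task's input is already in this core's L1 */
} core_state_t;

/* Pick the best core for a task: it must be capable; locality of data
 * wins first, then the lighter load. Returns -1 if no core can run it. */
static int pick_core(const core_state_t cores[N_CORES], uint32_t task_type)
{
    int best = -1;
    for (int i = 0; i < N_CORES; i++) {
        if (!(cores[i].caps & task_type))
            continue;
        if (best < 0) {
            best = i;
            continue;
        }
        int bl = cores[best].data_local, cl = cores[i].data_local;
        if (cl > bl || (cl == bl && cores[i].load < cores[best].load))
            best = i;
    }
    return best;
}
```

Preferring local data before load balance reflects the slide's ordering: moving a task to where its input already sits avoids the data transfer that pre-fetching would otherwise have to hide.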
    15. Challenge 4: Data Sharing, Memory Hierarchy
       • Internal L1 memory
         – Fast memory with no access penalty
         – Small / medium size
         – Dedicated per core
       • External memory
         – Can be on-chip (L2) or off-chip (e.g. DDR)
         – Slow memory with access penalty
         – Large size
         – Shared among several cores -> contentions
    16. Challenge 4: Data Sharing, Using Cache
       • When shared data is used, a cache system can reduce the stall count
         – Statistically reduces memory stalls, but is not deterministic
       • Useful only for accessing narrow data widths
         – Cache should be used for control data
         – Not recommended for vector DSP data flow: large caches, many stall cycles
       So how do we share vector data?
    17. Challenge 4: Data Sharing, Pre-Fetching Data
       • A task cannot start until its preceding task completes
       • If we can schedule the next task to be executed, we can pre-fetch its input data
         – With static scheduling, the data flow is known
         – With dynamic scheduling, the scheduler must handle the data move prior to activating a task
       (Diagram: DMA transfers inserted between every pair of consecutive tasks in the pipeline)
    18. Challenge 4: Data Sharing, Pre-Fetching using DMA
       • A DMA transfer must wait for the following conditions:
         – Source data is available
         – Destination data can be written (i.e. the allocated memory is free)
       • DMA activation schemes
         – Real-time SW -> programmable, large MIPS overhead
         – HW system events -> not programmable
         – Queue manager -> programmable, no MIPS overhead
    19. Challenge 4: Data Sharing, Pre-Fetching using DMA with a Queue Manager
       • A queue is a list of tasks handled in a FIFO manner
       • Each DMA queue contains all DMA tasks related to one data-flow channel
       • DMA tasks are pushed to the queue by:
         – DSP software (i.e. static scheduling)
         – The system scheduler (i.e. dynamic scheduling)
       • Tasks are automatically activated using HW or SW events
         – Source data is available & destination memory is free
       (Diagram: a DMA queue feeding the FFT / channel-estimation task)
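The queue-manager behaviour described above, a FIFO of DMA tasks where the head transfer fires only once its source is available and its destination is free, can be modelled in software. This is a minimal sketch with invented names and a fixed depth, not the CEVA queue manager itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 8

/* One DMA task: its channel plus the two readiness conditions the
 * queue manager checks before firing the transfer. */
typedef struct {
    int  channel;     /* data-flow channel this task belongs to */
    bool src_ready;   /* source data is available */
    bool dst_free;    /* destination memory can be written */
} dma_task_t;

/* FIFO queue of DMA tasks for one data-flow channel. */
typedef struct {
    dma_task_t tasks[QUEUE_DEPTH];
    size_t head, tail;   /* pop from head, push at tail */
} dma_queue_t;

static bool dma_queue_push(dma_queue_t *q, dma_task_t t)
{
    if (q->tail - q->head == QUEUE_DEPTH)
        return false;                     /* queue full */
    q->tasks[q->tail % QUEUE_DEPTH] = t;
    q->tail++;
    return true;
}

/* Pop and "activate" the head task only when both conditions hold;
 * returns false if the queue is empty or the head task is not ready. */
static bool dma_queue_activate(dma_queue_t *q, dma_task_t *out)
{
    if (q->head == q->tail)
        return false;
    dma_task_t *head = &q->tasks[q->head % QUEUE_DEPTH];
    if (!head->src_ready || !head->dst_free)
        return false;
    *out = *head;
    q->head++;
    return true;
}
```

Because activation is driven by the readiness flags (set by HW or SW events) rather than by the DSP polling, this is where the "no MIPS overhead" claim for the queue-manager scheme comes from.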
    20. Outline
       • Multicore Challenges
       • The CEVA-XC Solution
    21. CEVA-XC4000 Multicore Solution
       (Block diagram: optional cache controller, ACE interface)
    22. MUST™ Multicore System Technology: Overview
       • Fully featured data cache
         – Non-blocking, software operations, write-back & write-through
       • Advanced support for cache coherency
         – Based on ARM's leading AMBA-4 ACE™ technology
       • Advanced system interconnect
         – AXI-4: easy system integration and high Quality of Service (QoS)
         – Multi-layer FIC (Fast Inter-Connect): low-latency, high-throughput master and slave ports
         – Multi-level memory architecture using local TCMs and a hierarchy of caches
    23. MUST™ Multicore System Technology (Cont.)
       • Data Traffic Manager
         – Automated data traffic management without DSP intervention
       • Comprehensive software development support
         – Advanced multicore debug and profiling
         – Complete system emulation with real hardware
         – Hardware abstraction layer (HAL) including drivers and system APIs
       • Support for homogeneous and heterogeneous clusters of multiple DSPs and CPUs
         – Support for advanced resource management and sharing
         – Flexible task scheduling for different system architectures: dynamic, event-based, data-driven, etc.
    24. Cache Coherency Support
       > Allows multiple cores to use shared memory without any software intervention
       > Superior performance to SW coherency
       > Simplifies software development
         > Easy SW partitioning and scaling from single core to multicore
       > External memory can be dynamically partitioned into shared and unique areas
         > Minimizes system memory size
         > Flexible memory allocation speeds up SW development
         > Snooping is only applied to shared areas
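The snooping behind this coherency support follows the MESI protocol mentioned in the slide notes. The sketch below is the textbook MESI state machine for one cache line, written as pure transition functions; it illustrates the protocol, not CEVA's or ARM's actual implementation.

```c
/* Textbook MESI line states for one cache line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Local read: a miss in INVALID fetches the line as SHARED or
 * EXCLUSIVE depending on whether another cache holds a copy. */
static mesi_t mesi_on_read(mesi_t s, int others_have_copy)
{
    if (s == INVALID)
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;  /* M/E/S all satisfy the read locally */
}

/* Local write: the line ends up MODIFIED; remote copies are
 * invalidated via the snoop path below. */
static mesi_t mesi_on_write(mesi_t s)
{
    (void)s;
    return MODIFIED;
}

/* Snooped remote write: our copy is now stale, so it becomes INVALID.
 * This is exactly the "core is unaware of changes made by other
 * cores" problem the protocol solves. */
static mesi_t mesi_on_remote_write(mesi_t s)
{
    (void)s;
    return INVALID;
}

/* Snooped remote read: a MODIFIED line is written back and demoted. */
static mesi_t mesi_on_remote_read(mesi_t s)
{
    return (s == INVALID) ? INVALID : SHARED;
}
```

Restricting snooping to the shared areas, as the slide describes, simply means lines in the unique areas never receive the two remote transitions, which is what saves snoop traffic.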
    25. Data Traffic Management
       • The Data Traffic Manager is based on Queue Manager and Buffer Manager structures
         – Queue Manager: maintains multiple independent queues of "tasks"
         – Buffer Manager: autonomously tracks the data status of source and destination buffers
       • Data transfers are automatically managed based on task status and the load of input and output data buffers
       • Automatic data traffic management and DSP offloading
       • Prioritized scheduling for guaranteed QoS
       • Low-latency packet transfers without software intervention
       • Results in lower memory consumption and improved system performance
    26. Data Traffic Manager
       • Allows sharing a resource among multiple cores via a shared queue
         – Tasks are executed based on priority and buffer status
         – Prevents starvation and deadlocks
       • Allows a single core to work with multiple queues
         – The core reads/writes from/to its buffers (which can be local or external)
         – All data transfers between cores and accelerators are performed automatically via the data traffic manager
    27. Dynamic Scheduling
       • Dynamic scheduling in symmetric systems
         – A clustered system based on homogeneous DSP cores
         – Dynamic task allocation to DSP cores at runtime
         – Flexible choice of algorithms based on system load
         – Hardware abstraction using task-oriented APIs
         – Shared external memories
         – FIC interface for low-latency, high-bandwidth data accesses
       • Commonly used in wireless infrastructure applications
    28. MUST™ Hardware Abstraction Layer (HAL)
       • MUST™ is assisted by user-friendly software support
         – Abstracts the queues, buffers, DMA, and caches
       • The software package includes:
         – Drivers and APIs
         – Full system profiling
         – Graphical interface via CEVA ToolBox
    29. Multicore Modeling and Simulation
       • Simulating any number of cores
         – Support for symmetric and asymmetric configurations
       • Support for ARM CADI (Cycle Accurate Debug Interface)
         – Including connectivity to ARM's RealView debugger
       • Comprehensive multicore simulation support
         – Synchronization, system browsing, shared memories, interconnect, accelerator simulation, cross-triggering, AXI / FIC I/F
         – Support for user-defined components
       • ESL tools integration with full debug capabilities
         – Compliant with TLM 2.0
         – Full support for Carbon and Synopsys
    30. Co-processor Portfolio for Wireless Modems
       • Wide range of co-processors offering power-efficient DSP offloading at extreme processing rates
         – A complete wireless platform addressing all major modem PHY processing requirements
         – Offers flexible hardware-software partitioning
         – Customers can focus on differentiation via DSP software
       • Unique automated data traffic management between DSP memory and hardware accelerators
         – Allows fully parallel processing support
         – Based on the data traffic manager
    31. Co-processor Portfolio for Wireless Modems (Cont.)
       • Optimized tightly coupled extensions (TCE)
         – MLD: maximum-likelihood MIMO detectors
           • Supports up to four MIMO layers
           • Achieves near-ML performance
         – De-spreader: 3G de-spreader units
           • Supports all WCDMA and HSDPA channels
           • Scalable up to 3GPP HSPA+ Rel-11
         – DFT / FFT
           • Supports multi-radix DFTs
           • Includes NCO correction
         – Viterbi
           • Programmable K and r values
           • Supports tail biting
         – LLR processing and HARQ combining
           • Supports LTE de-rate matching
           • Significantly reduces HARQ memory buffer sizes
       • Dramatically reduces time-to-market
    32. Putting It All Together: A Cluster of Four CEVA-XC DSPs
       > Processor level
         > Fixed-point and floating-point vector DSPs
         > Running over 1 GHz
         > Data cache
       > Platform level
         > Complete set of tightly coupled co-processor units
         > Automated DSP offloading using data traffic management
       > System level
         > Full cache coherency support
         > AMBA-4 and FIC system interfaces
       > Software development support
         > HAL using drivers and APIs
         > Comprehensive system debug & profiling
    33. Thank You