SoC - Intro, Design Aspects, HLS, TLM

Delivered by:
Subhash Iyer,
Program Head,
Soft Polynomials (I) Pvt. Ltd., Nagpur
(CDAC ATC)
Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.

 Introduction
 What is SoC ?
 SoC characteristics
 Benefits and drawbacks
 Solution
 Major SoC Applications
 Summary

 Technological Advances
 Today’s chips can contain 100M transistors
 Transistor gate lengths are now measured in nanometers
 approximately every 18 months the number of
transistors on a chip doubles – Moore’s law
 The Consequences
 components connected on a Printed Circuit Board
can now be integrated onto single chip
 hence the development of System-On-Chip design

System on a board
System on a Chip

Version A:
Advances in VLSI manufacturing technology have
made it possible to put millions of transistors on a
single die. This enables designers to build
systems-on-a-chip that eventually move everything
from the board onto the chip.
Version B:
An SoC is a high-performance microprocessor, since we
can program the uP and give it instructions to do
whatever we want.
Version C:
An SoC is the effort to integrate heterogeneous
(different types of) silicon IPs onto the same chip,
such as memory, uP, random logic, and analog circuitry.
All of the above are partially right, but not very
accurate!!!

• An SoC is not only a chip; the emphasis is on the “system”.
• SoC = Chip + Software + Integration
• The SoC chip includes:
• Embedded processor
• ASIC Logics and analog circuitry
• Embedded memory
• The SoC Software includes:
• OS, compiler, simulator, firmware, driver, protocol
stack
• Integrated development environment (debugger, linker,
ICE)
• Application interface (C/C++, assembly)
• The SoC Integration includes :
• The whole system solution
• Manufacturing consultation
• Technical support

 A typical digital system design involves a significant
amount of custom logic circuitry, but also includes
pre-designed major components, such as processors,
memory units and various types of input/output (I/O)
interfaces.
 In the traditional approach for designing such
systems, a new integrated circuit (IC) chip is created
for the custom logic circuits, but each pre-designed
component is included as a separate chip
 A different approach to realizing digital systems is
called embedded system design. It leverages the
advanced capabilities of today's IC technology by
implementing many of the components of the system
within a single chip, such as a field programmable
gate array (FPGA).

 Offer large logic capacity, exceeding several
million equivalent logic gates, and include
dedicated memory resources
 Include special hardware circuitry that is
often needed in digital systems, such as
digital signal processing (DSP) blocks (with
multiply and accumulate functionality) and
phase-locked loops (PLLs) (or delay-locked
loops (DLLs)) that support complex clocking
schemes
 Support a wide range of interconnection
standards, such as double data rate (DDR
SRAM) memory, PCI and high-speed serial
protocols.

ASIC Typical Design Steps
[Figure: design-flow timeline. Stages: Top Level Design, Unit Block Design, Unit Block Verification, Integration and Synthesis, Trial Netlists, System Level Verification, Timing Convergence & Verification, Fabrication, DVT Prep, DVT. Time in weeks as extracted: 6, 12, 12, 4, 14, ??, 5, 8. Time to mask order: 48 weeks; total shown: 61 weeks.]
• A typical ASIC design can take up to two years to complete

SoC Typical Design Steps
[Figure: design-flow timeline. Stages: Top Level Design, Unit Block Design, Unit Block Verification, Integration and Synthesis, Trial Netlists, System Level Verification, Timing Convergence & Verification, Fabrication, DVT Prep, DVT. Time in weeks as extracted: 4, 14, 5, 4, 4, 2. Time to mask order: 24 weeks; total shown: 33 weeks.]
• With the increasing complexity of ICs and decreasing geometry, the IC
vendor steps of placement, layout and fabrication are
unlikely to be greatly reduced.
• In fact there is a greater risk that the timing convergence steps
will involve more iteration.
• Need to reduce the time before the vendor steps.
• Need to consider layout issues up-front.

 Design reuse is facilitated if “standard”
internal connection buses are used.
 All cores connect to the bus via a standard
interface.
 Any-to-any connections easy but …
 Not all connections are necessary.
 Global clocking scheme.
 Power consumption.
 Standardization is being addressed by the
Virtual Socket Interface Alliance (VSIA)

• AMBA (Advanced Microcontroller Bus Architecture)
is a collection of buses from ARM for satisfying a
range of different criteria.
• APB (Advanced Peripheral Bus): simple strobed-
access bus with minimal interface complexity.
Suitable for hosting peripherals.
• ASB (Advanced System Bus): a multimaster
synchronous system bus.
• AHB (Advanced High Performance Bus): a high-
throughput synchronous system backbone. Burst
transfers and split transactions.

• One solution to the design productivity
gap is to make ASIC designs more
standardized by reusing segments of
previously manufactured chips.
• These segments are known as “blocks”,
“macros”, “cores” or “cells”.
• The blocks can either be developed in-
house or licensed from an IP company.
• Cores are the basic building blocks.

• Soft Macro
– Reusable synthesizable RTL or netlist of generic library elements
– User of the core is responsible for the implementation and layout
• Firm Macro
– Structurally and topologically optimized for performance and area
through floor planning and placement
– Exist as synthesized code or as a netlist of generic library elements
• Hard Macro
– Reusable blocks optimized for performance, power, size and
mapped to a specific process technology
– Exist as a fully placed and routed netlist and as a fixed layout, such
as in GDSII format.

[Figure: soft core → firm core → hard core. Reusability, portability and flexibility decrease from soft to hard cores; predictability, performance and time to market improve.]

• Locating the required cores and associated
contract discussions can be a lengthy
process
– Identification of IP vendors
– Evaluation criteria
– Comparative evaluation exercise
– Choice of core
– Contract negotiations
• Reuse restrictions
• Costs: license, royalty, tool costs
– Core integration, simulation and verification

 MPSoC is a system-on-chip that contains multiple
instruction-set processors (CPUs).
 The typical MPSoC is a heterogeneous
multiprocessor: there may be several different
types of processing elements (PEs), the
memory system may be heterogeneously
distributed around the machine, and the
interconnection network between the PEs and
the memory may also be heterogeneous.
 MPSoCs often require large amounts of
memory. The device may have embedded
memory on-chip as well as relying on off-chip
commodity memory.

 These chips have:
• one or several processors
• large amounts of memory
• bus-based architectures
• peripherals
• coprocessors
• and I/O channels
• There are several benefits in integrating a
large digital system into a single integrated
circuit.
• These include
– Lower cost per gate.
– Lower power consumption.
– Faster circuit operation.
– More reliable implementation.
– Smaller physical size.
– Greater design security.

• The principal drawbacks of SoC design
are associated with the design pressures
imposed on today’s engineers, such as:
– Time-to-market demands.
– Exponential fabrication cost.
– Increased system complexity.
– Increased verification requirements.

 Why does it take longer to design SOCs compared to
traditional ASICs?
 We must examine factors influencing the degree
of difficulty and Turn Around Time (TAT) (the time
taken from gate-level netlist to metal mask-ready
stage) for designing ASICs and SOCs.
 For an ASIC, the following factors influence TAT:
• Frequency of the design
• Number of clock domains
• Number of gates
• Density
• Number of blocks and sub-blocks
 The key factor that influences TAT for SOCs is system
integration (integrating different silicon IPs on the
same IC).

• Overcome complexity and verification issues by
designing Intellectual Property (IP) to be
reusable.
• Done on such a scale that a new industry has been
developed.
• Design activity is split into two groups:
– IP Authors – producers.
– IP Integrators – consumers.
• IP Authors produce fully verified IP libraries
– Thus making overall verification task more
manageable
• IP Integrators select, evaluate, integrate IP from
multiple vendors
– IP integrated onto Integration Platform designed
with specific application in mind

IP cores are classified into three
distinct categories:
 Hard IP Cores
 Firm IP Cores
 Soft IP Cores

Hard IP cores consist of hard layouts
using particular physical design libraries
and are delivered as mask-level
designed blocks (GDSII format). The
integration of hard IP cores is quite
simple, but hard cores are technology
dependent and provide minimum
flexibility and portability in
reconfiguration and integration.

Soft IP cores are delivered as RTL
VHDL/Verilog code to provide functional
descriptions of IPs. These cores offer
maximum flexibility and reconfigurability
to match the requirements of a specific
design application, but they must be
synthesized, optimized, and verified by
their user before integration into designs.

Firm IP cores bring the best of both
worlds and balance the high performance
and optimization properties of hard IPs
with the flexibility of soft IPs. These cores
are delivered as netlists targeted
to specific physical libraries after going
through synthesis, without the
physical layout being performed.

eS/W: Current application complexity
 Set-top box: >1 million lines of code
 Digital audio processing: >1 million lines of code
 Recordable DVD: Over 100 person-years effort
 Hard-disk drive: Over 100 person-years effort
In multimedia systems
 S/W cost (licenses) 6X larger than H/W chip cost
 eS/W uses 50% to 80% of design resources
 eS/W now an essential part of SoC products

 Speech Signal Processing
 Image and Video Signal Processing
 Information Technologies
 PC interface (USB, PCI, PCI-Express, IDE, etc.)
 Computer peripherals (printer control, LCD
monitor controller, DVD controller, etc.)
 Data Communication
 Wireline communication: 10/100Base-T, xDSL,
Gigabit Ethernet, etc.
 Wireless communication: Bluetooth, WLAN,
2G/3G/4G, WiMAX, UWB, etc.

• Consumer devices,
• Networking,
• Communications, and
• other segments of the electronics industry.
microprocessor, media processor,
GPS controllers, cellular phones,
GSM phones, smart pager ASICs,
digital television, video games,
PC-on-a-chip

Systems on chip are everywhere
Technology advances enable increasingly more complex designs
Central Question: how to exploit deep-submicron
technologies efficiently?

 Technological advances mean that complete
systems can now be implemented on a single
chip.
 The benefits that this brings are significant in
terms of speed, area and power.
 The drawbacks are that these systems are
extremely complex, requiring large amounts of
verification.
 The solution is to design and verify
reusable IP.

Introduction to SoC Design Aspects

 At each level of circuit abstraction, the circuit is equivalent and
performs the same target operation, but its structural
components (and hence the component’s granularity) are
different, and the design issues may be different

 Embedded applications in multimedia,
wireless communications or
networking domain were implemented
on Printed Circuit Boards (PCBs).
 Composed of discrete Integrated
Circuits (ICs)
 General Purpose Processors
 Digital Signal Processors
 Application Specific Integrated Circuits
 Memories
 Further peripherals.
 Communication between discrete
processing elements and memories is
realized by shared bus architectures
(like PCI Express)

 The transition is from board level integration towards System-on-
Chip (SoC) implementations of embedded applications.
 Today multiple heterogeneous processing elements and memories
can be integrated on a single chip
 Increased performance
 Reduced cost
 Improved energy efficiency
 This trend originates from a tremendous increase in features as
well as the multitude of co-existing standards.
 Resulting functional complexity clearly promotes Software
enabled solutions to achieve the required flexibility and cope
with the demanding time-to-market conditions.
 However, stringent energy efficiency constraints of mobile
applications and cost sensitive consumer devices prohibit the use
of general purpose processors.
 Tight cost and performance requirements of versatile embedded
systems lead to application specific heterogeneous multi-
processor architectures

 Classical vertical partitioning approach to HW/SW Codesign, where the
performance critical parts are implemented as dedicated HW blocks and
the rest is executed in SW, is no longer applicable.
 Instead HW/SW Co-design can be seen as:
 Multi-dimensional horizontal mapping problem of an application running on a
heterogeneous multiprocessor platform.
 During the mapping process,
 Exploit application inherent parallelism to achieve performance at reasonable
cost.
 For the computationally intensive portions of typical embedded
applications the extraction of Task Level Parallelism (TLP) is mostly
straightforward:
 The partitioning into a set of loosely coupled functional blocks can be
naturally derived from the algorithmic block diagram

 Two major aspects
 Processing : A set of processing elements has to
be provided for the efficient execution of the
functional tasks.
 Communication mapping: The inter-task data
exchange has to be mapped to a communication
architecture.
 Only a joint consideration of architectural
choices in both areas bears the opportunity
for near optimal quality of results.
 Recent architectural advances offer a huge
design space with enormous potential for
optimization

 Bus paradigm as inherited from the PCB era
constitutes the major power and performance
bottleneck.
 Chip-wide communication is envisioned to be handled
by full-scale Network-on-Chip (NoC) architectures.
 Network-on-Chip architectures
 Resolve the physical issues
 Address the functional aspects of on-chip
communication.

 So far, the dynamic priority based arbitration scheme of shared busses
creates a mutual dependency between all components connected to the
bus.
 Due to this lack of traffic management capabilities every change in the
traffic requirements of the application requires a re-design of the bus
architecture.
 Instead, NoC architectures take advantage of sophisticated networking
algorithms to provide elaborate traffic-management capabilities.
 By that, the ad-hoc communication mapping is replaced with a
disciplined allocation of the required communication services and the
on-chip network takes care to provide the required resources.
 From the system architecture perspective, this separation of the
offered communication services from the architectural resources can be
considered as a virtualization of the actual communication
architecture.
 This virtualization effectively decouples the mapping problem for
communication and computation.
 The price to pay for the physical and functional benefits of NoC based
communication is a significant penalty in terms of chip area as well as
transfer latency.
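
The mutual dependency on a shared bus can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: it uses a fixed (not dynamic) priority, one-cycle transfers, and hypothetical master names "cpu" and "dma" that do not come from the slides.

```python
def simulate_bus(masters, cycles):
    """Simulate a shared bus with priority arbitration.

    masters: dict name -> (priority, period). Each master requests a
    one-cycle transfer every `period` cycles; the lowest priority number
    wins the bus. Returns the number of cycles each master spent waiting.
    """
    pending = {name: 0 for name in masters}  # outstanding transfers
    waited = {name: 0 for name in masters}   # cycles spent blocked
    for t in range(cycles):
        # New requests arrive at the start of the cycle.
        for name, (_, period) in masters.items():
            if t % period == 0:
                pending[name] += 1
        ready = [name for name in masters if pending[name] > 0]
        if not ready:
            continue
        # The arbiter grants the bus to the highest-priority requester.
        winner = min(ready, key=lambda name: masters[name][0])
        pending[winner] -= 1
        for name in ready:
            if name != winner:
                waited[name] += 1
    return waited


light = simulate_bus({"cpu": (0, 4), "dma": (1, 4)}, cycles=16)
heavy = simulate_bus({"cpu": (0, 1), "dma": (1, 4)}, cycles=16)
print(light, heavy)
```

In the second run only the traffic of "cpu" changes, yet "dma" never gets the bus: a change in one component's traffic alters the behaviour seen by every other component on the bus.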

 Programmable processing
elements achieve significant
gains with respect to
performance and computational
efficiency by:
 tailoring instruction set
 micro architecture to the
respective set of tasks
 Examples are innovative
architectures exploiting
 Instruction Level Parallelism (ILP)
 Data Level Parallelism (DLP)
 Despite the increased
computational performance,
the effective performance is
often constrained by the
communication architecture,
since memory access latency
does not keep pace with the
processing power.

 General purpose processors resolve the memory access
bottleneck by using sophisticated cache and memory hierarchies.
 This is generally not applicable for embedded applications due to
the poor memory locality of stream driven and packet based data
processing.
 Instead, processor architectures are equipped with hardware
supported Multi-Threading (HW-MT) to perform task switches
with virtually no performance overhead.
 By that, the application inherent TLP is exploited with the
purpose of hiding memory latency, which effectively leads to a
significant increase in the processor utilization.
 This technique is already widely employed in the network
processor domain and has recently found its way into advanced
multimedia and signal processing platforms.
 In the light of the latency issue caused by NoC architectures, the
importance of memory latency hiding techniques is likely to increase in
the future.
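
A rough feel for latency hiding by hardware multi-threading can be obtained from a simple first-order model. The model is an assumption made here for illustration, not taken from the slides: every thread computes for a fixed number of cycles, then stalls on one memory access, and task switches cost nothing.

```python
def utilization(threads, compute_cycles, memory_latency):
    """First-order processor utilization with zero-overhead thread switching.

    One thread alone is busy for compute_cycles out of every
    (compute_cycles + memory_latency) cycles. Additional threads fill the
    stall time until the processor is fully occupied.
    """
    single = compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, threads * single)


# Hypothetical numbers: 10 compute cycles per 40-cycle memory access.
for n in (1, 2, 4, 5, 8):
    print(n, "threads ->", round(utilization(n, 10, 40), 2))
```

Under these assumptions five thread contexts are enough to hide a 40-cycle latency completely; a longer NoC latency would shift that break-even point upwards.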

 Taking the above considerations together,
future SoCs can be considered as
 NoC enabled multi-processor architectures.
 On-chip communication backbone connects a
large number of heterogeneous processing
clusters and global storage elements.
 Individual processing clusters consist of one
or few application specific programmable
kernels together with tightly coupled
instruction and data memories as well as
local peripherals.

 To cope with the resulting design complexity:
 Achieve virtualization of the architectural resources,
 They can be allocated by the system architect in a deterministic way.
 This virtualization is provided by
 NoC approach for communication part
 SW and HW operating systems for the control and data processing
respectively.
 Divide-and-conquer oriented design paradigm
 Enables individual optimization of the architectural elements
 The price for these benefits
 A penalty in terms of chip area,
 Generally considered to be of constantly decreasing importance.

 HW/SW Co-design of a given embedded
application is defined to
 Architect a heterogeneous MP-SoC platform
 Allocate the architectural resources for the
execution of the application.
 Architecture virtualization resolves the
mutual dependencies in the mapping process
 Trade-offs in the design space still require a
joint consideration of application and
architecture as well as communication

 For example:
 Latency of a more complex on-chip network
can be compensated by either:
 introducing memory hierarchy
 employing hardware multi-threaded processor
kernels.
 Obviously, the resulting design space is
virtually infinite
 Architecting and the mapping phase cannot
be considered independently without
sacrificing quality of results.

 What is needed is:
 A system level design methodology
 Corresponding tool supported modeling
framework
 Transaction-Level Modeling (TLM)
 Advocated by the SystemC language
 The system level design paradigm
 Already incorporated into state-of-the-art
Electronic System Level (ESL) tools

 TLM greatly improves
 modeling efficiency
 simulation speed
 Abstracts from the low-level communication
details of the Register Transfer Level (RTL)
to complete transactions
 Is usually employed in a
byte- and cycle-accurate
fashion
 We will look more at the
packet-level TLM paradigm
 Cycle-level TLM is still too
detailed to explore large
design spaces.

 Since communication becomes the driving design paradigm
for MP-SoC
 Exploration framework is based on a sophisticated,
communication centric timing model:
 Generic synchronization interface
 Defines a concise set of communication primitives,
 Follows the Open Core Protocol (OCP)
 Not biased towards any specific communication architecture.
 Additionally the primitives incorporate timing-annotation to
achieve reasonable timing accuracy at the highly abstract
packet-level TLM layer
 The communication timing model captures the impact on
performance of the interconnection architecture.
 This communication timing model supports the full
spectrum of available and proposed communication
architectures ranging from today’s shared busses to the
emerging NoC paradigm.
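
The idea of packet-level TLM with timing annotation can be sketched in a few lines. This is not SystemC or OCP code; the class and parameter names (Packet, SharedBusModel, bytes_per_cycle, setup_cycles) are hypothetical and only illustrate how a whole packet transfer becomes one function call that returns an annotated completion time.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    source: str
    target: str
    payload_bytes: int


class SharedBusModel:
    """Hypothetical interconnect model: one packet on the bus at a time."""

    def __init__(self, bytes_per_cycle, setup_cycles):
        self.bytes_per_cycle = bytes_per_cycle
        self.setup_cycles = setup_cycles
        self.busy_until = 0  # cycle at which the bus becomes free

    def transport(self, packet, start_cycle):
        """Transfer a whole packet and return its completion cycle."""
        begin = max(start_cycle, self.busy_until)  # wait if the bus is busy
        data_cycles = -(-packet.payload_bytes // self.bytes_per_cycle)  # ceil
        self.busy_until = begin + self.setup_cycles + data_cycles
        return self.busy_until


bus = SharedBusModel(bytes_per_cycle=4, setup_cycles=2)
done_dsp = bus.transport(Packet("dsp", "mem", 64), start_cycle=0)
done_cpu = bus.transport(Packet("cpu", "mem", 8), start_cycle=5)
print(done_dsp, done_cpu)
```

No individual bus cycle is simulated, which is where the simulation speed comes from. Swapping SharedBusModel for a model of another interconnect would leave the calling tasks unchanged, in the spirit of the generic interface described above.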

 Implemented by means of a versatile
modeling framework for architecture
exploration and hardware/software
partitioning
 Key advantages:
 Modeling efficiency
 Higher simulation speed
 A declarative specification mechanism for better
design space exploration

 TLM is a method used for SoC Design
 To specify at a higher level of abstraction
 Involves Communication and Computation
Architectures
 Unified Timing Model aims to standardize the
TLM approach

Need to know why before what & how!!!

 Networking Domain
 Multimedia Domain
 Wireless Communications

 Constitutes
implementation of
networking standards
 IEEE, ITU, ETSI, etc. work
out communication
standards
 The purpose of these
standards is to achieve a high
degree of interoperability
 ISO/OSI reference model
has been providing a
common terminology

 Networking layer standards in the middle of
the ISO/OSI stack address a multitude of
higher layer application standards as well as
lower physical/link layer standards
 Major implementation challenge and effort
is of the networking layer
 Layer three multi-service access switches
are considered as one of the potential killer
applications for MP-SoC platforms, since
they combine the physical wire speed
throughput requirements with flexibility
constraints imposed by the individual
treatment of different service classes and
application characteristics.
 Today’s de facto networking layer standard
is given by the rather simplistic Internet
Protocol (IP).
 Lower level layers are nowadays built in as
ready made blocks
 Physical & link layer data rates of core
network equipment are imposing demanding
performance requirements
 Higher application layers are only present in
the terminal devices,
 So the relatively low to medium throughput
requirements allow for a software
implementation of the flexible and control
dominated functionality.

 Processing of all kinds of media data
 Pictures
 Audio
 Video decoding
 Video pixel processing
 2D/3D graphics
 Standards enable the exchange of media data as
well as device inter-operability
 MOPS: Mega Operations Per second

 Advances in processing capabilities and multimedia
algorithms together with increased user expectations fuel
a constant proliferation of new multimedia standards
 Digital audio decoding (AC3, OGG, MP3),
 Video decoding (MPEG2, MPEG4, H.263, H.264, DivX,
QuickTime)
 3D graphic processing (DirectX 9)
 Apart from the multitude and dynamics of multimedia
standards, a flexible implementation platform is also
mandatory to meet demanding cost constraints of
converging consumer electronics devices such as the
Advanced Set-Top Box (ASTB).
 Here the processing and communication fabrics have to be
shared among the multitude of supported multimedia
applications to limit implementation cost.

 Wireless communication applications aggressively use digital
signal processing to maximize bandwidth efficiency
 Again, a multitude of standards exists
 Each marks a local optimum in
 implementation cost
 Mobility
 power dissipation
 performance bandwidth efficiency
 Multimedia and wireless communication domains are converging
into a new generation of Personal Digital Assistant (PDA) or
SmartPhone devices
 PDAs have started to support a huge variety of travel and fun
related applications with much higher processing requirements,
such as localization, navigation, travel assistant, video camera,
digital camera, picture editing, MP3 player or games
 Additionally, these portable, multimedia-enabled PDA
devices are obliged to support multiple communication
standards, both cable (USB, FireWire) and wireless (3G, WLAN).

 Summary of common trends:
 New features and value added
services: lead to exponentially
increasing processing performance
and communication requirements.
 Standards become more dynamic and
sophisticated and are introduced more
rapidly: calls for high flexibility of the
SoC implementation to meet the
resulting time-to-market as well as
time-in-market requirements.
 For mobile applications and cost
sensitive consumer electronic devices:
energy efficiency becomes the
prevailing cost factor
 Heterogeneous Multi-Processor SoC
(MP-SoC) platforms are generally
believed to meet the above
mentioned conflicting performance,
flexibility and energy efficiency
requirements of demanding
embedded applications
 Hence, in the course of an MP-SoC
platform design the partitioning of
a specific application is a task of
major importance

 Main Partitioning Principle
 Control dominated domain
 Data dominated domain
 This first order partitioning has major
influence on both the target processing and
communication elements as well as on the
appropriate design methodology.

 Examples

 Control-plane processing is characterized by:
 Moderate performance requirements,
 Huge amounts of functionality
 Calling for maximum flexibility
 Developed using an
 Integrated Design Environment (IDE) which is
 Architecture agnostic
 Software centric
 Software engineering techniques
 Object Oriented Programming (OOP) using
 Unified Modeling Language (UML)
 C++
 Java

 To increase the reuse of the control plane
Software (across multiple MP-SoC platform
generations):
 Hardware dependent Software (HdS) portions are
wrapped into:
 stack of middleware
 Real Time Operating System (RTOS)
 device driver layers
 Parallelism in Control Plane Processing:
 Instruction Level Parallelism (ILP)
 Extracted by a VLIW compiler
 Or a superscalar processor architecture
 Helps gain performance
 Task Level Parallelism
 Generally not possible due to huge amount of
functionality

 Data-plane processing is characterized by:
 Computationally intensive data manipulations
 Performance at high data rates
 Demand for high processing
 Demand for high communication performance.
 Rapidly evolving standards in all application
domains impose increasing flexibility
constraints.

 Need to reach performance requirements of networking,
multimedia and wireless communications applications
 Requires aggressively exploiting abundant inherent
parallelism available in data-plane processing tasks
because:
 Functionality can be straightforwardly partitioned into a set of
loosely coupled tasks with well predictable or even
cyclo-stationary execution timing
 A well confined data set is associated with a single activation
of an individual task.
 Data sets associated with successive activations of an
individual task are mostly independent.
 These spatial and temporal properties with respect to
second order task partitioning and data dependency can
already be identified during the algorithm development
stage and lead to an identification of coarse grain TLP.
 This application inherent TLP enables the concurrent and
parallel execution on MP-SoC platforms.
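
The property that successive activations of a data-plane task work on independent data sets is what makes coarse-grain TLP easy to exploit. A minimal sketch, assuming a hypothetical checksum task and using a thread pool purely as a stand-in for a set of processing elements (it makes no claim about speed-up on a real platform):

```python
from concurrent.futures import ThreadPoolExecutor


def checksum_task(packet: bytes) -> int:
    """One task activation: it touches only its own, well confined data set."""
    return sum(packet) % 256


def run_on_workers(packets, workers=4):
    """Map independent activations onto a pool of workers.

    Because no activation depends on another, the result is the same as
    for sequential execution, whatever the allocation to workers.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum_task, packets))


packets = [bytes([1, 2, 3]), bytes([250, 10]), b""]
print(run_on_workers(packets))
```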

More about SoC design concepts next !!!

The main aspects of
SoC architectural elements

 Macroscopic metrics for the classification
and evaluation of architectural elements
 Cost
 Performance
 Power Dissipation
 Computational Efficiency
 Flexibility

 Cost of embedded architecture is separated into
 Non-Recurring Engineering (NRE) cost for the initial design
 Recurring chip fabrication cost.
 NRE cost is caused by the
 Design effort for HW
 SW development
 Fabrication of the initial mask set.
 Typical NRE cost for 90 nm SoC
 10-100 Million USD design effort
 1 Million USD per mask set
 Fabrication cost determined by
 Silicon die area
 Packaging
 Number of pins
 Power dissipation requirements
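
How the two cost components combine can be shown with a short calculation. The design effort and mask set figures are the lower-end values quoted above; the recurring cost of 5 USD per chip and the production volumes are assumptions chosen only for illustration.

```python
def cost_per_chip(design_usd, mask_set_usd, recurring_usd, volume):
    """Total cost per chip: NRE amortized over the volume plus recurring cost."""
    nre = design_usd + mask_set_usd
    return nre / volume + recurring_usd


# 10 million USD design effort, 1 million USD mask set (from the slide),
# 5 USD recurring fabrication cost per chip (assumed).
for volume in (100_000, 1_000_000, 10_000_000):
    print(volume, "chips ->", cost_per_chip(10_000_000, 1_000_000, 5.0, volume))
```

At low volume the NRE share dominates the price of each chip, which is one reason why a platform that can be reused across products and standards is attractive.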

 Performance of both computational
and communication architectures is
classified into:
 Latency
 Throughput
 Latency
 Absolute time passing between the
start and completion of a task,
 Throughput
 Number of accomplished tasks per
time.
 Communication throughput is
measured in bits per second (bps).
 Throughput of programmable
processing elements is measured in
Million Instructions Per Second
(MIPS)
 MIPS measurement is not very
accurate
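
Latency and throughput are related but not interchangeable. The sketch below assumes a hypothetical processing chain of pipelined stages in steady state, which is an assumption added here and not part of the definitions above.

```python
def pipeline_metrics(stage_times_ns):
    """Return (latency in ns, throughput in tasks per second).

    Latency is the time one task needs to pass through all stages.
    In steady state the slowest stage limits how often a task completes.
    """
    latency_ns = sum(stage_times_ns)
    throughput = 1e9 / max(stage_times_ns)
    return latency_ns, throughput


print(pipeline_metrics([10, 20, 10]))
print(pipeline_metrics([10, 10, 10, 10]))
```

Both chains have the same 40 ns latency, but the second one completes tasks twice as often because its slowest stage is twice as fast.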

 Measured in watts
 Denotes the energy per unit time required to operate
an embedded system
 Is an architecture metric of growing importance
 Battery lifetime of mobile devices directly
depends on the energy consumption.
 Packaging cost depends on the heat dissipation
properties, which in turn depend on the power
consumption.
 Striving for low power and energy consumption
constitutes the key driver for architecture
differentiation of embedded SoC platforms

 Derived from performance and power
consumption
 Characterizes efficiency of a given
architectural element with a single value
 Computational efficiency of programmable
architectures is predominantly measured in
MIPS/Watt.
 Alternatively measured in energy
consumption per task (since MIPS
measurement is not very accurate)
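
Why energy per task can be the more telling metric is easiest to see with numbers. All values below are invented for illustration: a general-purpose PE "A" and a domain-specific PE "B" that needs fewer instructions for the same task.

```python
def mips_per_watt(mips, watts):
    """Computational efficiency as instruction rate per unit of power."""
    return mips / watts


def energy_per_task_joules(instructions, mips, watts):
    """Energy for one task: power multiplied by the task's execution time."""
    seconds = instructions / (mips * 1e6)
    return watts * seconds


# Hypothetical PEs running the same task at the same power.
a = {"mips": 200, "watts": 0.5, "instructions": 1_000_000}
b = {"mips": 100, "watts": 0.5, "instructions": 250_000}

for name, pe in (("A", a), ("B", b)):
    print(name,
          mips_per_watt(pe["mips"], pe["watts"]), "MIPS/W,",
          energy_per_task_joules(pe["instructions"], pe["mips"], pe["watts"]), "J per task")
```

With these assumed values PE "A" looks twice as efficient in MIPS/Watt, yet PE "B" spends half the energy on the task, because the MIPS figure ignores how much work each instruction does.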

 Related to the effort to change the
functionality of a given architectural
element
 In contrast to the previous metrics, flexibility
can hardly be measured in an accurate way.
 Nonetheless, in the context of rapidly
evolving functionality and standards of
embedded applications, architectural
flexibility is of major importance to achieve
both decreasing time-to-market as well as
increasing time-in-market

 A processing element (PE) provides the computational
resource to execute a given portion of the application
 Dedicated hardware implementation yields best
performance
 Programmable PEs are controlled by an instruction
stream in a highly flexible way
 The rather poor performance of programmable PEs
has always fueled computer architecture research
towards parallelizing the execution of instructions
 Early efforts in parallel computer architectures are
classified according to the deployment of control-
and data-level parallelism
 SISD
 SIMD
 MIMD
 MISD

 SISD: Single Instruction Single Data
 Traditional von-Neumann kind of
computer architectures
 Sequentially execute a single instruction
stream on a single processing resource
 SIMD: Single Instruction Multiple Data
 Vector processing machines
 Perform a single instruction on multiple
data items in parallel
 Used in architectures for embedded DSP
and graphic applications
 Exploit inherent data-level parallelism
(DLP)
 MIMD: Multiple Instruction Multiple Data
 Traditional homogeneous multi-processor
type of architectures
 Employed in scientific supercomputers
 MISD: Multiple Instruction Single Data
 Rarely encountered class of
architectures,
 Exploit temporal ILP by:
 Setting pipeline stages
 Executing several instructions
simultaneously,

 Superpipelining:
 Uses deep execution pipelines to
increase the clock frequency
 Superscalarity
 Employs parallel functional units and
complex dispatcher architectures to
dynamically extract Instruction Level
Parallelism (ILP)
 Very Long Instruction Word (VLIW)
 Execute several statically scheduled
instructions on parallel functional
units,
 Hence the effort for ILP extraction is
moved into the compiler
 Hardware Multi-Threading (HW-MT)
 Such architectures are able to
concurrently pursue two or more
threads of control by providing
separate register resources for each
thread context
 Domain Specific (DS) Instruction Set
 Tailors the programmable PE to a
specific application domain
 Provide specialized functional units.
 DS processor examples are Digital
Signal Processors (DSPs) employed in
multimedia and wireless
communications, or Network
Processing Units (NPUs) for networking
applications
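
For VLIW, moving ILP extraction into the compiler means that independent instructions are grouped into bundles before the program runs. The following is a deliberately simplified, in-order greedy packer with a hypothetical instruction format (name, destination register, source registers); real VLIW compilers use far more elaborate scheduling.

```python
def pack_bundles(instructions, width):
    """Pack instructions, in program order, into bundles of at most `width`.

    An instruction starts a new bundle when the current one is full or when
    it reads or writes a register written earlier in the same bundle.
    """
    bundles, current, written = [], [], set()
    for name, dest, sources in instructions:
        depends = dest in written or any(src in written for src in sources)
        if current and (len(current) == width or depends):
            bundles.append(current)
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("mul", "r1", ("a", "b")),
    ("add", "r2", ("c", "d")),
    ("sub", "r3", ("r1", "r2")),  # needs the results of mul and add
    ("ld", "r4", ("e",)),
]
print(pack_bundles(program, width=4))
```

Even with four functional units the sketch produces two bundles, because "sub" depends on both earlier results. The hardware needs no dispatcher logic to discover this, since the schedule is fixed at compile time.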

 The applicability of the above listed performance improvement
techniques depends on the considered set of target applications.
 Superpipelining and Superscalarity are heavily used in high
performance General Purpose Processor (GPP) architectures to
increase single-thread performance of arbitrary applications at
the considerable expense of silicon area and power dissipation.
 On the other hand, embedded applications are severely energy and
cost constrained, but still have significant performance and
flexibility requirements.
 The most promising approach to jointly optimize flexibility and
performance is to exploit coarse-grain TLP instead of ILP and map
the loosely coupled tasks to individually optimized PEs.
 Embedded PEs of this kind mostly rely on the more power-aware
performance optimization techniques, like VLIW, multi-threading
and a domain specific or even application specific instruction set.

 MIMD control parallelism plays an important
role in embedded SoC architectures
 Parallel execution of specialized PEs offers
 Chance for improving application performance
 Without sacrificing power efficiency

 Homogeneous multi-processing refers to the
multiple instantiation of identical PEs
 Corresponds to a single chip
implementation of the MIMD
principle
 Homogeneous multi-processing of
general purpose embedded micro
controllers
 Achieves the performance scaling
required for control-plane
processing portion of embedded
applications
 Also found for dataplane processing
in domain specific MP-SoC
platforms, where the identical
instruction set of the PEs is tailored
to a certain application domain

 Heterogeneous multi-processing employs multiple PEs
 Different PEs individually tailored to a certain task or task set
 Dedicated optimization
 Applicable for the data-plane processing as it allows for a
manual and static task allocation
 The high degree of specialization in heterogeneous multi-
processing further optimizes computational efficiency for a well
defined set of target applications at the expense of generality

 Parallel execution
 Requires multiple computational resources
 More than one task can be active at the same
point in time.
 Concurrent execution
 Interleaved processing of several tasks on a
single resource,
 At any time only one task can be active

 Benefit of concurrent execution is
depicted in figure
 2 tasks are mapped to a single
processing element
 Both tasks are divided into 2
processing portions
 These are separated by a
communication request
 After Δt_delay the processing of the
first portion is finished and the task
is blocked for Δt_response until the
request is accomplished.
 Instead of wasting the processor
resource during this period, the
processor context is swapped to the
second task by a scheduler.
 Utilization of the processor is
increased and the request latency
is hidden
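The latency-hiding effect can be checked with a small calculation. The sketch below is only an illustration under stated assumptions: two identical tasks, each with two processing portions of length t_proc separated by one request with latency t_resp, and a context switch that costs no time. The function names are hypothetical.

```python
def makespan_blocking(t_proc, t_resp):
    """Each task runs to completion; the processor idles during the request."""
    return 2 * (t_proc + t_resp + t_proc)


def makespan_interleaved(t_proc, t_resp):
    """The scheduler swaps to the other task while a request is outstanding."""
    t = t_proc                    # task 1, first portion; request issued
    ready1 = t + t_resp           # time at which task 1 may continue
    t += t_proc                   # task 2, first portion; request issued
    ready2 = t + t_resp           # time at which task 2 may continue
    t = max(t, ready1) + t_proc   # task 1, second portion
    t = max(t, ready2) + t_proc   # task 2, second portion
    return t


def utilization(t_proc, makespan):
    """Fraction of time the processor does useful work (4 portions in total)."""
    return 4 * t_proc / makespan
```

With t_proc = 4 and t_resp = 3 the blocking variant needs 22 time units, while the interleaved variant needs 16, so the processor is fully utilized and the request latency is hidden.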

The main aspects of
SoC on-chip communication elements

 Basic cost, performance,
power, and flexibility
metrics apply.
 Additionally, Quality of
Service (QoS) metrics
known from the
networking application
domain are of increasing
importance to manage
complex on-chip traffic
 The scalability of the
communication
architecture gains growing
attention

 Bus based on-chip communication paradigm is
derived from the Printed Circuit Board (PCB)
domain.
 Examples:
 VME (Versa Module Eurocard bus)
 PCI (Peripheral Component Interconnect)
 Advantages:
 Easy programming model
 High flexibility
 Abundant availability of Intellectual Property (IP)
 Suited for small and medium scale embedded
systems where a small number of blocks
exchange moderate amounts of data.

 Implement master-slave communication
scheme,
 Active initiators along with passive target
modules are hooked to a shared
communication medium
 Typical masters:
 Processors
 DMA controllers
 Autonomous ASIC blocks,
 Typical slaves:
 Memories
 Co-processors
 Other peripherals
 Other components:
 Arbitration units: Grant the access to the
communication medium to one of the
competing master modules
 Decoder units: Activate the target module
based on the actual address and the address
map, which maps the target modules into
the bus address space
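The role of the decoder unit can be sketched as a lookup in the address map. The map entries below (base addresses, sizes and target names) are hypothetical and serve only as an illustration.

```python
# Hypothetical address map: (base address, size in bytes, target name).
ADDRESS_MAP = [
    (0x0000_0000, 0x0001_0000, "boot_rom"),
    (0x2000_0000, 0x0010_0000, "sram"),
    (0x4000_0000, 0x0000_1000, "uart"),
]


def decode(address):
    """Return (target name, offset inside the target), or None if unmapped."""
    for base, size, name in ADDRESS_MAP:
        if base <= address < base + size:
            return name, address - base
    return None
```

For example, decode(0x2000_0010) selects the target "sram" at offset 0x10, and an address outside all ranges yields None.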

 Bandwidth
 Is the premier performance metric
 Denotes the maximum transfer capacity of the
bus
 Available bandwidth is measured in bits per
second
 Corresponds to the number of parallel data wires
divided by the bus clock period
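As a worked example of this relation, the following sketch computes the peak bandwidth from the bus width and the clock frequency. It is an upper bound only: arbitration and protocol overhead are ignored, and the function name is an assumption made for illustration.

```python
def bus_bandwidth_bps(data_wires, clock_hz):
    """Peak bandwidth in bits per second: data wires / clock period."""
    clock_period_s = 1.0 / clock_hz
    return data_wires / clock_period_s
```

A 32-bit bus clocked at 100 MHz has a clock period of 10 ns and therefore a peak bandwidth of about 3.2 Gbit/s.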

 Pipelining:
 Well known technique to improve the communication
throughput
 Clock frequency is limited by the critical path
 Inserting an additional pipeline stage into the critical
path allows a higher clock frequency
 Yields a higher communication bandwidth
 Since the address decoder is usually an integral part of the
critical path, bus transactions in high performance buses
are executed in separate address and data stages

 Burst modes:
 Improve communication throughput for the linear
access of subsequent addresses by a single
master
 Address counter is incremented automatically
 Next data item is transferred with every cycle
without renewed arbitration
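The benefit of burst modes can be estimated by counting bus cycles. The cycle costs used below (one cycle each for arbitration, address phase and data phase) are assumptions for illustration and do not describe any particular bus standard.

```python
def cycles_single_transfers(n_items, arb=1, addr=1, data=1):
    """Every item is arbitrated and addressed individually."""
    return n_items * (arb + addr + data)


def cycles_burst(n_items, arb=1, addr=1, data=1):
    """One arbitration and one address phase, then one data item per cycle."""
    return arb + addr + n_items * data
```

Under these assumptions, transferring 8 items takes 24 cycles as single transfers but only 10 cycles as one burst.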

 Unidirectional data links
 Distinguish on-chip buses from most on-board
buses
 The latter are based on tristate data wires to
maximize the utilization of expensive on-board
wires

 Hierarchy
 Common bus systems separate high
performance from low performance
communication
 Two buses with different speed
characteristics

 Multilayer bus architectures
 Provide dedicated point-to-point connections
between distinctive initiators and targets to
eliminate bandwidth bottlenecks
 The required de-multiplexer at the initiator side is
called the input stage; the corresponding multiplexer
at the target side is called the output stage

 Crossbar bus architectures:
 Provide multiple parallel resources between initiators
and targets
 Significantly improve the traffic throughput
 Degree of parallelism may vary from partial crossbar
to full crossbar architectures, where the latter
provides an individual resource for each connected
target

 Arbitration:
 Can be based on various algorithms,
 Simple round robin
 Fixed, Configurable or dynamic priority schemes
 Static or Dynamic Time Division Multiple Access
(TDMA).
 Even more advanced algorithms are known to
further improve the quality of service.
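The simplest of these schemes, round robin, can be sketched as follows. The class and method names are hypothetical; the sketch models only the grant decision, not the bus signals.

```python
class RoundRobinArbiter:
    """Grants the bus to requesting masters in cyclic order."""

    def __init__(self, n_masters):
        self.n = n_masters
        self.last = n_masters - 1  # so that master 0 is checked first

    def grant(self, requests):
        """requests[i] is True if master i wants the bus.

        Returns the index of the granted master, or None if nobody requests.
        """
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None
```

The search always starts one position after the master that was granted last, so no requesting master can be starved by the others.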

 Locking of a bus:
 Locking of the bus by a single master is a necessary feature
to support read-modify-write semaphore operations.
 This feature is required by most micro-controller
architectures, which run operating systems
 Split transaction buses
 Allow the master to issue multiple requests without
waiting for a response, i.e. request and response are
separated
 Out-of-order execution
 Improves the bus throughput by reordering the sequence
of responses, depending on the availability of the slave
component
 This feature requires advanced state-machines in the
master modules to cope with the non-deterministic
sequence of responses

 Physical Issues.
 Implemented using a standard cell based semi-custom implementation
flow
 Transmission wires are not physically optimized,
 which leads to timing closure issues and unreliable communication links.
 Examples of physical effects are crosstalk noise, electromagnetic
interference, and radiation-induced charge injection
 Synchronous Design.
 Most current bus architectures require all connected modules in a
single clock domain.
 Due to the parasitic capacitances of long bus wires, strong driver
transistors are necessary to achieve timing closure
 This leads to increased power dissipation
 Future SoC designs will follow the Globally Asynchronous Locally
Synchronous (GALS) paradigm,
 Chip-wide wires will span multiple clock domains, which disqualifies
bus architectures as the future chip-level transport mechanism

 Traffic Management.
 Due to the rather simple arbitration mechanisms, shared buses
provide only rudimentary traffic management support.
 Since the communication pattern highly depends on the spatial
and temporal execution of the application tasks, meeting the
individual QoS requirements like throughput, jitter, or ordering
of the respective tasks is very challenging.
 This also causes the poor scalability of bus-based
communication infrastructures, since every change in the
traffic profile of one part of the application and every
additional component influences the other parts and requires
renewed balancing of the bus architectures.
 Interoperability.
 Although simple standard peripherals, like DMA, IRC, or
memories are available for respective bus systems, it is a
tedious and error-prone task to adapt complex IP blocks to a
specific bus architecture.
 So far, efforts to create standard bus interfaces have not been
successful

 To cope with the limitations of
shared bus architectures, alternative
on-chip communication concepts
have emerged in the form of the
Network-on-Chip (NoC) design paradigm
 Aims to replace current ad hoc
wiring of IP blocks with a
disciplined approach where full-
scale on-chip networks provide
communication services according
to the ISO/OSI reference model
 Problems in on-chip
communication like signal integrity
issues, link reliability, or Quality of
Service (QoS) are separately
resolved on the respective OSI
layer

 The four lower layers of the OSI model are of interest
 Physical Layer
 deals with the electrical aspects of the data
transmission
 E.g. signal voltages, clock recovery, and pulse shape
 Data Link Layer
 provides a reliable data transfer over the physical link.
 Error detection by means of block codes and error
correction mechanisms like:
 Automatic Repeat Request (ARQ)
 Forward Error Correction (FEC)
 Network Layer
 implements the arbitration algorithms, buffering
strategies and flow-control mechanisms
 So, the network layer has a dominant impact on the
performance and functional behavior of the network.
 Transport Layer protocols
 establish and maintain end-to-end connections.
 The transport layer manages rate-based flow control,
performs packet segmentation and reassembly, and
ensures message ordering
 This abstraction hides the topology of the network,
and the implementation of the links that make up the
network

 The challenge in the development of Network-
on-Chip architectures is to combine the know-
how from both the networking and VLSI domain.
 Also the users of on-chip networks have to
understand basic networking principles:
 First the system architect has to specify design time
parameters of the selected NoC architecture like
topology, buffer sizes, arbitration algorithm.
 Later the platform programmer has to configure
runtime parameters like priorities, routing tables,
buffer management thresholds to take advantage of
the capabilities

 Transport layer is the first to provide
services which are independent of the
implementation of the network
 Enables the platform programmer to
develop embedded software independently
from the interconnect architecture
 A key ingredient in tackling the challenge of
decoupling the computation from
communication
 Interaction with the network becomes
deterministic, rather than prognostic or
reactive like in today’s bus based
communication architectures
 For complex multi-hop networks it is
difficult to provide uniform Quality of
Service (QoS) guarantees like lower
bandwidth bounds, or packet ordering for
the complete on-chip traffic
 To combine high resource utilization with
high QoS requirements of certain traffic
types, researchers in the field of computer
networks distinguish guaranteed services
and best effort service classes

 Guaranteed Services
 Require resource reservation for worst-case scenarios
 Can be expensive as guaranteeing the throughput for a
stream of data implies reserving bandwidth for the peak
throughput, even when its average is much lower.
 So, resources are often underutilized
 Best-effort Services
 Do not reserve any resources, and hence provide no
guarantees.
 Best-effort services utilize resources well as they are
typically designed for average-case scenarios instead of
worst-case scenarios.
 Are also easy to configure,
 Require no resource reservation
 Main disadvantage: unpredictability of the effective
performance

 The network layer is implemented by the
routing nodes of the NoC.
 Router-based network implementations
are classified by:
 Switching Mode
 Routing Mode
 Queuing
 Congestion Control

 Switching mode:
 Circuit switching
 Connections are set up by establishing a
conceptual physical path from a source to a
destination.
 Links can be shared between two connections
only at different points in time, by using the
time-division multiplexing (TDM) scheme
 Packet switching
 Data is divided into packets and every packet is
composed of a header and the payload.
 The header contains information that is used by
the router to switch the packet to the
appropriate output port

 Routing mode: applies to packet-switched networks and
defines the way packets are transmitted and buffered
between network nodes
 Store-and-forward
 An incoming packet is received and stored entirely before it is
forwarded to the next node.
 Worm-hole routing
 An incoming packet is forwarded as soon as the packet header is
evaluated and the next router guarantees that the complete packet
will be accepted.
 In case the next hop is blocked, the packet tail potentially blocks
other resources
 Virtual cut-through
 An incoming packet is forwarded as soon as the next router
guarantees, that the complete packet will be accepted.
 In case the next hop is blocked, the packet tail is stored in a local
buffer
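The difference between the routing modes shows up in the zero-load latency. The following sketch uses a deliberately simplified model: every link transfers one flit per cycle, no hop is blocked, and a router needs only the header flits before it can forward. Under these assumptions worm-hole routing and virtual cut-through behave identically. All names are hypothetical.

```python
def store_and_forward_cycles(hops, packet_flits):
    """The whole packet is received at every hop before it is forwarded."""
    return hops * packet_flits


def cut_through_cycles(hops, packet_flits, header_flits=1):
    """Only the header is delayed per hop; the rest follows in a pipeline."""
    return hops * header_flits + (packet_flits - header_flits)
```

For a 16-flit packet travelling over 4 hops, store-and-forward needs 64 cycles, whereas the cut-through modes need 19 cycles in this model.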

 Queuing: Buffering strategies can be distinguished by the location of the buffers inside
the router.
 In the following, N denotes the number of bi-directional router ports.
 Input queuing:
 A router has a single input queue for every incoming link.
 Suffers from the so-called head-of-line blocking problem, i.e. the router utilization saturates at
about 59%,
 Weak link utilization.
 Output queuing:
 There are N output queues for every outgoing link, resulting in N² queues.
 Yields optimal performance,
 The costly N²-fold storage and wiring effort prohibits the implementation for a large number of
ports
 Virtual output queuing:
 Combines the advantages of input queuing and output queuing
 Avoids the head-of-line blocking problem.
 Each input port maintains a separate queue for each output port
 Key factor in achieving high performance using VOQ switches is the scheduling algorithm
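Head-of-line blocking and its removal by virtual output queues can be demonstrated for a single switching cycle. The sketch below is an assumption-laden illustration: the scheduler used for the VOQ case is a simple heuristic that serves the most constrained input first, not one of the scheduling algorithms used in real switches.

```python
from collections import deque


def fifo_cycle(inputs):
    """Input queuing: only the head packet of each input FIFO is considered.

    inputs[i] is a deque of destination ports. Returns (input, output) pairs.
    """
    taken, sent = set(), []
    for port, queue in enumerate(inputs):
        if queue and queue[0] not in taken:
            taken.add(queue[0])
            sent.append((port, queue.popleft()))
    return sent


def voq_cycle(voqs):
    """Virtual output queuing: voqs[i][o] counts packets at input i for output o.

    Inputs with the fewest non-empty queues choose first (simple heuristic).
    """
    order = sorted(range(len(voqs)), key=lambda i: sum(1 for c in voqs[i] if c))
    taken, sent = set(), []
    for i in order:
        for o, count in enumerate(voqs[i]):
            if count and o not in taken:
                taken.add(o)
                voqs[i][o] -= 1
                sent.append((i, o))
                break
    return sent
```

With input 0 holding packets for outputs 0 and 1 and input 1 holding a packet for output 0, the FIFO switch sends one packet per cycle because the packet for output 1 is stuck behind the head, while the VOQ switch sends two.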

 Congestion control:
 Packet switched networks without mechanisms for
bandwidth reservation may run into resource
contention and subsequent buffer overflow.
 Several solutions prevent packets from entering until
contention is reduced
 Packet discarding: Simply drops packets in case of buffer
overflow
 Credit based flow control: Packet loss is prevented in a
deterministic way, either by signaling congestion via
separate wires (back-pressure) or by having the receiver
regularly inform the sender about the available buffer space
(window).
 Rate based flow control: the sender gradually adjusts the
traffic generation rate in response to flow control
messages from the receiver. Rate based flow control has
to be implemented by the transport layer and potentially
suffers from instability due to long control loops
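The window variant of credit based flow control can be sketched with a counter of free buffer slots. The class below is a hypothetical minimal model: sender and receiver share one object, and returning a credit takes no time.

```python
class CreditLink:
    """Sender may transmit only while it holds credits for receiver buffers."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots  # free slots known to the sender
        self.buffer = []             # receiver-side buffer

    def send(self, packet):
        """Returns False (sender must stall) instead of dropping the packet."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffer.append(packet)
        return True

    def consume(self):
        """Receiver drains one packet and returns a credit to the sender."""
        packet = self.buffer.pop(0)
        self.credits += 1
        return packet
```

Because the sender stalls when no credit is left, the receiver buffer can never overflow and no packet is lost.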

 Architectural trends
 Set the stage for the discussion of appropriate system
level design methodologies
 Processing elements
 Requirements for performance, power efficiency and
flexibility
 SIMD, VLIW, super-pipelining, and hardware multi-
threading exploit application-inherent instruction-, data-,
and task-level parallelism
 Communication: Bus Architectures Vs Network-on-
Chip
 Virtualization of architectural resources enables
’divide-and-conquer’
 Embedded control-plane processing tasks are executed
in the user space of the Real-Time Operating System
(RTOS),
 Embedded data-plane processing tasks are executed on
HW multi-threaded processing elements
 Global communication of control- and data-plane
processing elements is performed by elaborated on-chip
networks

High Level Synthesis
Low Power Design

 System level:
 Highest level circuit abstraction
 The system is specified as processes and tasks
 A mix of hardware and software.
 Concerned with overall system structure and information flow.
 Computer systems are described as an interconnected set of
processors, memories and switches
 Behavioral level, algorithmic level or high level
 Also called the instruction set level.
 Focus is on the computations performed by an individual processor;
i.e., the way it maps sequences of inputs to sequences of outputs
 Architecture, microarchitecture, RTL
 Viewed as a set of interconnected storage elements and functional
blocks.
 Behavior of the system is described as a series of data transfers and
transformations between the storage elements
 Microarchitectural-level representation of the chip resources, such as
adders and subtractors, is determined along with decisions such as
single-cycle, multicycle, pipelined or superscalar implementation

 Logic level
 System is described as a network of gates and flip-flops,
 Behavior is specified by logic equations
 Circuit is represented in the form of a netlist at which level logic
realizations of functional blocks are determined
 Circuit or transistor level
 Circuit is a netlist of transistors
 Decisions such as how and what types of transistors will be used,
complementary CMOS, pass transistors, etc. are the main issues
 Physical or layout level
 System is specified in terms of the individual transistors of which it is
composed
 Behavior of the system can be described in terms of the network
equations
 Lowest level of circuit abstraction
 Chip is a sequence of layers (masks), each layer of which is composed
of polygons.
 It is this level that is transferred to the manufacturing process

 Design automation terminology:
 Optimization
 Synthesis
 Analysis
 In circuit analysis, the behavior or
characteristics of a circuit are studied
 The task of synthesis is to take the
specifications of the behavior required for
a system and a set of constraints and goals
to be satisfied and to find a structure that
implements the behavior while satisfying
the goals and constraints
 Behavior, structure and physical design: 3
domains in which hardware is described
 “Behavior”:
 Refers to the ways in which the system or its
components interact with their environment
(mapping from inputs to outputs)
 interest is in what a design does, not in how it is
built
 “Structure”
 Refers to the set of interconnected components
that constitute the system (described by a netlist)
 Focus on constraints, such as area, cost and delay.
 “Physical” design
 Mapping of the structure onto the technology
 Ignores what the design is supposed to do
and binds its structure in space or to
silicon

 The automatic design process of VLSI circuits is called synthesis

 System-synthesis process partitions the tasks
into hardware, software and their
communications
 High-level synthesis process is the translation
from behavioral description to its equivalent
structural description
 Logic synthesis is the process of mapping
from the design at the RTL to a gate-level
representation that is suitable for input to
physical design

 Physical design then addresses aspects of chip
implementation
 Floor planning
 Placement
 Routing
 Extraction
 Performance analysis
 Output of physical design is the handoff
(“tapeout”) to manufacturing
 Delivered as a GDSII stream file
 Verification of correctness
 Design rules
 Layout versus schematic
 Constraints (timing, power, reliability, etc.)

 During each phase of the synthesis process,
the functional equivalence of two
consecutive phases is to be checked to
ensure that they are functionally the same
 A power and timing analysis study can be
done by using compact models at the
transistor level
 At the physical level, more accurate power
and timing analysis is possible through the
extraction of accurate parasitics

 High-level synthesis is the translation process
from a behavioral description to a structural
description

 Analogous to “compilation” that
translates a high-level language
program in C/C++ to an assembly
language program
 HLS is also known as behavioral-level
synthesis or algorithmic-level
synthesis.
 Constraints to be considered in HLS
are:
 Area
 Performance
 Power consumption
 Reliability
 Testability
 Cost.
 HLS allows a design engineer
to make decisions at an early stage of
the design cycle, which helps to ensure a
correct design.
 Typical steps involved are scheduling,
binding, allocation, etc.

 Advantages:
 Continuous and reliable design flow
 From system-level abstraction to RTL abstraction automatically without manual handling
 Automatic translations from high-level specifications in the form of C or SystemC to RTL description of
the circuit in the form of VHDL or Verilog.
 Shorter design cycle
 More automation: faster designs, lower cost
 Fewer errors
 Synthesis process can be verified easily, so the chances of errors will be smaller.
 Correct design decisions at the higher levels of circuit abstraction can ensure that the errors are not
propagated to the lower levels, which are too detailed and costly to correct
 Easy and flexible to search the design space
 Synthesis system can produce several designs in a short time
 So, the designer has more flexibility to choose the proper design considering different trade-offs of
power, leakage, area and delay.
 Balanced degree of freedom for power optimization
 Power and performance optimization can be performed at any level of circuit abstraction
 As the level of abstraction goes lower, the complexity of the circuit increases
 Additionally, the degrees of freedom, and thus power reduction opportunities decrease
 Hence, high level or behavioral level is an attractive level and provides a balanced degree of freedom
for design space exploration.
 Documenting the design process
 Automated system can track design decisions and their effects
 Design debugging and continuation by third parties can be easily done
 Useful for macrocell-based design and the sale of designs as intellectual property cores
 Availability of circuit technology to more people
 Design expertise is moved into synthesis systems
 It becomes easier for a non-expert to produce a chip that meets a given set of specifications
 The cost of the required manpower is reduced

 The high-level synthesis process
takes a system description written in a
hardware description language
(HDL) as input and generates an
optimized RTL description by:
 Compilation
 Transformation
 Scheduling
 Allocation
 Binding
 Other steps
 Power optimization
 Leakage optimization
 Register optimization
 Interconnect optimization
 Take place in synthesis either
sequentially or along with the
fundamental steps
 No fixed sequence for
performing various high-level
synthesis tasks
 They are independent of each
other
 Yet, these tasks should be
performed simultaneously for
effective optimization

 The behavior of a system to be synthesized is
usually specified at the algorithmic level using a
high-level programming language like C/C++ or a
hardware description language (HDL) such as
VHDL and Verilog.
 The behavior of the system is then compiled into
internal representations, which are usually data
flow graphs (DFGs) and control flow graphs
(CFGs).
 Each behavioral specification is transformed into
a unique graphical representation.
 The DFG is a directed graph that represents data
movement, whereas the CFG is a directed graph
that indicates the sequence of operations.

 In the transformation step, the initial DFG is
transformed so that the resultant DFG is more
suitable for scheduling and allocation.
 These transformations include compiler-like
optimizations such as dead-code elimination,
common sub-expression elimination, loop
unrolling, constant propagation and code
motion.
 In addition, some hardware-specific
transformations like minimization of syntactic
variances and retiming may be applied to take
advantage of the associativity and commutativity
of certain operations
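One of the listed transformations, common sub-expression elimination, can be sketched on a list of operations. The representation is an assumption made for illustration: each operation is a tuple (result, operator, left, right), and every name is assigned only once. Commutativity of "+" and "*" is used to recognize a + b and b + a as the same expression.

```python
def eliminate_common_subexpressions(ops):
    """Return a new operation list in which repeated expressions occur once."""
    seen = {}    # (operator, left, right) -> name that already holds the value
    alias = {}   # eliminated name -> surviving name
    out = []
    for result, op, left, right in ops:
        # Redirect operands that referred to an eliminated operation.
        left, right = alias.get(left, left), alias.get(right, right)
        if op in ("+", "*"):
            # Commutative operators: use a canonical operand order.
            left, right = sorted((left, right))
        key = (op, left, right)
        if key in seen:
            alias[result] = seen[key]
        else:
            seen[key] = result
            out.append((result, op, left, right))
    return out
```

For t1 = a + b, t2 = b + a, y = t1 * t2 the second addition is removed and y becomes t1 * t1, which saves one adder operation in the resulting DFG.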

 Scheduling is the process of partitioning the set of
arithmetic and logical operations in the DFG into
groups so that the operations in the same group can
be executed concurrently, while taking into
consideration possible trade-offs between the total
execution cost and hardware cost.
 A group of concurrent computations to be executed
simultaneously is referred to as a control step.
 The total number of control steps needed to execute
all operations in the DFG, the minimum number of
functional units of each type to be used in the design
and the lifetimes of the variables generated during
the computation of operations are determined in the
scheduling step.
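A basic scheduling strategy is as-soon-as-possible (ASAP) scheduling, which places every operation in the earliest control step allowed by its data dependencies. The sketch below assumes unit-delay operations, no resource constraints and an acyclic DFG given as a dictionary; it is one possible scheduler, not necessarily the one used by any specific HLS tool. The function names are hypothetical.

```python
def asap_schedule(dfg):
    """dfg maps each operation to the list of operations it depends on.

    Returns {operation: control step}, with control steps numbered from 1.
    """
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in dfg[op]), default=0)
        return step[op]

    for op in dfg:
        visit(op)
    return step


def units_per_type(step, op_type):
    """Functional units needed per type: the maximum use in any control step."""
    usage = {}
    for op, s in step.items():
        key = (s, op_type[op])
        usage[key] = usage.get(key, 0) + 1
    need = {}
    for (_, kind), count in usage.items():
        need[kind] = max(need.get(kind, 0), count)
    return need
```

For (a + b) * (c + d) - e the two additions share control step 1, the multiplication follows in step 2 and the subtraction in step 3, so this schedule needs three control steps and two adders.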

 Selection is the process of choosing resources from the
library, which involves tradeoffs according to different
features like delay, area, power and leakage.
 Resource allocation is the process of determining the
number of functional units of each type for performing
operations, memory units (registers) for storing data
values and interconnects for data transportation.
 Often, the selection and allocation processes are a single
task.
 Allocation is further divided into sub-tasks, such as
functional unit allocation, memory unit allocation and
interconnect allocation.
 Resource allocation and binding may share resources so
that the same hardware can be used to execute different
operations or so that the same register can be used to
store more than one variable.

 Binding or assignment is the process of assigning
operations to functional units, variables to memory
units and data transfers to interconnections.
 Binding is further divided into several sub-tasks, such
as functional unit binding, memory unit binding and
interconnect binding.
 Functional unit binding involves the mapping of
operations in the behavioral description into a set of
selected functional units.
 Memory unit binding maps data carriers (constants,
variables, arrays) in the behavioral description onto
storage elements (read-only memories, registers,
memory units) in the data path.
 The interconnect binding task maps every data
transfer in the behavior onto a set of interconnection
units for data routing.
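Memory unit binding for registers can be sketched with a greedy interval method in the spirit of the left-edge algorithm: variables are sorted by the start of their lifetime, and each one is placed in the first register that is already free. The lifetimes are assumed to come from the scheduling step; a variable with lifetime (birth, death) occupies its register from control step birth up to, but not including, death. The function name is hypothetical.

```python
def bind_registers(lifetimes):
    """lifetimes maps variable -> (birth step, death step).

    Returns {variable: register index}; variables with disjoint lifetimes
    may share a register.
    """
    free_at = []   # free_at[r] = control step at which register r is free again
    binding = {}
    for var, (birth, death) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for reg, step in enumerate(free_at):
            if step <= birth:
                free_at[reg] = death
                binding[var] = reg
                break
        else:
            # No existing register is free: allocate a new one.
            free_at.append(death)
            binding[var] = len(free_at) - 1
    return binding
```

For the lifetimes a: (1, 3), b: (1, 2), c: (2, 4), d: (3, 5), the variables b and c share register 0 and a and d share register 1, so two registers are enough for four variables.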

 In the output generation phase, design
output is generated.
 The output should be in a form such that
logic-level synthesis tools can optimize the
combinational logic and layout synthesis
tools can design the chip geometry.
 The generated output is generally in a low-
level HDL, such as structural VHDL

 Data Path Synthesis
 Control Synthesis
 The controller is typically a finite state machine
that is either microcoded or hardwired

 HLS is important for several reasons
 Reduction of design cycle time
 Rapid design space exploration at the higher level of
abstraction
 Wrong decisions are not propagated to lower levels of design
abstraction,
 HLS involves several important steps, such as:
 Scheduling
 Allocation
 Binding
 Several graph theoretical algorithms are available that can
perform optimization while performing these tasks.
 Two Types
 Data path
 Control synthesis
 There are existing tools to perform high-level synthesis
explicitly, and some tools perform the behavioral to RTL
compilation as an intermediate process.

Introduction to SoC Design Methodology

 Design flow of
integrated circuits
 Application phase
 Implementation
phase
 Both are decoupled
 Application to
implementation
 A specification
document written by:
 Application team
 System architecture
specialist
 Ad-hoc and informal
approach

 Problems
 Ambiguity of the informal specification
document leads to misinterpretations and
implementation errors
 Lack of reliable performance information before
the implementation often causes an over- or
under-provisioning of processing and
communication resources
 Quality of results mainly depends on the intuition
and experience of the system architect
 Manual creation of the verification environment
requires significant effort and again represents a
potential source of inconsistencies with the
original design intent

 Electronic System Level
(ESL)
 Application is jointly
considered with the system
architecture to find a
feasible and cost effective
application to architecture
mapping
 The declared goal of ESL
design is to increase the
engineering productivity and
quality of results during the
specification of the MP-SoC
platform architecture and
application mapping

 New design paradigm to cope
with the:
 complexity
 economics
of the emerging billion-transistor
System-on-Chip era.
 Architecture centric definition
 We define platform-based design
as the creation of a stable
microprocessor-based architecture
that can be rapidly extended,
customized for a range of
applications, and delivered to
customers for quick deployment
 Design process based definition
 The general definition of a
platform is an abstraction layer in
the design flow that facilitates a
number of possible refinements
into a subsequent abstraction
layer in the design flow

 Multiple, almost orthogonal phases
 Functional phase
 Performed by application specialists
 Completely agnostic to architectural considerations.
 Includes
 Embedded SW development of the control-plane portion
 Data-plane algorithm development
 The latter is carried out using highly application domain specific tools and methodologies
 MP-SoC platform phase
 All design tasks that have to be performed under consideration of the full functional and
architectural complexity of the MP-SoC platform
 Example
 Specification of the system-architecture
 Mapping of the application onto the MP-SoC platform
 Development of the hardware-dependent software layers
 High-level IP creation phase
 Design of processing elements (RISC, DSP, MCU, ASIPs)
 On-chip interconnect technologies (busses, NoC),
 Domain-specific standard I/O (PCI variants, SPIx variants, HyperTransport, I2C, FireWire, QDR,
etc.),
 Creation of well defined ASIC IP blocks (e.g. an MPEG4 video codec).
 Not completely orthogonal to the functional phase, since the design of application specific
processing elements and communication IP indeed depends on the considered application
 Semiconductor technology and basic IP creation phase
 Covers standard cells, I/O, memories and the basic technology processes supporting them.
 More heterogeneous technologies, combining embedded DRAM, embedded Flash, mixed-signal
BiCMOS, RF, and analog
 More to do with fabrication technologies

 Represent the results of the
functional phase as a well
defined application model
as the Executable
Specification of the system
 System architecture needs
to be defined in terms of
mapping the application
model to the hardware
(Main Task)
 Embedded SW development
 Hardware-Software co-
verification task: RTL is
verified along with
embedded software
 Methodology used:
Transaction Level Modeling
(TLM)

 Engineering of integrated circuits has always employed
models on different levels of abstraction
 Model: unique, idealized description of the considered system
 Degree of abstraction characterizes the type of model used in
the respective design phase
 Goal of abstraction is to provide a description of the system,
 which is simple enough
 yet sufficiently accurate to enable the necessary investigations
 take design decisions
 proceed to the next design phase.
 Indeed, the design-flow of an embedded system can be
considered as a sequence of steps which successively
reduce the degree of abstraction in the system model

 Functionality refers to the modeling of the
system behavior
 On the highest level of abstraction, the
functionality is condensed to pure
mathematical expressions.
 Later the functionality is refined to
operators,
 Finally mapped to logic gates
 Timing model captures the temporal
properties of the system
 Degree of abstraction ranges from causality
of events to physical timing of transistors
and wires
 Data representation
 Higher level data resolution is reduced to
Tokens and Abstract Data Types (ADT)
 Lower levels employ word or bit
representations.
 The Component granularity describes the
finest resolution of the sub-blocks
 First the component resolution is restricted
to coarse-grain building blocks,
 Finally the complete embedded system is
composed of fine-grain silicon transistors.

 Creation of a system model
requires:
 Modeling language
 Well-defined execution semantics
coordinating the activation of the
individual blocks
 Model of Computation (MoC) is
composed of two parts:
 Coordination language describes
basic execution semantics with
respect to properties like
parallelism, synchronism,
reactivity and provides the
abstracted communication
mechanism
 The host language provides the
language elements for the
specification of the system
models

 Timed MoCs are characterized by the total temporal
ordering of all occurring communication events
 Example is the discrete event simulation MoC,
which defines the execution semantics for HDL
simulators
 Further examples of timed MoCs are synchronous
languages like Esterel, Lustre, or Signal, where
the events of all communication signals are
constrained to occur at identical time stamps
 Thanks to their sound mathematical foundation,
synchronous languages have gained adoption for
the specification, analysis and code-generation
of reactive control-dominated applications
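
The total ordering of events in a discrete event MoC can be illustrated with a toy scheduler. This is only a minimal sketch in Python, not the kernel of any real HDL simulator; the class name `DiscreteEventKernel`, the helper `simulate_clock_and_reset`, and the chosen delays are assumptions made for illustration.

```python
import heapq


class DiscreteEventKernel:
    """Toy discrete-event scheduler: events are totally ordered by (time, seq)."""

    def __init__(self):
        self.now = 0
        self._queue = []  # min-heap of (time stamp, sequence number, callback)
        self._seq = 0     # tie-breaker: equal time stamps keep insertion order

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        # Always process the event with the smallest time stamp next.
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback(self)


def simulate_clock_and_reset():
    """Hypothetical example: a clock ticking every 10 time units plus one reset event."""
    trace = []
    kernel = DiscreteEventKernel()

    def clock_edge(k):
        trace.append((k.now, "clk"))
        if k.now < 30:
            k.schedule(10, clock_edge)

    def release_reset(k):
        trace.append((k.now, "reset_released"))

    kernel.schedule(0, clock_edge)
    kernel.schedule(15, release_reset)
    kernel.run()
    return trace
```

Although the reset event is scheduled before most clock edges exist, the trace comes out sorted by time stamp: clock edges at 0 and 10, the reset release at 15, then clock edges at 20 and 30.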

 Untimed MoCs are characterized by the fact that
communication events are only partially ordered
 Various untimed MoCs are popular for
the specification of both data- and control-
dominated applications
 Data-Flow MoCs are heavily employed for algorithmic
modeling and analysis of signal processing
applications
 Communicating Sequential Processes (CSP) and
Calculus for Communicating Systems (CCS) are
prominent untimed MoCs which are based on
sequential processes that communicate using a
rendezvous communication mechanism.
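
A Data-Flow MoC can be sketched without any notion of time: an actor may fire whenever its input FIFOs hold tokens, and the order in which enabled actors fire does not change the result. The following Python sketch assumes a two-actor pipeline; `Actor`, `run_dataflow`, and `filter_example` are hypothetical names, and the scaling and offset functions are arbitrary.

```python
from collections import deque


class Actor:
    """Data-flow actor: fires when every input FIFO holds at least one token."""

    def __init__(self, name, func, inputs, output):
        self.name = name
        self.func = func
        self.inputs = inputs
        self.output = output

    def can_fire(self):
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        # Consume one token per input, produce one token on the output.
        args = [fifo.popleft() for fifo in self.inputs]
        self.output.append(self.func(*args))


def run_dataflow(actors):
    """Fire enabled actors until none is enabled; no time stamps are involved."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


def filter_example(samples, order):
    """Two-stage pipeline; 'order' only changes the scheduling, not the result."""
    src, scaled, out = deque(samples), deque(), deque()
    scale = Actor("scale", lambda x: 2 * x, [src], scaled)
    offset = Actor("offset", lambda x: x + 1, [scaled], out)
    actors = [scale, offset] if order == "forward" else [offset, scale]
    run_dataflow(actors)
    return list(out)
```

Calling `filter_example([1, 2, 3], "forward")` and `filter_example([1, 2, 3], "reverse")` both yield `[3, 5, 7]`: the events are only partially ordered, yet the output stream is the same.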

 The definition of a proper MoC has long been considered to
be the silver bullet for system level design and by that for
the solving of the design productivity crisis
 Initially, the complete system functionality is to be created
using the ideal MoC, which provides highest modeling
efficiency, simulation speed, and smooth IP reuse
 Next, the initial specification would be automatically
verified using formal verification technology and metrics
like determinism, causality, dead-lock absence,
consistency, completeness, and fairness. The golden
system specification would then provide the foundation for
an automated path to design space exploration to take
functional and architectural design decisions
 Finally, system level synthesis would be applied to the
partitioned system specification providing an automated
path to implementation.

 Object Oriented Programming (OOP) is a powerful abstraction
mechanism,
 Data and functionality are partitioned and encapsulated inside classes
 OOP based languages: UML, C++, or Java
 Widely adopted in engineering of arbitrary SW
 Gaining importance for the specification of embedded control-plane
processing
 OOP components interact primarily by sequentially transferring
control through method calls
 Sequential nature of OOP hinders the intuitive specification,
analysis and refinement of the inherent parallel data-plane
processing tasks

 For this purpose the actor-oriented abstraction
scheme has been conceived, where parallel
objects interact by sending and receiving
messages
 Within an actor-oriented design environment,
the designer can focus on the specification and
analysis of the algorithmic behavior of the
individual tasks whereas the communication and
synchronization aspects are handled by the
underlying parallel Model of Computation
 SystemC allows Actor Oriented Programming

 Actor-based design languages achieve high modularity in communication modeling by
using the Interface Method Call (IMC) principle
 The IMC mechanism is realized by a set of language elements for
 Modules
 Ports
 Interfaces
 Channels.
 Processes modeling the behavior are wrapped into modules and access communication
services through ports
 Available methods are
 Declared in the interface specification
 Implemented by the channel
 Thus the access methods in an interface reflect the specialized properties of the
communication style implemented by a particular channel
 Actor-oriented design languages offer a generic Model of Computation, which in the case of
SystemC is based on an event-driven simulation kernel
 Channels serve as containers for communication and synchronization
 The user can extend the generic MoC by creating a methodology-specific channel
library
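
The interplay of modules, ports, interfaces, and channels can be sketched as follows. This is a plain Python analogy of the IMC principle, not SystemC code; all class names (`FifoWriteIf`, `FifoReadIf`, `FifoChannel`, `Port`, `Producer`, `Consumer`) are assumptions chosen for illustration, and processes are reduced to ordinary method calls.

```python
from abc import ABC, abstractmethod
from collections import deque


class FifoWriteIf(ABC):
    """Interface: only declares the available access methods."""
    @abstractmethod
    def write(self, value): ...


class FifoReadIf(ABC):
    @abstractmethod
    def read(self): ...


class FifoChannel(FifoWriteIf, FifoReadIf):
    """Channel: implements the interfaces and holds the communication state."""
    def __init__(self):
        self._items = deque()

    def write(self, value):
        self._items.append(value)

    def read(self):
        return self._items.popleft()


class Port:
    """Port: typed by an interface, bound to a channel that implements it."""
    def __init__(self, interface):
        self._interface = interface
        self._channel = None

    def bind(self, channel):
        if not isinstance(channel, self._interface):
            raise TypeError("channel does not implement the required interface")
        self._channel = channel

    def __getattr__(self, name):
        # Forward only the methods declared in the port's interface.
        if name.startswith("_") or not hasattr(self._interface, name):
            raise AttributeError(name)
        return getattr(self._channel, name)


class Producer:
    """Module: wraps behavior and reaches the channel only through its port."""
    def __init__(self):
        self.out = Port(FifoWriteIf)

    def run(self, values):
        for value in values:
            self.out.write(value)


class Consumer:
    def __init__(self):
        self.inp = Port(FifoReadIf)

    def run(self, count):
        return [self.inp.read() for _ in range(count)]


def connect_and_run(values):
    channel = FifoChannel()
    producer, consumer = Producer(), Consumer()
    producer.out.bind(channel)
    consumer.inp.bind(channel)
    producer.run(values)
    return consumer.run(len(values))
```

Because the producer's port is typed by `FifoWriteIf`, it exposes `write` but not `read`. Swapping `FifoChannel` for another channel that implements the same interfaces would leave both modules unchanged, which is the separation of interface and behavior the slide describes.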

 Challenge of System Level Design
 The architecture definition and application
mapping have to be considered jointly by taking
the full functional and architectural complexity
into account
 In case of a fixed target platform, SLD is
reduced to the application mapping task,
which as a synonym term is also called the
partitioning of the application

 Orthogonalization of concerns with respect to all modeling attributes
generally enables a divide-and-conquer approach to System Level
Design
 Separation of interfaces and behavior according to the interface
based design paradigm fosters successive communication and structural
refinement as well as IP reuse
 High modeling efficiency and simulation speed are mandatory to
handle the high complexity of SoC designs
 Incorporation of hardware specific concepts like timing, reactivity,
parallelism, and determinism to express the impact of the platform
architecture
 Incorporation of software specific concepts like Object Oriented
Programming, Operating System (OS) encapsulation, Inter Process
Communication (IPC), process concurrency, as well as the creation,
mutual preemption, and termination of processes to enable smooth
integration of the embedded Software part.
 Support for validation and verification: first, to gain
evidence at the highest possible level of abstraction that the correct
system is being developed and that all performance and cost requirements
are met (validation). Later, the validated specification should be reused
as a golden reference model for the subsequent refinement, IP
integration and implementation steps (verification).
 Seamless transition between design phases and abstraction levels
from system to gates to avoid long iteration cycles caused by gaps in
the design flow.

Question Remains - - - How to do it???

More design aspects

 HW/SW Co-simulation has been recognized as a
necessary ingredient for HW/SW Co-design.
 First HW/SW Co-simulation prototypes linked
Hardware Description Language (HDL) simulators to
an ISS (Instruction Set Simulators) executing the
Software part.
 Soon, HDL/ISS Co-simulation environments
became commercially available and are still widely
employed.
 This HDL/ISS approach is severely limited by the slow
simulation speed of the HDL simulator, especially in
case of large systems with several ISSes and
significant hardware portions.
 The concept of flexible hardware abstraction levels
has been developed,
 Here accuracy can be traded against simulation
speed.

 Maximum simulation speed can be achieved
by using compiled ISS technology together
with highly abstract functional SystemC
models of the hardware part

 The original goal of HW/SW Co-design was to reach the same
degree of tool automation known from RTL synthesis, i.e. a
formalized system specification is automatically partitioned and
synthesized to the optimal target architecture
 Automated HW/SW partitioning and System Synthesis have never
gained industrial relevance
 Partitioning decision metric is restricted to worst case execution time,
 Other important metrics like average performance, cost, and power
dissipation are not taken into account.
 Even the worst case execution time proved to be hard to estimate in
the general case of parallel, data dependent, and interleaved software
execution
 HW/SW partitioning and automated synthesis are still not
recognized as a dominant issue
 System architects are interested in the impact on performance of
a specific target architecture
 To partly automate this mapping,
 Communication Synthesis
 HW/SW Interface Synthesis
emerged as new branches of HW/SW Co-design

 Techniques for the analysis of communication requirements and synthesis
of the communication architecture
 As of today, Communication Analysis and Synthesis techniques need
further advancement to cope with emerging Network-on-Chip
architectures.
 One attempt is to instantiate the NoC library elements (routers, network
interfaces, links) from a high-level view of the SoC floorplan
 Selection of the actual library elements can be done in different ways:
 In an application-centric approach, the network topology can be generated from a
communication graph of the application
 In an architecture-centric approach, the communication architecture can be
refined from an abstract channel view via a network topology view towards a
micro-architecture view.
 So far the analysis of Network on Chip architectures is performed using
handcrafted simulation models, which are mostly based on SystemC
 The absence of standardized APIs, abstraction levels and modeling
frameworks beyond the plain SystemC language so far hinders the
creation of interoperable IP models for NoC architectures.
 Some of the current projects working on a unified modeling environment
for the exploration of NoC architectures are discussed in section 5.3.3
below.

 Here, the designer decides on the
partitioning and architecture mapping
 The realization of these decisions is
supported by automating the tedious task of
generating the required Software driver
functions as well as the Hardware glue-logic
 Recently the technology has been ported to
SystemC

 MP-SoC platform phase is
concerned with:
 System architecture specification
 Application mapping
 Abstraction concepts on this level
have to support the joint
consideration of application and
architecture
 High level of detail inherent to
Register Transfer Level (RTL)
implementation models prohibits
the investigation and optimization
across heterogeneous
communication and processing
elements
 Significant research has been spent
on the definition of the
appropriate System Level Design
language.
 Today SystemC is generally
considered as the standard
language for all kinds of SLD tasks.

 SystemC was initially conceived to replace VHDL and
Verilog as a Hardware Description Language
 For this reason it naturally provides all hardware specific
concepts e.g., time, parallelism, and hierarchy
 With version 2.0 SystemC has been thoroughly revised to
become a fully elaborated actor oriented design language
 The incorporated Interface Method Call (IMC) principle
enables a clean separation of interfaces and behavior as
well as orthogonalization of further modeling attributes
 All kinds of methodology and application domain specific
Models of Computation (MoC) can be implemented on top
of the generic event-driven SystemC simulator
 SystemC 2.0 enables a smooth transition from functional
phase to the MP-SoC platform phase, e.g. hybrid
simulation of an architecture model in the context of an
algorithmic Data-Flow model

 Since SystemC is a native C++ library, it
inherently supports Object Oriented
Programming
 Final version 2.1 of the language has become
an official IEEE standard
 Development of the Transaction Level
Modeling (TLM) kit
 Synthesizable subset of SystemC

 The characteristic property of TLM:
 Pin-level communication interface of RTL models
replaced by a set of interface methods.
 This IMC based communication mechanism is
provided by all actor-oriented specification
languages

 SystemC based TLM has demonstrated the potential
in terms of increased simulation speed and modeling
efficiency
 The basic TLM API consists of a bidirectional transport
and a set of unidirectional put and get interfaces
 The bidirectional transport has blocking
synchronization
 Implementation of the interface is allowed to call
wait(.)
 The unidirectional interfaces are available in a
blocking and a non-blocking version
 These interfaces can be seen as a foundation layer for
the creation of more advanced TLM interfaces, which
serve a specific methodology or model a specific
communication protocol
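
The shape of this basic API can be sketched in Python. The method names `transport`, `put`, `get`, `nb_put`, and `nb_get` loosely follow the naming used in the OSCI TLM interfaces, but everything else here (the `Memory` target, the tuple-based request format, the `TlmFifo` class built on `queue.Queue`) is an assumption for illustration and not the SystemC TLM library itself.

```python
import queue


class TransportIf:
    """Bidirectional blocking transport: one call carries request and response."""
    def transport(self, request):
        raise NotImplementedError


class Memory(TransportIf):
    """Hypothetical target: requests are (operation, address, data) tuples."""
    def __init__(self):
        self._cells = {}

    def transport(self, request):
        op, addr, data = request
        if op == "write":
            self._cells[addr] = data
            return ("ok", None)
        return ("ok", self._cells.get(addr, 0))


class TlmFifo:
    """Unidirectional put/get channel with blocking and non-blocking variants."""
    def __init__(self, depth):
        self._q = queue.Queue(maxsize=depth)

    def put(self, item):
        self._q.put(item)            # blocking: waits while the FIFO is full

    def get(self):
        return self._q.get()         # blocking: waits while the FIFO is empty

    def nb_put(self, item):
        try:
            self._q.put_nowait(item) # non-blocking: report failure, never wait
            return True
        except queue.Full:
            return False

    def nb_get(self):
        try:
            return True, self._q.get_nowait()
        except queue.Empty:
            return False, None
```

An initiator would call `Memory().transport(("write", 0x10, 42))` and receive the response as the return value, whereas a producer using `nb_put` on a full `TlmFifo` gets `False` back and has to retry later.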

 The two cycle-level TLM
layers
 Bus Accurate (BA)
 Cycle Callable (CC)
 These levels are
particularly suitable to
create a cycle-accurate
prototype of the system
architecture
 The (usually cycle-
accurate) Instruction Set
Simulators (ISS) of the
programmable
architectures are
connected to cycle- and
bit-accurate models of
memories, communication
resources and peripherals

 BA and CC difference:
 BA captures a transaction
within a single method call,
 CC models provide separate
methods for every phase of
a transaction.
 The Programmer’s View
(PV) abstraction levels
address early integration
of (usually instruction
accurate) ISSes for SW
development purposes
 PV provides a bit and
address-map accurate view
of the MP-SoC architecture
context for the
programmable processing
elements
 PV is based on the
bidirectional blocking
transport API
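
The difference between the BA and CC styles can be sketched as two views of the same memory slave. This is a simplified Python illustration; the phase method names (`send_request`, `send_write_data`, `get_read_data`) and the two-phase protocol are assumptions and do not reproduce any particular bus API.

```python
class BusAccurateSlave:
    """BA style: a whole transaction is captured in a single method call."""
    def __init__(self):
        self.mem = {}

    def read(self, addr):
        return self.mem.get(addr, 0)

    def write(self, addr, data):
        self.mem[addr] = data


class CycleCallableSlave:
    """CC style: one method per transaction phase, called cycle by cycle."""
    def __init__(self):
        self.mem = {}
        self._pending = None  # request accepted in the address phase

    def send_request(self, op, addr):
        if self._pending is not None:
            return False      # slave busy: the master retries in a later cycle
        self._pending = (op, addr)
        return True

    def send_write_data(self, data):
        _, addr = self._pending
        self.mem[addr] = data
        self._pending = None

    def get_read_data(self):
        _, addr = self._pending
        self._pending = None
        return self.mem.get(addr, 0)


def demo():
    ba = BusAccurateSlave()
    ba.write(4, 7)                            # one call per transaction

    cc = CycleCallableSlave()
    accepted = cc.send_request("write", 4)    # address phase
    cc.send_write_data(7)                     # data phase
    accepted = accepted and cc.send_request("read", 4)
    return accepted, ba.read(4), cc.get_read_data()
```

Both slaves end up returning the same data, but the CC version exposes the intermediate state between phases, which is what a cycle-by-cycle caller needs and what makes such models more detailed and slower to write and simulate.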

 The Open Core Protocol International Partnership
(OCP-IP) is getting a lot of traction throughout the
industry
 OCP-IP provides a highly configurable SoC protocol, and
its System Level Design working group has worked
on Transaction Level Modeling from the early days

 Lowest level: Transaction Layer 1 (TL1)
 provides a fully cycle accurate model of the OCP protocol
 Fully aligned with the CC abstraction level from OSCI.
 Next higher level: Transaction Layer 2 (TL2)
 Represents basically a cycle-approximate abstraction of the OCP protocol.
 The API contains a large number of OCP specific features
 e.g. thread-busy, handshake timing, or sideband signals.
 The timing is not cycle accurate, but can be annotated to a near-cycle accurate level
 Highest Level: Transaction Layer 3 (TL3)
 protocol agnostic subset of TL2
 API is limited to a concise set of primitives,
 Models timing-approximate on-chip communication

 PV TLM platforms for early SW development as well
as cycle-level TLM for HW/SW and TLM/RTL co-
verification are successfully deployed throughout the
industry
 However, both use-cases solve only parts of the
challenges during the MP-SoC design phase
 In particular, the architecture definition and task
partitioning are not adequately addressed
 PV platforms simulate very fast and are well suited
for SW development
 Unfortunately they do not contain sufficient timing
information for architectural investigations
 The blocking semantics of the underlying
bidirectional transport API hinders the smooth
annotation of further timing information

 Cycle accurate models of the SoC platform are too
detailed and too slow for architecture definition and task
partitioning
 First, the effort to create such a cycle-accurate model of
the complete platform is way too high to allow for the
investigation of a large number of architecture and
application mapping alternatives
 Second, the reachable simulation speed in the order of
100k cycles per second is not sufficient for the analysis of
a large number of design parameter choices
 As a result, the exploration of broad design spaces is still a
cumbersome process in cycle-level TLM based design flows
 Cycle-level TLM communication models have architecture
specific interfaces.
 Thus, every time the designer is inclined to explore a new
communication architecture he has to change the
interface of the connected functional models
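
The impact of a simulation speed of about 100k cycles per second can be made concrete with a small calculation. The sketch below is only a back-of-the-envelope estimate in Python; the 100 MHz target clock, the one second of simulated time, and the 50 explored configurations are hypothetical values, and only the 100k cycles per second figure comes from the slide.

```python
def wall_clock_seconds(simulated_seconds, clock_hz, sim_cycles_per_second):
    """Wall-clock time needed to simulate a given span of target time."""
    return simulated_seconds * clock_hz / sim_cycles_per_second


def exploration_hours(num_configurations, simulated_seconds, clock_hz,
                      sim_cycles_per_second):
    """Total wall-clock hours for simulating every configuration once."""
    per_run = wall_clock_seconds(simulated_seconds, clock_hz,
                                 sim_cycles_per_second)
    return num_configurations * per_run / 3600.0


# Assumed scenario: 1 s of a 100 MHz platform at 100k simulated cycles/s.
per_run = wall_clock_seconds(1, 100_000_000, 100_000)
total_hours = exploration_hours(50, 1, 100_000_000, 100_000)
```

Under these assumptions a single run takes 1000 seconds of wall-clock time, and exploring 50 alternatives takes close to 14 hours, before counting the effort of remodeling interfaces for each communication architecture.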

 For this reason the Design Space Exploration framework
deploys a generic synchronization interface, which
provides the same primitives as the newly standardized
OCP TL3 API
 Obviously, the TL3 API presents the best fit for this purpose
 It is compliant with the OSCI TLM standard
 Additionally, it is of reasonable complexity, and yet offers
sufficient expressiveness to meet the accuracy
requirements for design space exploration
 By deploying SystemC based Transaction Level Modeling
the framework is nicely integrated into the flourishing ESL
ecosystem.
 This method is interoperable with the PV and cycle-
accurate modeling methodologies and can benefit from the
commercial tool support, available IP models, and
established ESL design methodologies

 Component Based Design
 Founded on the assumption that the processing elements and
communication templates are available as IP blocks
 Communication Based Design:
 Envisions MP-SoC platform design as a composition of reusable
IP blocks
 Different from Component Based Design
 Omits the consideration of processing elements
 Is exclusively focused on the conceptualization and
implementation of the communication architecture.
 Communication Based Design can be seen as the corresponding
design paradigm to match emerging NoC architectures.
 Design Space Exploration (DSE) Environment
 The goal is to take early design decisions with respect to
system architecture and application mapping on the basis of an
abstract performance model.
 The embedded application needs to be modeled together with
the MP-SoC architecture at a high level of abstraction

Introduction to SoC
Design Space Exploration (DSE)
Methodology

 Ultimate goal is to meet the System Level Design
requirements as specified and to cope with the
full architectural complexity of emerging MP-SoC
architectures

 MP-SoC Framework follows the Y-chart
principle
 Set of functional application models is
merged with a set of architecture
models in a dedicated mapping step
 The developed embodiment of the Y-chart
principle is called Virtual Architecture
Mapping (VAM), which comprises:
 Well defined abstraction level above
cycle-level TLM for efficient modeling
of embedded applications
 Set of generic, parameterizable
architecture models, which capture the
notion of shared and resource limited
architectural fabrics for communication
and computation
 Rigorous definition of a timing model,
that embodies the performance of a
selected application-architecture-
mapping
 MP-SoC simulation framework featuring
a declarative mapping mechanism to
minimize turn-around times during the
iterative architecture exploration cycle
 Comprehensive set of analysis tools for
functional and performance validation
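
The Y-chart idea of merging application models with architecture models in a separate mapping step can be sketched as follows. This Python sketch is not the VAM framework; the dictionaries, the function names `evaluate_mapping` and `best_mapping`, and the very coarse timing model (tasks on the same processor simply run back to back, communication cost ignored) are all assumptions made for illustration.

```python
def evaluate_mapping(tasks, processors, mapping):
    """Return per-processor busy time in seconds for one mapping.

    tasks:      task name -> workload in cycles   (application model)
    processors: processor name -> clock in Hz     (architecture model)
    mapping:    task name -> processor name       (declarative mapping)
    """
    busy = {name: 0.0 for name in processors}
    for task, cycles in tasks.items():
        pe = mapping[task]
        # A processor is a shared, limited resource: its tasks are serialized.
        busy[pe] += cycles / processors[pe]
    return busy


def best_mapping(tasks, processors, candidates):
    """Pick the candidate whose most heavily loaded processor finishes first."""
    return min(
        candidates,
        key=lambda m: max(evaluate_mapping(tasks, processors, m).values()),
    )


# Hypothetical application and architecture models.
tasks = {"fft": 4_000_000, "decode": 2_000_000, "ctrl": 500_000}
processors = {"dsp": 200_000_000, "risc": 100_000_000}
all_on_risc = {"fft": "risc", "decode": "risc", "ctrl": "risc"}
fft_on_dsp = {"fft": "dsp", "decode": "risc", "ctrl": "risc"}
chosen = best_mapping(tasks, processors, [all_on_risc, fft_on_dsp])
```

Because the mapping is plain data, trying another alternative means editing a dictionary rather than rewriting the task or processor models, which is the point of a declarative mapping mechanism. In this example `chosen` is `fft_on_dsp`, since moving the FFT to the DSP reduces the longest busy time from about 65 ms to about 25 ms.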
