Delivered by:
Subhash Iyer,
Program Head,
Soft Polynomials (I) Pvt. Ltd., Nagpur
(CDAC ATC)
Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 1
• Introduction
• What is SoC?
• SoC characteristics
• Benefits and drawbacks
• Solution
• Major SoC Applications
• Summary
• Technological advances
  – Today's chips can contain 100M transistors
  – Transistor gate lengths are now measured in nanometers
  – The number of transistors on a chip doubles approximately every 18 months (Moore's law)
• The consequences
  – Components once connected on a Printed Circuit Board can now be integrated onto a single chip
  – Hence the development of System-on-Chip design
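The doubling rule above can be turned into a quick projection. This is a minimal sketch, assuming an ideal 18-month doubling period and starting from the 100M-transistor figure on the slide; the function name `projected_transistors` is made up for illustration.

```python
def projected_transistors(start_count: float, months: float,
                          doubling_period_months: float = 18.0) -> float:
    """Transistor count after `months`, doubling once per doubling period."""
    return start_count * 2 ** (months / doubling_period_months)


# Start from the 100M-transistor chip mentioned on the slide.
for years in (0, 3, 6):
    count = projected_transistors(100e6, years * 12)
    print(f"after {years} years: {count / 1e6:.0f}M transistors")
```

Under this idealized rule, three years (two doubling periods) multiplies the count by four and six years by sixteen.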
System on a board
System on a Chip
Version A:
Advances in VLSI manufacturing technology have made it possible to put millions of transistors on a single die. This enables designers to build systems-on-a-chip that eventually move everything from the board onto the chip.
Version B:
An SoC is a high-performance microprocessor, since we can program the uP and instruct it to do whatever we want.
Version C:
SoC is the effort to integrate heterogeneous silicon IPs, such as memory, uP, random logic, and analog circuitry, onto the same chip.
All of the above are partially right, but not very accurate!
• An SoC is not only a chip; the emphasis is on the "system".
• SoC = Chip + Software + Integration
• The SoC chip includes:
  – Embedded processor
  – ASIC logic and analog circuitry
  – Embedded memory
• The SoC software includes:
  – OS, compiler, simulator, firmware, drivers, protocol stack
  – Integrated development environment (debugger, linker, ICE)
  – Application interface (C/C++, assembly)
• The SoC integration includes:
  – The whole system solution
  – Manufacturing consulting
  – Technical support
• A typical digital system design involves a significant amount of custom logic circuitry, but also includes pre-designed major components, such as processors, memory units and various types of input/output (I/O) interfaces.
• In the traditional approach to designing such systems, a new integrated circuit (IC) chip is created for the custom logic circuits, but each pre-designed component is included as a separate chip.
• A different approach to realizing digital systems is called embedded system design. It leverages the advanced capabilities of today's IC technology by implementing many of the components of the system within a single chip, such as a field-programmable gate array (FPGA).
Today's FPGAs:
• Offer large logic capacity, exceeding several million equivalent logic gates, and include dedicated memory resources
• Include special hardware circuitry that is often needed in digital systems, such as digital signal processing (DSP) blocks (with multiply-and-accumulate functionality) and phase-locked loops (PLLs) or delay-locked loops (DLLs) that support complex clocking schemes
• Support a wide range of interconnection standards, such as double data rate (DDR SRAM) memory, PCI and high-speed serial protocols
ASIC Typical Design Steps
• A typical ASIC design can take up to two years to complete
[Figure: schedule chart of the ASIC design steps: Top Level Design, Unit Block Design, Unit Block Verification, Integration and Synthesis, Trial Netlists, System Level Verification, Timing Convergence & Verification, DVT Prep, Fabrication, DVT. Durations in weeks as shown on the chart: 6, 12, 12, 4, 14, ??, 5, 8. Time to mask order: 48 weeks; further marker at 61 weeks.]
SoC Typical Design Steps
[Figure: schedule chart of the SoC design steps: Top Level Design, Unit Block Design, Unit Block Verification, Integration and Synthesis, Trial Netlists, System Level Verification, Timing Convergence & Verification, DVT Prep, Fabrication, DVT. Durations in weeks as shown on the chart: 4, 14, 5, 4, 4, 2. Time to mask order: 24 weeks; further marker at 33 weeks.]
• With the increasing complexity of ICs and decreasing geometries, the IC vendor steps of placement, layout and fabrication are unlikely to be greatly reduced.
• In fact, there is a greater risk that the timing convergence steps will involve more iterations.
• The time spent before the vendor steps needs to be reduced.
• Layout issues need to be considered up front.
• Design reuse is facilitated if "standard" internal connection buses are used.
• All cores connect to the bus via a standard interface.
• Any-to-any connections are easy, but:
  – Not all connections are necessary.
  – Global clocking scheme.
  – Power consumption.
• Standardization is being addressed by the Virtual Socket Interface Alliance (VSIA).
• AMBA (Advanced Microcontroller Bus Architecture) is a collection of buses from ARM, each satisfying a different range of criteria.
• APB (Advanced Peripheral Bus): a simple strobed-access bus with minimal interface complexity. Suitable for hosting peripherals.
• ASB (Advanced System Bus): a multimaster synchronous system bus.
• AHB (Advanced High-performance Bus): a high-throughput synchronous system backbone. Supports burst transfers and split transactions.
• One solution to the design productivity gap is to make ASIC designs more standardized by reusing segments of previously manufactured chips.
• These segments are known as "blocks", "macros", "cores" or "cells".
• The blocks can either be developed in-house or licensed from an IP company.
• Cores are the basic building blocks.
• Soft Macro
– Reusable synthesizable RTL or a netlist of generic library elements
– The user of the core is responsible for the implementation and layout
• Firm Macro
– Structurally and topologically optimized for performance and area through floor planning and placement
– Exists as synthesized code or as a netlist of generic library elements
• Hard Macro
– Reusable blocks optimized for performance, power and size, and mapped to a specific process technology
– Exists as a fully placed and routed netlist and as a fixed layout, such as in GDSII format
[Figure: soft, firm and hard cores placed on two axes: reusability, portability and flexibility versus predictability, performance and time to market.]
• Locating the required cores and associated
contract discussions can be a lengthy
process
– Identification of IP vendors
– Evaluation criteria
– Comparative evaluation exercise
– Choice of core
– Contract negotiations
• Reuse restrictions
• Costs: license, royalty, tool costs
– Core integration, simulation and verification
• An MPSoC is a system-on-chip that contains multiple instruction-set processors (CPUs).
• The typical MPSoC is a heterogeneous multiprocessor: there may be several different types of processing elements (PEs), the memory system may be heterogeneously distributed around the machine, and the interconnection network between the PEs and the memory may also be heterogeneous.
• MPSoCs often require large amounts of memory. The device may have embedded memory on-chip as well as relying on off-chip commodity memory.
• These chips have:
  – one or several processors
  – large amounts of memory
  – bus-based architectures
  – peripherals
  – coprocessors
  – and I/O channels
• There are several benefits in integrating a large digital system into a single integrated circuit.
• These include:
– Lower cost per gate
– Lower power consumption
– Faster circuit operation
– More reliable implementation
– Smaller physical size
– Greater design security
• The principal drawbacks of SoC design are associated with the design pressures imposed on today's engineers, such as:
– Time-to-market demands
– Exponentially rising fabrication cost
– Increased system complexity
– Increased verification requirements
• Why does it take longer to design SoCs than traditional ASICs?
• We must examine the factors influencing the degree of difficulty and the Turn Around Time (TAT), the time taken from gate-level netlist to metal mask-ready stage, for designing ASICs and SoCs.
• For an ASIC, the following factors influence TAT:
  – Frequency of the design
  – Number of clock domains
  – Number of gates
  – Density
  – Number of blocks and sub-blocks
• The key factor that influences TAT for SoCs is system integration (integrating different silicon IPs on the same IC).
• Overcome complexity and verification issues by designing Intellectual Property (IP) to be reusable.
• This is done on such a scale that a new industry has developed.
• Design activity is split into two groups:
– IP Authors: producers
– IP Integrators: consumers
• IP Authors produce fully verified IP libraries
– This makes the overall verification task more manageable
• IP Integrators select, evaluate and integrate IP from multiple vendors
– IP is integrated onto an Integration Platform designed with a specific application in mind
IP cores are classified into three distinct categories:
• Hard IP Cores
• Firm IP Cores
• Soft IP Cores
Hard IP cores consist of hard layouts using particular physical design libraries and are delivered as mask-level designed blocks (GDSII format). The integration of hard IP cores is quite simple, but hard cores are technology dependent and provide minimum flexibility and portability in reconfiguration and integration.
Soft IP cores are delivered as RTL
VHDL/Verilog code to provide functional
descriptions of IPs. These cores offer
maximum flexibility and reconfigurability
to match the requirements of a specific
design application, but they must be
synthesized, optimized, and verified by
their user before integration into designs.
Firm IP cores bring the best of both worlds and balance the high performance and optimization properties of hard IPs with the flexibility of soft IPs. These cores are delivered as netlists targeted to specific physical libraries: they have gone through synthesis, but no physical layout has been performed.
Embedded software (eS/W): current application complexity
• Set-top box: >1 million lines of code
• Digital audio processing: >1 million lines of code
• Recordable DVD: over 100 person-years of effort
• Hard-disk drive: over 100 person-years of effort
In multimedia systems
• S/W cost (licenses) is 6X larger than the H/W chip cost
• eS/W uses 50% to 80% of design resources
• eS/W is now an essential part of SoC products
• Speech signal processing
• Image and video signal processing
• Information technologies
  – PC interfaces (USB, PCI, PCI Express, IDE, etc.)
  – Computer peripherals (printer control, LCD monitor controller, DVD controller, etc.)
• Data communication
  – Wireline communication: 10/100Base-T, xDSL, Gigabit Ethernet, etc.
  – Wireless communication: Bluetooth, WLAN, 2G/3G/4G, WiMAX, UWB, etc.
• Consumer devices,
• Networking,
• Communications, and
• other segments of the electronics industry.
Examples: microprocessors, media processors, GPS controllers, cellular phones, GSM phones, smart pager ASICs, digital television, video games, PC-on-a-chip
Systems on chip are everywhere
Technology advances enable increasingly more complex designs
Central Question: how to exploit deep-submicron
technologies efficiently?
• Technological advances mean that complete systems can now be implemented on a single chip.
• The benefits that this brings are significant in terms of speed, area and power.
• The drawback is that these systems are extremely complex and require large amounts of verification.
• The solution is to design and verify reusable IP.
Introduction to SoC Design Aspects
• At each level of circuit abstraction the circuit is equivalent and performs the same target operation, but its structural components (and hence their granularity) are different, and the design issues may be different too.
• Embedded applications in the multimedia, wireless communications and networking domains were implemented on Printed Circuit Boards (PCBs).
• The boards are composed of discrete Integrated Circuits (ICs):
  – General Purpose Processors
  – Digital Signal Processors
  – Application Specific Integrated Circuits
  – Memories
  – Further peripherals
• Communication between the discrete processing elements and memories is realized by shared bus architectures (such as PCI).
• The transition is from board-level integration towards System-on-Chip (SoC) implementations of embedded applications.
• Today multiple heterogeneous processing elements and memories can be integrated on a single chip:
  – Increased performance
  – Reduced cost
  – Improved energy efficiency
• This trend originates from a tremendous increase in features as well as the multitude of co-existing standards.
• The resulting functional complexity clearly promotes software-enabled solutions to achieve the required flexibility and cope with demanding time-to-market conditions.
• However, the stringent energy efficiency constraints of mobile applications and cost-sensitive consumer devices prohibit the use of general purpose processors.
• The tight cost and performance requirements of versatile embedded systems lead to application-specific heterogeneous multi-processor architectures.
• The classical vertical partitioning approach to HW/SW co-design, where the performance-critical parts are implemented as dedicated HW blocks and the rest is executed in SW, is no longer applicable.
• Instead, HW/SW co-design can be seen as:
  – A multi-dimensional horizontal mapping problem of an application running on a heterogeneous multiprocessor platform.
• During the mapping process:
  – Exploit application-inherent parallelism to achieve performance at reasonable cost.
• For the computationally intensive portions of typical embedded applications, the extraction of Task Level Parallelism (TLP) is mostly straightforward:
  – The partitioning into a set of loosely coupled functional blocks can be naturally derived from the algorithmic block diagram.
• Two major aspects:
  – Processing mapping: a set of processing elements has to be provided for the efficient execution of the functional tasks.
  – Communication mapping: the inter-task data exchange has to be mapped to a communication architecture.
• Only a joint consideration of architectural choices in both areas offers the opportunity for near-optimal quality of results.
• Recent architectural advances offer a huge design space with enormous potential for optimization.
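The need for a joint view of processing and communication can be illustrated with a toy mapping problem. This is only a sketch: the task names, execution times, transfer costs and the cost model (load of the busiest PE plus all cross-PE transfers) are assumptions made up for this example, not data from the slides.

```python
from itertools import product

# Hypothetical execution time of each task on each type of processing element.
EXEC_TIME = {
    "filter": {"dsp": 2, "risc": 6},
    "decode": {"dsp": 4, "risc": 3},
    "control": {"dsp": 7, "risc": 3},
}
# Hypothetical transfer cost, paid only when the two tasks sit on different PEs.
COMM_COST = {("filter", "decode"): 3, ("decode", "control"): 1}


def mapping_cost(mapping):
    """Load of the busiest PE plus the cost of all cross-PE transfers."""
    load = {}
    for task, pe in mapping.items():
        load[pe] = load.get(pe, 0) + EXEC_TIME[task][pe]
    comm = sum(c for (a, b), c in COMM_COST.items() if mapping[a] != mapping[b])
    return max(load.values()) + comm


def best_mapping(pes=("dsp", "risc")):
    """Exhaustively search all task-to-PE assignments (fine for tiny examples)."""
    tasks = list(EXEC_TIME)
    candidates = (dict(zip(tasks, choice))
                  for choice in product(pes, repeat=len(tasks)))
    return min(candidates, key=mapping_cost)


# Picking the fastest PE per task while ignoring communication:
greedy = {t: min(times, key=times.get) for t, times in EXEC_TIME.items()}
joint = best_mapping()
print("processing only:", greedy, "cost", mapping_cost(greedy))
print("joint          :", joint, "cost", mapping_cost(joint))
```

In this toy case the per-task choice puts "decode" on the RISC core and pays for an expensive transfer, while the joint search keeps "filter" and "decode" together on the DSP and ends up cheaper overall.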
• The bus paradigm, as inherited from the PCB era, constitutes the major power and performance bottleneck.
• Chip-wide communication is envisioned to be handled by full-scale Network-on-Chip (NoC) architectures.
• Network-on-Chip architectures:
  – Resolve the physical issues
  – Address the functional aspects of on-chip communication
• So far, the dynamic priority-based arbitration scheme of shared buses creates a mutual dependency between all components connected to the bus.
• Due to this lack of traffic management capabilities, every change in the traffic requirements of the application requires a re-design of the bus architecture.
• Instead, NoC architectures take advantage of sophisticated networking algorithms to provide elaborate traffic management capabilities.
• In this way the ad-hoc communication mapping is replaced with a disciplined allocation of the required communication services, and the on-chip network takes care of providing the required resources.
• From the system architecture perspective, this separation of the offered communication services from the architectural resources can be considered a virtualization of the actual communication architecture.
• This virtualization effectively decouples the mapping problem for communication and computation.
• The price to pay for the physical and functional benefits of NoC-based communication is a significant penalty in terms of chip area as well as transfer latency.

• Programmable processing elements achieve significant gains in performance and computational efficiency by tailoring the following to the respective set of tasks:
  – instruction set
  – micro-architecture
• Examples are innovative architectures exploiting:
  – Instruction Level Parallelism (ILP)
  – Data Level Parallelism (DLP)
• Despite the increased computational performance, the effective performance is often constricted by the communication architecture, since memory access latency does not keep pace with the processing power.
• General purpose processors resolve the memory access bottleneck by using sophisticated cache and memory hierarchies.
• This is generally not applicable to embedded applications due to the poor memory locality of stream-driven and packet-based data processing.
• Instead, processor architectures are equipped with hardware-supported Multi-Threading (HW-MT) to perform task switches with virtually no performance overhead.
• In this way the application-inherent TLP is exploited to hide memory latency, which effectively leads to a significant increase in processor utilization.
• This technique is already widely employed in the network processor domain and has recently found its way into advanced multimedia and signal processing platforms.
• In the light of the latency issue caused by NoC architectures, the importance of memory latency hiding techniques is likely to increase in the future.
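A back-of-the-envelope model shows why hardware multi-threading raises utilization. This is a sketch under simplifying assumptions that are not from the slides: every thread alternates a fixed compute burst with a fixed memory stall, and thread switches cost nothing. The cycle counts are illustrative only.

```python
def utilization(threads: int, compute_cycles: int, memory_latency_cycles: int) -> float:
    """Fraction of cycles spent on useful work with zero-cost thread switches.

    While one thread waits for memory, the other threads can use the
    processor; utilization saturates at 1.0 once enough threads are available.
    """
    busy = threads * compute_cycles
    period = compute_cycles + memory_latency_cycles
    return min(1.0, busy / period)


# Illustrative numbers: 20 compute cycles between accesses, 80-cycle memory latency.
for n in (1, 2, 4, 5, 8):
    print(f"{n} thread(s): utilization {utilization(n, 20, 80):.0%}")
```

With these numbers a single thread keeps the processor busy only a fifth of the time, and five threads are enough to hide the memory latency completely; a longer latency, as expected from a NoC, raises the number of threads needed.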
• Taking the above considerations together, future SoCs can be considered as:
  – NoC-enabled multi-processor architectures.
• An on-chip communication backbone connects a large number of heterogeneous processing clusters and global storage elements.
• Individual processing clusters consist of one or a few application-specific programmable kernels together with tightly coupled instruction and data memories as well as local peripherals.
• To cope with the resulting design complexity:
  – Achieve virtualization of the architectural resources,
  – so that they can be allocated by the system architect in a deterministic way.
• This virtualization is provided by:
  – The NoC approach for the communication part
  – SW and HW operating systems for the control and data processing, respectively
• Divide-and-conquer oriented design paradigm
  – Enables individual optimization of the architectural elements
• The price for these benefits
  – A penalty in terms of chip area,
  – generally considered to be of constantly decreasing importance.
• HW/SW co-design of a given embedded application is defined as:
  – Architecting a heterogeneous MP-SoC platform
  – Allocating the architectural resources for the execution of the application
• Architecture virtualization resolves the mutual dependencies in the mapping process.
• Trade-offs in the design space still require a joint consideration of application and architecture as well as communication.
• For example:
  – The latency of a more complex on-chip network can be compensated by either:
    • introducing a memory hierarchy
    • employing hardware multi-threaded processor kernels
• Obviously, the resulting design space is virtually infinite.
• The architecting and mapping phases cannot be considered independently without sacrificing quality of results.
• What is needed is:
  – A system-level design methodology
  – A corresponding tool-supported modeling framework
• Transaction-Level Modeling (TLM)
  – Advocated by the SystemC language
  – The system-level design paradigm
  – Already incorporated into state-of-the-art Electronic System Level (ESL) tools
• TLM greatly improves:
  – modeling efficiency
  – simulation speed
• TLM abstracts:
  – from the low-level communication details of the Register Transfer Level (RTL)
  – to complete transactions
• It is usually employed in a byte- and cycle-accurate fashion.
• We will look more closely at the packet-level TLM paradigm:
  – Cycle-level TLM is still too detailed to explore large design spaces.
• Since communication becomes the driving design paradigm for MP-SoC:
  – The exploration framework is based on a sophisticated, communication-centric timing model
• Generic synchronization interface:
  – Defines a concise set of communication primitives
  – Follows the Open Core Protocol (OCP)
  – Is not biased towards any specific communication architecture
  – Additionally, the primitives incorporate timing annotation to achieve reasonable timing accuracy at the highly abstract packet-level TLM layer
• The communication timing model captures the impact of the interconnection architecture on performance.
• This communication timing model supports the full spectrum of available and proposed communication architectures, ranging from today's shared buses to the emerging NoC paradigm.
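The idea of a timing-annotated packet-level transaction can be sketched in a few lines. Everything here is hypothetical: the `Interconnect` class, its parameters and the cycle counts are invented for illustration and do not correspond to SystemC, OCP or any real tool API.

```python
from dataclasses import dataclass


@dataclass
class Interconnect:
    """Hypothetical packet-level timing model of a communication architecture."""
    name: str
    setup_cycles: int     # assumed arbitration or routing overhead per packet
    bytes_per_cycle: int  # assumed sustained transfer width

    def transfer_cycles(self, packet_bytes: int) -> int:
        """Timing annotation for one whole packet; no cycle-by-cycle simulation."""
        payload_cycles = -(-packet_bytes // self.bytes_per_cycle)  # ceiling division
        return self.setup_cycles + payload_cycles


# Illustrative parameters only: a low-overhead narrow bus and a
# higher-latency but wider on-chip network.
shared_bus = Interconnect("shared bus", setup_cycles=2, bytes_per_cycle=4)
noc = Interconnect("NoC", setup_cycles=12, bytes_per_cycle=8)

for packet_bytes in (16, 256):
    for ic in (shared_bus, noc):
        print(f"{ic.name}: {packet_bytes} B -> {ic.transfer_cycles(packet_bytes)} cycles")
```

Because the caller only sees `transfer_cycles`, the same application model can be evaluated against either interconnect, which is the architecture-neutral property the slide asks for.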
• Implemented by means of a versatile modeling framework for architecture exploration and hardware/software partitioning
• Key advantages:
  – Modeling efficiency
  – Higher simulation speed
  – A declarative specification mechanism for better design space exploration
• TLM is a method used for SoC design
  – It lets the designer specify the system at a higher level of abstraction
• It involves both communication and computation architectures
• The Unified Timing Model aims to standardize the TLM approach
Need to know why before what & how!!!
• Networking Domain
• Multimedia Domain
• Wireless Communications
• Networking applications constitute implementations of networking standards.
• IEEE, ITU, ETSI and others work out communication standards.
• The purpose of these standards is to achieve a high degree of interoperability.
• The ISO/OSI reference model provides a common terminology.
• Networking layer standards, in the middle of the ISO/OSI stack, address a multitude of higher-layer application standards as well as lower physical/link layer standards.
• The major implementation challenge and effort lies in the networking layer.
• Layer-three multi-service access switches are considered one of the potential killer applications for MP-SoC platforms, since they combine physical wire-speed throughput requirements with the flexibility constraints imposed by the individual treatment of different service classes and application characteristics.
• Today's de facto networking layer standard is the rather simplistic Internet Protocol (IP).
• Lower layers are nowadays built in as ready-made blocks.
• Physical and link layer data rates of core network equipment impose demanding performance requirements.
• Higher application layers are only present in the terminal devices,
  – so the relatively low to medium throughput requirements allow for a software implementation of the flexible, control-dominated functionality.
• Processing of all kinds of media data:
  – Pictures
  – Audio
  – Video decoding
  – Video pixel processing
  – 2D/3D graphics
• Standards enable the exchange of media data as well as device interoperability.
• MOPS: Mega Operations Per Second
• Advances in processing capabilities and multimedia algorithms, together with increased user expectations, fuel a constant proliferation of new multimedia standards:
  – Digital audio decoding (AC3, OGG, MP3)
  – Video decoding (MPEG2, MPEG4, H.263, H.264, DivX, QuickTime)
  – 3D graphics processing (DirectX 9)
• Apart from the multitude and dynamics of multimedia standards, a flexible implementation platform is also mandatory to meet the demanding cost constraints of converging consumer electronics devices such as the Advanced Set-Top Box (ASTB).
• Here the processing and communication fabrics have to be shared among the multitude of supported multimedia applications to limit implementation cost.
• Wireless communication applications aggressively use digital signal processing to maximize bandwidth efficiency.
• Again, a multitude of standards exists.
• Each marks a local optimum in:
  – Implementation cost
  – Mobility
  – Power dissipation
  – Performance and bandwidth efficiency
• The multimedia and wireless communication domains are converging into a new generation of Personal Digital Assistant (PDA) or smartphone devices.
• PDAs have started to support a huge variety of travel- and fun-related applications with much higher processing requirements, such as localization, navigation, travel assistant, video camera, digital camera, picture editing, MP3 player or games.
• Additionally, these portable, multimedia-enabled PDA devices are obliged to support multiple communication standards, both cable (USB, FireWire) and wireless (3G, WLAN).
• Summary of common trends:
  – New features and value-added services lead to exponentially increasing processing performance and communication requirements.
  – Standards become more dynamic and sophisticated and are introduced more rapidly: this calls for high flexibility of the SoC implementation to meet the resulting time-to-market as well as time-in-market requirements.
  – For mobile applications and cost-sensitive consumer electronic devices, energy efficiency becomes the prevailing cost factor.
• Heterogeneous Multi-Processor SoC (MP-SoC) platforms are generally believed to meet the above-mentioned conflicting performance, flexibility and energy efficiency requirements of demanding embedded applications.
• Hence, in the course of an MP-SoC platform design, the partitioning of a specific application is a task of major importance.
• Main partitioning principle:
  – Control-dominated domain
  – Data-dominated domain
• This first-order partitioning has a major influence on both the target processing and communication elements and on the appropriate design methodology.
• Examples
• Control-plane processing is characterized by:
  – Moderate performance requirements
  – Huge amounts of functionality
  – A call for maximum flexibility
• It is developed using:
  – An Integrated Design Environment (IDE) which is:
    • Architecture agnostic
    • Software centric
  – Software engineering techniques
  – Object Oriented Programming (OOP) using:
    • Unified Modeling Language (UML)
    • C++
    • Java
• To increase the reuse of the control-plane software (across multiple MP-SoC platform generations):
  – Hardware-dependent Software (HdS) portions are wrapped into:
    • a stack of middleware
    • a Real Time Operating System (RTOS)
    • device driver layers
• Parallelism in control-plane processing:
  – Instruction Level Parallelism (ILP)
    • Extracted by a VLIW compiler
    • Or by a superscalar processor architecture
    • Helps gain performance
  – Task Level Parallelism (TLP)
    • Generally not possible due to the huge amount of functionality
• Data-plane processing is characterized by:
  – Computationally intensive data manipulations
  – Performance at high data rates
  – Demand for high processing performance
  – Demand for high communication performance
• Rapidly evolving standards in all application domains impose increasing flexibility constraints.
Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 35
 Need to reach performance requirements of networking,
multimedia and wireless communications applications
 Requires aggressively exploiting abundant inherent
parallelism available in data-plane processing tasks
because:
 Functionality can be straightforwardly partitioned into a set of
loosely coupled tasks with well predictable or even cyclo-
stationary execution timing
 A well confined data set is associated with a single activation
of an individual task.
 Data sets associated with successive activations of an
individual task are mostly independent.
 These spatial and temporal properties with respect to
second order task partitioning and data dependency can
already be identified during the algorithm development
stage and lead to an identification of coarse grain TLP.
 This application inherent TLP enables the concurrent and
parallel execution on MP-SoC platforms.
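As a rough illustration of coarse-grain TLP, the sketch below maps independent activations of one task onto parallel workers. The task `process_packet`, the sample data and the use of a thread pool as a stand-in for PEs are all assumptions made for this example, not part of the original material.

```python
from concurrent.futures import ThreadPoolExecutor

def process_packet(packet):
    """Hypothetical data-plane task: checksum of one self-contained data set."""
    return sum(packet) % 256

# Each activation of the task gets its own, independent data set.
packets = [[1, 2, 3], [10, 20, 30], [100, 100, 100]]

# Because activations share no data, they can be handed to separate workers
# (standing in for PEs) without any synchronization between them.
with ThreadPoolExecutor(max_workers=3) as pool:
    checksums = list(pool.map(process_packet, packets))

print(checksums)
```

The results come back in activation order even though the workers may finish in any order, which mirrors the loosely coupled task model described above.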
More about SoC design concepts next !!!
The main aspects of
SoC architectural elements
 Macroscopic metrics for the classification
and evaluation of architectural elements
 Cost
 Performance
 Power Dissipation
 Computational Efficiency
 Flexibility
 Cost of an embedded architecture is separated into
 Non-Recurring Engineering (NRE) cost for the initial design
 Recurring chip fabrication cost.
 NRE cost is driven by the
 Design effort for HW
 SW development
 Fabrication of the initial mask set.
 Typical NRE cost for 90 nm SoC
 10-100 Million USD design effort
 1 Million USD per mask set
 Fabrication cost determined by
 Silicon die area
 Packaging
 Number of pins
 Power dissipation requirements
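The split between NRE and recurring cost can be made concrete with a small calculation. This is only a sketch: the NRE figure uses the low end of the numbers quoted above, while the 5 USD recurring cost and both production volumes are invented illustration values.

```python
def cost_per_chip(nre_usd, recurring_usd, volume):
    """Unit cost = NRE amortized over the production volume + recurring cost."""
    return nre_usd / volume + recurring_usd

# Low end of the figures above: 10 M USD design effort + 1 M USD mask set.
nre = 10e6 + 1e6

# Hypothetical recurring cost (5 USD) and production volumes.
low_volume = cost_per_chip(nre, 5.0, 100_000)
high_volume = cost_per_chip(nre, 5.0, 10_000_000)

print(f"100k units: {low_volume:.2f} USD per chip")
print(f"10M units:  {high_volume:.2f} USD per chip")
```

At low volumes the amortized NRE share dominates the unit cost; at high volumes the recurring fabrication cost does.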
 Performance of both computational
and communication architectures is
classified into:
 Latency
 Throughput
 Latency
 Absolute time passing between the
start and completion of a task,
 Throughput
 Number of accomplished tasks per
unit of time.
 Communication throughput is
measured in bits per second (bps).
 Throughput of programmable
processing elements is measured in
Million Instructions Per Second
(MIPS)
 MIPS is not a very accurate measure,
since the work done per instruction
differs between instruction sets
 Power dissipation is measured in watts
 Denotes the energy per unit time required to operate
an embedded system
 Is an architecture metric of growing importance
 Battery lifetime of mobile devices depends directly
on the energy consumption.
 Packaging cost depends on the heat dissipation
properties, which in turn depend on the power
consumption.
 Striving for low power and energy consumption
constitutes the key driver for architecture
differentiation of embedded SoC platforms
 Computational efficiency is derived from performance
and power consumption
 Characterizes efficiency of a given
architectural element with a single value
 Computational efficiency of programmable
architectures is predominantly measured in
MIPS/Watt.
 Alternatively measured in energy
consumption per task (since MIPS
measurement is not very accurate)
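Both efficiency measures can be written down in a few lines. The two processing elements below and all of their numbers are hypothetical; the sketch only shows how the metrics are computed and compared.

```python
def mips_per_watt(mips, power_w):
    """Computational efficiency as throughput per unit of power."""
    return mips / power_w

def energy_per_task_j(power_w, task_time_s):
    """Energy (joules) = power (watts) x time (seconds) for one task."""
    return power_w * task_time_s

# Hypothetical PEs: (MIPS, power in W, time to finish the same task in s)
gpp = (2000.0, 10.0, 0.004)
dsp = (500.0, 0.5, 0.002)

gpp_eff = mips_per_watt(gpp[0], gpp[1])
dsp_eff = mips_per_watt(dsp[0], dsp[1])
gpp_energy = energy_per_task_j(gpp[1], gpp[2])
dsp_energy = energy_per_task_j(dsp[1], dsp[2])

print(gpp_eff, dsp_eff)
print(gpp_energy, dsp_energy)
```

Energy per task compares the two PEs on the same piece of work, so it does not depend on how much each instruction set accomplishes per instruction.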
 Flexibility relates to the effort needed to change the
functionality of a given architectural
element
 In contrast to the previous metrics, flexibility
can hardly be measured accurately.
 Nonetheless, in the context of rapidly
evolving functionality and standards of
embedded applications, architectural
flexibility is of major importance to achieve
both a shorter time-to-market and a
longer time-in-market
 A processing element (PE) provides the computational
resource to execute a given portion of the application
 Dedicated hardware implementation yields best
performance
 Programmable PEs are controlled by an instruction
stream in a highly flexible way
 The rather poor performance of programmable PEs
has long fueled computer architecture research
towards parallelizing the execution of instructions
 Early efforts in parallel computer architectures are
classified according to the deployment of control-
and data-level parallelism
 SISD
 SIMD
 MIMD
 MISD
 SISD: Single Instruction Single Data
 Traditional von Neumann
computer architectures
 Sequentially execute a single instruction
stream on a single processing resource
 SIMD: Single Instruction Multiple Data
 Vector processing machines
 Perform a single instruction on multiple
data items in parallel
 Used in architectures for embedded DSP
and graphics applications
 Exploit inherent data-level parallelism
(DLP)
 MIMD: Multiple Instruction Multiple Data
 Traditional homogeneous multi-processor
type of architectures
 Employed in scientific supercomputers
 MISD: Multiple Instruction Single Data
 Rarely encountered class of
architectures
 Exploit temporal ILP by:
 Setting pipeline stages
 Executing several instructions
simultaneously
 Superpipelining:
 Uses deep execution pipelines to
increase the clock frequency
 Superscalarity
 Employs parallel functional units and
complex dispatcher architectures to
dynamically extract Instruction Level
Parallelism (ILP)
 Very Long Instruction Word (VLIW)
 Execute several statically scheduled
instructions on parallel functional
units,
 Hence the effort for ILP extraction is
moved into the compiler
 Hardware Multi-Threading (HW-MT)
 Such architectures are able to
concurrently pursue two or more
threads of control by providing
separate register resources for each
thread context
 Domain Specific (DS) Instruction Set
 Tailors the programmable PE to a
specific application domain
 Provide specialized functional units.
 DS processor examples are Digital
Signal Processors (DSPs) employed in
multimedia and wireless
communications, or Network
Processing Units (NPUs) for networking
applications
 The applicability of the above listed performance improvement
techniques depends on the considered set of target applications.
 Superpipelining and Superscalarity are heavily used in high
performance General Purpose Processor (GPP) architectures to
increase single thread performance of arbitrary applications at
the vast expense of silicon area and power dissipation.
 Embedded applications, on the other hand, are severely energy and
cost constrained, but still have significant performance and
flexibility requirements.
 The most promising approach to jointly optimize flexibility and
performance is to exploit coarse-grain TLP instead of ILP and map
the loosely coupled tasks to individually optimized PEs.
 Embedded PEs of this kind mostly rely on the more power-aware
performance optimization techniques, like VLIW, multi-threading
and a domain specific or even application specific instruction set.
 MIMD control parallelism plays an important
role in embedded SoC architectures
 Parallel execution of specialized PEs offers
 Chance for improving application performance
 Without sacrificing power efficiency
 Homogeneous multi-processing refers to the multiple
instantiation of identical PEs
 Corresponds to a single chip
implementation of the MIMD
principle
 Homogeneous multi-processing of
general purpose embedded micro
controllers
 Achieves the performance scaling
required for control-plane
processing portion of embedded
applications
 Also found for data-plane processing
in domain specific MP-SoC
platforms, where the identical
instruction set of the PEs is tailored
to a certain application domain
 Heterogeneous multi-processing employs multiple PEs
 Different PEs individually tailored to a certain task or task set
 Dedicated optimization
 Applicable for the data-plane processing as it allows for a
manual and static task allocation
 The high degree of specialization in heterogeneous multi-
processing further optimizes computational efficiency for a well
defined set of target applications at the expense of generality
 Parallel execution
 Requires multiple computational resources
 More than one task can be active at the same
point in time.
 Concurrent execution
 Interleaved processing of several tasks on a
single resource,
 At any time only one task can be active
 Benefit of concurrent execution is
depicted in figure
 2 tasks are mapped to a single
processing element
 Both tasks are divided into 2
processing portions
 These are separated by a
communication request
 After Δt_delay the processing of the
first portion is finished and the task
is blocked for Δt_response until the
request is accomplished.
 Instead of wasting the processor
resource during this period, the
processor context is swapped to the
second task by a scheduler.
 Utilization of the processor is
increased and the request latency
is hidden
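The latency-hiding effect can be checked with a little arithmetic. In this sketch `t_delay` and `t_response` correspond to the two intervals named above; `t_second` (the length of the second processing portion), the sample values, and the zero-cost context switch are assumptions made for illustration.

```python
def sequential_time(t_delay, t_response, t_second):
    """Each task runs to completion; the PE idles during every request."""
    return 2 * (t_delay + t_response + t_second)

def interleaved_time(t_delay, t_response, t_second):
    """The scheduler swaps to the other task while a request is pending."""
    t1_ready = t_delay + t_response        # response for task 1 arrives
    t2_issued = 2 * t_delay                # task 2 finishes its first portion
    t2_ready = t2_issued + t_response      # response for task 2 arrives
    # Task 1 resumes once the PE is free and its response has arrived.
    t1_done = max(t2_issued, t1_ready) + t_second
    # Task 2 resumes once task 1 is done and its own response has arrived.
    return max(t1_done, t2_ready) + t_second

# Hypothetical timings in arbitrary time units.
blocking = sequential_time(4, 3, 2)
swapping = interleaved_time(4, 3, 2)
print(blocking, swapping)
```

With these values the PE finishes both tasks sooner when it swaps contexts, because part of each response time overlaps with useful work on the other task.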
The main aspects of
SoC on-chip communication elements
 Basic cost, performance,
power, and flexibility
metrics apply.
 Additionally, Quality of
Service (QoS) metrics
known from the
networking application
domain are of increasing
importance to manage
complex on-chip traffic
 The scalability of the
communication
architecture gains growing
attention
 Bus based on-chip communication paradigm is
derived from the Printed Circuit Board (PCB)
domain.
 Examples:
 VME (Versa Module Eurocard bus)
 PCI (Peripheral Component Interconnect)
 Advantages:
 Easy programming model
 High flexibility
 Abundant availability of Intellectual Property (IP)
 Suited for small and medium scale embedded
systems where a small number of blocks
exchange moderate amounts of data.
 Buses implement a master-slave communication
scheme
 Active initiators along with passive target
modules are hooked to a shared
communication medium
 Typical masters:
 Processors
 DMA controllers
 Autonomous ASIC blocks,
 Typical slaves:
 Memories
 Co-processors
 Other peripherals
 Other components:
 Arbitration units: Grant the access to the
communication medium to one of the
competing master modules
 Decoder units: Activate the target module
based on the actual address and the address
map, which maps the target modules into
the bus address space
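A decoder unit can be sketched as a lookup over the address map. The map below (base addresses, sizes and target names) is entirely hypothetical and only illustrates the principle.

```python
# Hypothetical address map: (base address, size in bytes, target module)
ADDRESS_MAP = [
    (0x0000_0000, 0x1000_0000, "sram"),
    (0x4000_0000, 0x0000_1000, "uart"),
    (0x4000_1000, 0x0000_1000, "timer"),
]

def decode(address):
    """Return the target module selected by an address, or None if unmapped."""
    for base, size, target in ADDRESS_MAP:
        if base <= address < base + size:
            return target
    return None

print(decode(0x4000_0004))
print(decode(0x2000_0000))
```

In hardware the comparison is done in parallel for all targets; the loop here is just the simplest way to express the same mapping.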
 Bandwidth
 Is the premier performance metric
 Denotes the maximum transfer capacity of the
bus
 Available bandwidth is measured in bits per
second
 Corresponds to the number of parallel data wires
divided by the bus clock period
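Dividing by the clock period is the same as multiplying by the clock frequency, which gives a one-line formula. The 32-wire, 100 MHz bus is a made-up example.

```python
def bus_bandwidth_bps(data_wires, clock_hz):
    """Peak bandwidth: wires / clock period == wires * clock frequency."""
    return data_wires * clock_hz

# Hypothetical bus: 32 data wires clocked at 100 MHz.
peak = bus_bandwidth_bps(32, 100_000_000)
print(peak, "bits per second")
```

This is an upper bound; arbitration and address phases reduce the bandwidth actually available to a master.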
 Pipelining:
 Well known technique to improve the communication
throughput
 Clock frequency is limited by the critical path
 Inserting an additional pipeline stage into the critical
path allows a higher clock frequency
 Yields a higher communication bandwidth
 Since the address decoder is usually an integral part of the
critical path, bus transactions in high performance buses
are executed in separate address and data stages
 Burst modes:
 Improve communication throughput for the linear
access of subsequent addresses by a single
master
 Address counter is incremented automatically
 Next data item is transferred with every cycle
without renewed arbitration
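A simple cycle count shows where the gain comes from. The model below assumes one cycle each for arbitration, address and data phases and no pipelining between transfers; real buses differ, so treat the numbers as illustrative only.

```python
def single_transfer_cycles(n_items, arbitration=1, address=1, data=1):
    """Every item pays for arbitration, address and data phases."""
    return n_items * (arbitration + address + data)

def burst_cycles(n_items, arbitration=1, address=1, data=1):
    """One arbitration and one address phase, then one data item per cycle."""
    return arbitration + address + n_items * data

print(single_transfer_cycles(8), burst_cycles(8))
```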
 Unidirectional data links
 Distinguish on-chip buses from most on-board
buses
 The latter are based on tristate data wires to
maximize the utilization of expensive on-board
wires
 Hierarchy
 Common bus systems separate high
performance from low performance
communication
 Two buses with different speed
characteristics
 Multilayer bus architectures
 Provide dedicated point-to-point connections
between distinctive initiators and targets to
eliminate bandwidth bottlenecks
 The de-multiplexer required on the initiator side is
called the input stage; the corresponding multiplexer
on the target side is called the output stage
 Crossbar bus architectures:
 Provide multiple parallel resources between initiators
and targets
 Significantly improve the traffic throughput
 Degree of parallelism may vary from partial crossbar
to full crossbar architectures, where the latter
provides an individual resource for each connected
target
 Arbitration:
 Can be based on various algorithms,
 Simple round robin
 Fixed, Configurable or dynamic priority schemes
 Static or Dynamic Time Division Multiple Access
(TDMA).
 Even more advanced algorithms are known to
further improve the quality of service.
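Of the schemes listed, simple round robin is the easiest to sketch. The class name and interface below are invented for this example; the logic grants the bus to the next requesting master after the one served last.

```python
class RoundRobinArbiter:
    """Grants the bus to requesting masters in rotating order."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.last_grant = num_masters - 1  # so that master 0 is checked first

    def grant(self, requests):
        """requests: one bool per master. Returns the granted index or None."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_grant + offset) % self.num_masters
            if requests[candidate]:
                self.last_grant = candidate
                return candidate
        return None

arbiter = RoundRobinArbiter(3)
# Masters 0 and 2 request continuously; master 1 stays idle.
grants = [arbiter.grant([True, False, True]) for _ in range(3)]
print(grants)
```

No requesting master can be starved, but there is also no way to favor latency-critical traffic, which is what the priority and TDMA schemes address.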
 Locking of a bus:
 Locking by a single master is necessary to support
read-modify-write semaphore operations.
 This feature is required by most micro-controller
architectures that run operating systems
 Split transaction buses
 Allow the master to issue multiple requests without
waiting for a response, i.e. request and response are
separated
 Out-of-order execution
 Improves the bus throughput by reordering the sequence
of responses, depending on the availability of the slave
component
 This feature requires advanced state-machines in the
master modules to cope with non-deterministic
sequence of responses
 Physical Issues.
 Buses are implemented using a standard cell based semi-custom implementation
flow
 Transmission wires are not physically optimized, which leads to
timing closure issues and unreliable communication links.
 Examples of physical effects are crosstalk noise, electromagnetic
interference, and radiation-induced charge injection
 Synchronous Design.
 Most current bus architectures require all connected modules in a
single clock domain.
 Due to the parasitic capacitances of long bus wires, strong driver
transistors are necessary to achieve timing closure
 This increases power dissipation
 Future SoC designs will follow the Globally Asynchronous Locally
Synchronous (GALS) paradigm,
 Chip-wide wires will span multiple clock domains, which disqualifies
bus architectures as the future chip-level transport mechanism
 Traffic Management.
 Due to the rather simple arbitration mechanisms, shared buses
provide only rudimentary traffic management support.
 Since the communication pattern highly depends on the spatial
and temporal execution of the application tasks, meeting the
individual QoS requirements like throughput, jitter, or ordering
of the respective tasks is very challenging.
 This also causes the poor scalability of bus-based
communication infrastructures, since every change in the
traffic profile of one part of the application and every
additional component influences the other parts and requires
renewed balancing of the bus architectures.
 Interoperability.
 Although simple standard peripherals, like DMA, IRC, or
memories are available for respective bus systems, it is a
tedious and error-prone task to adapt complex IP blocks to a
specific bus architecture.
 So far, efforts to create standard bus interfaces have not been
successful
 The Networks on Chip (NoC) design
paradigm is an alternative on-chip
communication concept to cope with
the limitations of shared bus
architectures
 Aims to replace current ad hoc
wiring of IP blocks with a
disciplined approach where
full-scale on-chip networks provide
communication services according
to the ISO/OSI reference model
 Problems in on-chip
communication like signal integrity
issues, link reliability, or Quality of
Service (QoS) are separately
resolved on the respective OSI
layer
 The four lower layers of the OSI model are of interest
 Physical Layer
 deals with the electrical aspects of the data
transmission
 E.g. signal voltages, clock recovery, and pulse shape
 Data Link Layer
 provides a reliable data transfer over the physical link.
 Error detection by means of block codes and error
correction mechanisms like:
 Automatic Repeat Request (ARQ)
 Forward Error Correction (FEC)
 Network Layer
 implements the arbitration algorithms, buffering
strategies and flow-control mechanisms
 So, the network layer has a dominant impact on the
performance and functional behavior of the network.
 Transport Layer protocols
 establish and maintain end-to-end connections.
 The transport layer manages rate-based flow control,
performs packet segmentation and reassembly, and
ensures message ordering
 This abstraction hides the topology of the network,
and the implementation of the links that make up the
network
 The challenge in the development of Network-
on-Chip architectures is to combine the know-
how from both the networking and VLSI domain.
 Also the users of on-chip networks have to
understand basic networking principles:
 First the system architect has to specify design time
parameters of the selected NoC architecture like
topology, buffer sizes, arbitration algorithm.
 Later the platform programmer has to configure
runtime parameters like priorities, routing tables,
buffer management thresholds to take advantage of
the capabilities
 Transport layer is the first to provide
services which are independent of the
implementation of the network
 Enables the platform programmer to
develop embedded software independently
from the interconnect architecture
 A key ingredient in tackling the challenge of
decoupling the computation from
communication
 Interaction with the network becomes
deterministic, rather than prognostic or
reactive like in today’s bus based
communication architectures
 For complex multi-hop networks it is
difficult to provide uniform Quality of
Service (QoS) guarantees like lower
bandwidth bounds, or packet ordering for
the complete on-chip traffic
 To combine high resource utilization with
high QoS requirements of certain traffic
types, researchers in the field of computer
networks distinguish between guaranteed
and best-effort service classes
 Guaranteed Services
 Require resource reservation for worst-case scenarios
 Can be expensive as guaranteeing the throughput for a
stream of data implies reserving bandwidth for the peak
throughput, even when its average is much lower.
 So, resources are often underutilized
 Best-effort Services
 Do not reserve any resources, and hence provide no
guarantees.
 Best-effort services utilize resources well as they are
typically designed for average-case scenarios instead of
worst-case scenarios.
 Are also easy to configure,
 Require no resource reservation
 Main disadvantage: unpredictability of the effective
performance
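The cost of a guaranteed service can be expressed as the fraction of the reserved bandwidth that is actually used. The stream rates below are invented for illustration.

```python
def reserved_link_utilization(average_bps, peak_bps):
    """A guaranteed service reserves the peak rate but uses only the average."""
    return average_bps / peak_bps

# Hypothetical stream: 20 Mbit/s on average, 100 Mbit/s at its peak.
utilization = reserved_link_utilization(20e6, 100e6)
print(f"{utilization:.0%} of the reserved bandwidth is used")
```

A best-effort service would let other traffic use the remaining capacity, at the price of no longer being able to promise the peak rate.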
 The network layer is implemented by the
routing nodes of the NoC.
 Router-based network implementations are
classified by:
 Switching Mode
 Routing Mode
 Queuing
 Congestion Control
 Switching mode:
 Circuit switching
 Connections are set up by establishing a
conceptual physical path from a source to a
destination.
 Links can be shared between two connections
only at different points in time, by using the
time-division multiplexing (TDM) scheme
 Packet switching
 Data is divided into packets and every packet is
composed of a header and the payload.
 The header contains information that is used by
the router to switch the packet to the
appropriate output port
 Routing mode: applies to packet-switched networks and
defines the way packets are transmitted and buffered
between network nodes
 Store-and-forward
 An incoming packet is received and stored entirely before it is
forwarded to the next node.
 Worm-hole routing
 An incoming packet is forwarded as soon as the packet header is
evaluated and the next router guarantees that the complete packet
will be accepted.
 In case the next hop is blocked, the packet tail potentially blocks
other resources
 Virtual cut-through
 An incoming packet is forwarded as soon as the next router
guarantees, that the complete packet will be accepted.
 In case the next hop is blocked, the packet tail is stored in a local
buffer
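The latency difference between the routing modes can be estimated with a simple flit-level model. The sketch assumes that one flit crosses one link per cycle, that the header is a single flit, and that no blocking occurs; under those assumptions worm-hole routing and virtual cut-through behave the same.

```python
def store_and_forward_cycles(hops, flits):
    """Every hop waits for the whole packet before forwarding it."""
    return hops * flits

def cut_through_cycles(hops, flits):
    """The header advances one hop per cycle; the body follows in a pipeline."""
    return hops + (flits - 1)

# Hypothetical path: 4 hops, packet of 8 flits.
print(store_and_forward_cycles(4, 8), cut_through_cycles(4, 8))
```

The two modes only start to differ from each other once a hop is blocked, which is where the buffering behavior described above matters.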
 Queuing: Buffering strategies can be distinguished by the location of the buffers inside
the router.
 In the following, N denotes the number of bi-directional router ports.
 Input queuing:
 A router has a single input queue for every incoming link.
 Suffers from the so-called head-of-line blocking problem, i.e. the router utilization saturates at
about 59%,
 Weak link utilization.
 Output queuing:
 There are N output queues for every outgoing link, resulting in N^2 queues.
 Yields optimal performance,
 The costly N^2-fold storage and wiring effort prohibits the implementation for a large number of
ports
 Virtual output queuing:
 Combines the advantages of input queuing and output queuing
 Avoids the head-of-line blocking problem.
 Each input port maintains a separate queue for each output port
 Key factor in achieving high performance using VOQ switches is the scheduling algorithm
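Head-of-line blocking is easy to reproduce in a toy model of one input port. Here each queued packet is represented only by its destination port, and the two helper functions are hypothetical names for this sketch.

```python
from collections import deque

def fifo_forwardable(queue, busy_outputs):
    """Input queuing: only the head packet may leave the queue."""
    if queue and queue[0] not in busy_outputs:
        return queue[0]
    return None

def voq_forwardable(queue, busy_outputs):
    """Virtual output queuing: one queue per output, any free output is served."""
    voqs = {}
    for dest in queue:
        voqs.setdefault(dest, deque()).append(dest)
    for dest in voqs:
        if dest not in busy_outputs:
            return dest
    return None

arrivals = deque([2, 0, 1])  # destination port of each queued packet
busy = {2}                   # output 2 is currently occupied

print(fifo_forwardable(arrivals, busy))
print(voq_forwardable(arrivals, busy))
```

With a single FIFO the packet for the busy output stalls everything behind it; with per-output queues the packets for free outputs can still be forwarded. Which of the eligible queues is served first is exactly the scheduling decision mentioned above; this sketch simply takes the first one.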
 Congestion control:
 Packet switched networks without mechanisms for
bandwidth reservation may run into resource
contention and subsequent buffer overflow.
 Several solutions prevent packets from entering until
contention is reduced
 Packet discarding: Simply drops packets in case of buffer
overflow
 Credit based flow control: Packet loss is prevented in a
deterministic way, either by signaling congestion via
separate wires (back-pressure) or by having the receiver
regularly inform the sender about the available buffer space
(window).
 Rate based flow control: the sender gradually adjusts the
traffic generation rate in response to flow control
messages from the receiver. Rate based flow control has
to be implemented by the transport layer and potentially
suffers from instability due to long control loops
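Credit based flow control in its window form can be sketched with a counter on the sender side. The class below is a hypothetical, single-link model: the sender may only transmit while it holds credits, and the receiver returns a credit whenever it frees a buffer slot.

```python
class CreditLink:
    """One sender and one receiver connected by a credit-controlled link."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # sender's view of free receiver slots
        self.receiver_buffer = []

    def send(self, packet):
        """Transmit if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False              # nothing is dropped, the sender waits
        self.credits -= 1
        self.receiver_buffer.append(packet)
        return True

    def receiver_consume(self):
        """The receiver frees a slot and returns one credit to the sender."""
        packet = self.receiver_buffer.pop(0)
        self.credits += 1
        return packet

link = CreditLink(buffer_slots=2)
accepted = [link.send(p) for p in ("a", "b", "c")]
print(accepted)
```

The third packet is held back rather than discarded, which is the deterministic behavior that distinguishes this scheme from packet discarding.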
 Architectural trends
 Set the stage for the discussion of appropriate system
level design methodologies
 Processing elements
 Requirements for performance, power efficiency and
flexibility
 SIMD, VLIW, super-pipelining, and hardware
multi-threading exploit application-inherent
instruction-, data-, and task-level parallelism
 Communication: Bus Architectures Vs Network-on-
Chip
 Virtualization of architectural resources enables
’divide-and-conquer’
 Embedded control-plane processing tasks are executed
in the user space of the Real Time Operating System
(RTOS),
 Embedded data-plane processing tasks are executed on
HW multi-threaded processing elements
 Global communication of control- and data-plane
processing elements is performed by elaborated on-chip
networks
Delivered by:
Subhash Iyer,
Program Head,
Soft Polynomials (I) Pvt. Ltd., Nagpur
(CDAC ATC)
High Level Synthesis
Low Power Design
 At each level of circuit abstraction, the circuit is equivalent and
performs the same target operation, but its structural
components (and hence the component’s granularity) are
different, and the design issues may be different
 System level:
 Highest level circuit abstraction
 The system is specified as processes and tasks
 A mix of hardware and software.
 Concerned with overall system structure and information flow.
 Computer systems are described as an interconnected set of
processors, memories and switches
 Behavioral level, algorithmic level or high level
 Also called the instruction set level.
 Focus is on the computations performed by an individual processor;
i.e., the way it maps sequences of inputs to sequences of outputs
 Architecture, microarchitecture, RTL
 Viewed as a set of interconnected storage elements and functional
blocks.
 Behavior of the system is described as a series of data transfers and
transformations between the storage elements
 Microarchitectural-level representation of the chip resources, such as
adders and subtractors, is determined along with decisions such as
single-cycle, multicycle, pipelined or superscalar implementation
 Logic level
 System is described as a network of gates and flip-flops,
 Behavior is specified by logic equations
 Circuit is represented in the form of a netlist at which level logic
realizations of functional blocks are determined
 Circuit or transistor level
 Circuit is a netlist of transistors
 Decisions such as how and what types of transistors will be used,
complementary CMOS, pass transistors, etc. are the main issues
 Physical or layout level
 System is specified in terms of the individual transistors of which it is
composed
 Behavior of the system can be described in terms of the network
equations
 Lowest level of circuit abstraction
 Chip is a sequence of layers (masks), each layer of which is composed
of polygons.
 It is this level that is transferred to the manufacturing process
 Design automation terminology:
 Optimization
 Synthesis
 Analysis
 In circuit analysis, the behavior or
characteristics of a circuit are studied
 The task of synthesis is to take the
specifications of the behavior required for
a system and a set of constraints and goals
to be satisfied and to find a structure that
implements the behavior while satisfying
the goals and constraints
 Behavior, structure and physical design: 3
domains in which hardware is described
 “Behavior”:
 Refers to the ways in which the system or its
components interact with their environment
(mapping from inputs to outputs)
 interest is in what a design does, not in how it is
built
 “Structure”
 Refers to the set of interconnected components
that constitute the system (described by a netlist)
 Focus on constraints, such as area, cost and delay.
 “Physical” design
 Mapping of the structure onto the technology
 Ignores what the design is supposed to do
and binds its structure in space or to
silicon
 The automatic design process of VLSI circuits is called synthesis
 System-synthesis process partitions the tasks
into hardware, software and their
communications
 High-level synthesis process is the translation
from behavioral description to its equivalent
structural description
 Logic synthesis is the process of mapping
from the design at the RTL to a gate-level
representation that is suitable for input to
physical design
 Physical design then addresses aspects of chip
implementation
 Floor planning
 Placement
 Routing
 Extraction
 Performance analysis
 Output of physical design is the handoff
(“tapeout”) to manufacturing
 A generalized data stream, GDSII, stream file
 Verification of correctness
 Design rules
 Layout versus schematic
 Constraints (timing, power, reliability, etc.)
 During each phase of the synthesis process,
the functional equivalence of two
consecutive phases is to be checked to
ensure that they are functionally the same
 A power and timing analysis study can be
done by using compact models at the
transistor level
 At the physical level, more accurate power
and timing analysis is possible through the
extraction of accurate parasitics
 High-level synthesis is the translation process
from a behavioral description to a structural
description
 Analogous to “compilation” that
translates a high-level language
program in C/C++ to an assembly
language program
 HLS is also known as behavioral-level
synthesis or algorithmic-level
synthesis.
 Constraints to be considered in HLS
are:
 Area
 Performance
 Power consumption
 Reliability
 Testability
 Cost.
 HLS allows a design engineer
to make decisions at an early stage of
the design cycle, which helps to ensure a
correct design.
 Typical steps involved are scheduling,
binding, allocation, etc.
 Advantages:
 Continuous and reliable design flow
 From system-level abstraction to RTL abstraction automatically without manual handling
 Automatic translations from high-level specifications in the form of C or SystemC to RTL description of
the circuit in the form of VHDL or Verilog.
 Shorter design cycle
 More automation: faster designs, lower cost
 Fewer errors
 Synthesis process can be verified easily, so the chances of errors will be smaller.
 Correct design decisions at the higher levels of circuit abstraction can ensure that the errors are not
propagated to the lower levels, which are too detailed and costly to correct
 Easy and flexible to search the design space
 Synthesis system can produce several designs in a short time
 So, the designer has more flexibility to choose the proper design considering different trade-offs of
power, leakage, area and delay.
 Balanced degree of freedom for power optimization
 Power and performance optimization can be performed at any level of circuit abstraction
 As the level of abstraction goes lower, the complexity of the circuit increases
 Additionally, the degrees of freedom, and thus power reduction opportunities decrease
 Hence, high level or behavioral level is an attractive level and provides a balanced degree of freedom
for design space exploration.
 Documenting the design process
 Automated system can track design decisions and their effects
 Design debugging and continuation by third parties can be easily done
 Useful for macrocell-based design and the sale of designs as intellectual property cores
 Availability of circuit technology to more people
 Design expertise is moved into synthesis systems
 It becomes easier for a non-expert to produce a chip that meets a given set of specifications
 Cost of manpower required reduces
 The high-level synthesis process
takes a system description written in a
hardware description language
(HDL) as input and generates an
optimized RTL description by:
 Compilation
 Transformation
 Scheduling
 Allocation
 Binding
 Other steps
 Power optimization
 Leakage optimization
 Register optimization
 Interconnect optimization
 These take place during synthesis
either sequentially or along with the
fundamental steps
 No fixed sequence for
performing various high-level
synthesis tasks
 The tasks are interdependent:
decisions in one constrain the others
 Hence, these tasks should be
performed simultaneously for
effective optimization
 The behavior of a system to be synthesized is
usually specified at the algorithmic level using a
high-level programming language like C/C++ or a
hardware description language (HDL) such as
VHDL and Verilog.
 The behavior of the system is then compiled into
internal representations, which are usually data
flow graphs (DFGs) and control flow graphs
(CFGs).
 Each behavioral specification is transformed into
a unique graphical representation.
 The DFG is a directed graph that represents data
movement, whereas the CFG is a directed graph
that indicates the sequence of operations.
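As a rough illustration of the compilation step, the following sketch stores a DFG as a plain Python dictionary. The expression, the node names and the dictionary layout are assumptions chosen for this example; real HLS tools use richer internal representations, and the CFG is not modeled here.

```python
# Data flow graph for: y = (a + b) * (c - d)
# Each node maps an operation name to (operator, list of operand names).
dfg = {
    "t1": ("+", ["a", "b"]),
    "t2": ("-", ["c", "d"]),
    "y":  ("*", ["t1", "t2"]),
}

def predecessors(dfg, node):
    """Operations whose results this node consumes (its incoming data edges)."""
    return [src for src in dfg[node][1] if src in dfg]

def primary_inputs(dfg):
    """Operands that are not produced by any operation in the graph."""
    return sorted({src for _, srcs in dfg.values() for src in srcs if src not in dfg})

print(predecessors(dfg, "y"))
print(primary_inputs(dfg))
```

The edges of the graph are implied by the operand lists: the multiplication depends on the addition and the subtraction, which do not depend on each other.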
 In the transformation step, the initial DFG is
transformed so that the resultant DFG is more
suitable for scheduling and allocation.
 These transformations include compiler-like
optimizations such as dead-code elimination,
common sub-expression elimination, loop
unrolling, constant propagation and code
motion.
 In addition, some hardware-specific
transformations like minimization of syntactic
variances and retiming may be applied to take
advantage of the associativity and commutativity
of certain operations
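Two of the compiler-like transformations named above, constant propagation (with folding) and common sub-expression elimination, can be sketched on a dictionary-based DFG. The graph layout and function names are assumptions made for this example, the nodes are assumed to be listed in topological order, and commutativity is deliberately not exploited (a + b and b + a would not be merged).

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(dfg):
    """Replace operations whose operands are all known constants by their value."""
    values = {}   # node name -> constant value
    result = {}
    for name, (op, args) in dfg.items():          # assumes topological order
        args = [values.get(a, a) for a in args]   # propagate known constants
        if all(isinstance(a, int) for a in args):
            values[name] = OPS[op](*args)
        else:
            result[name] = (op, args)
    return result, values

def eliminate_common_subexpressions(dfg):
    """Merge operations that apply the same operator to the same operands."""
    seen = {}     # (op, operands) -> first node that computes it
    alias = {}    # duplicate node -> surviving node
    result = {}
    for name, (op, args) in dfg.items():          # assumes topological order
        args = [alias.get(a, a) for a in args]
        key = (op, tuple(args))
        if key in seen:
            alias[name] = seen[key]
        else:
            seen[key] = name
            result[name] = (op, args)
    return result, alias

dfg = {
    "k":  ("*", [2, 3]),        # constant expression
    "t1": ("+", ["a", "b"]),
    "t2": ("+", ["a", "b"]),    # same computation as t1
    "t3": ("*", ["t1", "k"]),
    "y":  ("-", ["t3", "t2"]),
}
folded, constants = fold_constants(dfg)
optimized, aliases = eliminate_common_subexpressions(folded)
print(optimized)
```

After both passes the graph has three operations instead of five, so fewer operations remain to be scheduled and bound.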
 Scheduling is the process of partitioning the set of
arithmetic and logical operations in the DFG into
groups so that the operations in the same group can
be executed concurrently, while taking into
consideration possible trade-offs between the total
execution cost and hardware cost.
 A group of concurrent computations to be executed
simultaneously is referred to as a control step.
 The total number of control steps needed to execute
all operations in the DFG, the minimum number of
functional units of each type to be used in the design
and the lifetimes of the variables generated during
the computation of operations are determined in the
scheduling step.
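A minimal sketch of one simple scheduling strategy, as-soon-as-possible (ASAP) scheduling, is given here. It assumes that every operation takes exactly one control step and that resources are unlimited; the DFG layout and function names are illustrative. The unit count it reports is what this particular schedule needs, not a guaranteed minimum over all schedules.

```python
from collections import Counter

def asap_schedule(dfg):
    """Assign each operation the earliest control step allowed by its data dependencies.

    Assumes every operation takes exactly one control step (a simplification).
    """
    step = {}
    def visit(node):
        if node not in step:
            preds = [p for p in dfg[node][1] if p in dfg]
            step[node] = 1 + max((visit(p) for p in preds), default=0)
        return step[node]
    for node in dfg:
        visit(node)
    return step

def units_needed(dfg, step):
    """Functional units per operator type needed to run each step's operations concurrently."""
    per_step = Counter((step[n], dfg[n][0]) for n in dfg)
    need = {}
    for (_, op), count in per_step.items():
        need[op] = max(need.get(op, 0), count)
    return need

# y = (a * b) + (c * d) - e
dfg = {
    "m1": ("*", ["a", "b"]),
    "m2": ("*", ["c", "d"]),
    "s":  ("+", ["m1", "m2"]),
    "y":  ("-", ["s", "e"]),
}
schedule = asap_schedule(dfg)
total_steps = max(schedule.values())
print(schedule, total_steps, units_needed(dfg, schedule))
```

Here both multiplications land in the first control step, so two multipliers are needed. Moving one multiplication to a later step would save a multiplier at the price of an extra control step, which is the execution cost versus hardware cost trade-off mentioned above.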
 Selection is the process of choosing resources from the
library, which involves tradeoffs according to different
features like delay, area, power and leakage.
 Resource allocation is the process of determining the
number of functional units of each type for performing
operations, memory units (registers) for storing data
values and interconnects for data transportation.
 Often, the selection and allocation processes are a single
task.
 Allocation is further divided into sub-tasks, such as
functional unit allocation, memory unit allocation and
interconnect allocation.
 Resource allocation and binding may share resources so
that the same hardware can be used to execute different
operations or so that the same register can be used to
store more than one variable.
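The allocation step can be sketched as a counting exercise over an existing schedule. The schedule, the variable lifetimes and the lifetime convention (a value written in step w and last read in step r must be stored across the step boundaries w to r-1) are assumptions made for this example; interconnect allocation is left out.

```python
from collections import Counter

# Hypothetical scheduled operations: name -> (operator, control step).
schedule = {
    "m1": ("*", 1),
    "m2": ("*", 1),
    "s":  ("+", 2),
    "y":  ("-", 3),
}
# Hypothetical lifetimes: name -> (step in which it is written, last step in which it is read).
lifetimes = {
    "m1": (1, 2),
    "m2": (1, 2),
    "s":  (2, 3),
    "e":  (0, 3),   # input held from the start until the final subtraction
}

def allocate_functional_units(schedule):
    """One unit per operator type for each operation that shares a control step."""
    busy = Counter((step, op) for op, step in schedule.values())
    units = {}
    for (_, op), count in busy.items():
        units[op] = max(units.get(op, 0), count)
    return units

def allocate_registers(lifetimes):
    """Registers needed = largest number of values that must be stored at the same time."""
    boundaries = range(min(w for w, _ in lifetimes.values()),
                       max(r for _, r in lifetimes.values()))
    return max(sum(1 for w, r in lifetimes.values() if w <= b < r) for b in boundaries)

functional_units = allocate_functional_units(schedule)
register_count = allocate_registers(lifetimes)
print(functional_units, register_count)
```

Four values are stored in total, but only three are alive at the same time, so three registers suffice once values with disjoint lifetimes share a register.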
 Binding or assignment is the process of assigning
operations to functional units, variables to memory
units and data transfers to interconnections.
 Binding is further divided into several sub-tasks, such
as functional unit binding, memory unit binding and
interconnect binding.
 Functional unit binding involves the mapping of
operations in the behavioral description into a set of
selected functional units.
 Memory unit binding maps data carriers (constants,
variables, arrays) in the behavioral description onto
storage elements (read-only memories, registers,
memory units) in the data path.
 The interconnect binding task maps every data
transfer in the behavior onto a set of interconnection
units for data routing.
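Memory unit binding can be sketched with a greedy, left-edge-style pass over variable lifetimes. This is one well-known heuristic and not necessarily the method used by any particular tool; the lifetimes and the half-open interval convention are assumptions made for this example.

```python
def left_edge_binding(lifetimes):
    """Bind variables to registers so that variables sharing a register never overlap.

    lifetimes maps a variable to (start, end): it occupies storage over the
    half-open interval [start, end).
    """
    registers = []          # each entry: list of variable names bound to that register
    last_end = []           # end of the last interval placed in each register
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for i, free_at in enumerate(last_end):
            if free_at <= start:            # register i is free again: reuse it
                registers[i].append(var)
                last_end[i] = end
                break
        else:                               # no register is free: open a new one
            registers.append([var])
            last_end.append(end)
    return registers

lifetimes = {"m1": (1, 2), "m2": (1, 2), "s": (2, 3), "e": (0, 3)}
binding = left_edge_binding(lifetimes)
print(binding)
```

The variables m1 and s end up in the same register because s is written only after m1 has been read for the last time.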
 In the output generation phase, design
output is generated.
 The output should be in a form such that
logic-level synthesis tools can optimize the
combinational logic and layout synthesis
tools can design the chip geometry.
 The generated output is generally in a low-
level HDL, such as structural VHDL
 Data Path Synthesis
 Control Synthesis
 The controller is typically a finite state machine
that is either microcoded or hardwired
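A controller of this kind can be sketched as a table-driven finite state machine with one state per control step. The state names and control signal names are hypothetical. The table lookup used here resembles a microcoded controller, where the table plays the role of the control store; a hardwired controller would implement the same transitions directly in logic.

```python
# Hypothetical control table: state -> (control signals asserted, next state).
CONTROL_TABLE = {
    "S1": ({"mul0_en", "mul1_en", "load_m1", "load_m2"}, "S2"),
    "S2": ({"add0_en", "load_s"}, "S3"),
    "S3": ({"sub0_en", "load_y"}, "DONE"),
    "DONE": ({"done"}, "DONE"),
}

def run_controller(start="S1", max_cycles=10):
    """Step the FSM one state per clock cycle and record the asserted signals."""
    trace = []
    state = start
    for _ in range(max_cycles):
        signals, next_state = CONTROL_TABLE[state]
        trace.append((state, sorted(signals)))
        if state == "DONE":
            break
        state = next_state
    return trace

trace = run_controller()
for state, signals in trace:
    print(state, signals)
```

Each row drives the data path for one control step: it enables the functional units that operate in that step and loads the registers that receive their results.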
 HLS is important for several reasons
 Reduction of design cycle time
 Rapid design space exploration at the higher level of
abstraction
 Wrong decisions are not propagated to lower levels of design
abstraction
 HLS involves several important steps, such as:
 Scheduling
 Allocation
 Binding
 Several graph theoretical algorithms are available that can
perform optimization while performing these tasks.
 Two types of synthesis
 Data path synthesis
 Control synthesis
 There are existing tools to perform high-level synthesis
explicitly, and some tools perform the behavioral to RTL
compilation as an intermediate process.
Introduction to SoC Design Methodology
 Design flow of
integrated circuits
 Application phase
 Implementation
phase
 Both are decoupled
 Application to
implementation
 A specification
document written by:
 Application team
 System architecture
specialist
 Ad-hoc and informal
approach
 Problems
 Ambiguity of the informal specification
document leads to misinterpretations and
implementation errors
 Lack of reliable performance information before
the implementation often causes an over- or
under-provisioning of processing and
communication resources
 Quality of results mainly depends on the intuition
and experience of the system architect
 Manual creation of the verification environment
requires significant effort and again represents a
potential source of inconsistencies with the
original design intent
 Electronic System Level
(ESL)
 Application is jointly
considered with the system
architecture to find a
feasible and cost-effective
application-to-architecture
mapping
 The declared goal of ESL
design is to increase the
engineering productivity and
quality of results during the
specification of the MP-SoC
platform architecture and
application mapping
 New design paradigm to cope
with the:
 complexity
 economics
of the emerging billion-transistor
System-on-Chip era.
 Architecture centric definition
 We define platform-based design
as the creation of a stable
microprocessor-based architecture
that can be rapidly extended,
customized for a range of
applications, and delivered to
customers for quick deployment
 Design process based definition
 The general definition of a
platform is an abstraction layer in
the design flow that facilitates a
number of possible refinements
into a subsequent abstraction
layer in the design flow
 Multiple, almost orthogonal phases
 Functional phase
 Performed by application specialists
 Completely agnostic to architectural considerations.
 Includes
 Embedded SW development of the control-plane portion
 Data-plane algorithm development
 The latter is carried out using highly application domain specific tools and methodologies
 MP-SoC platform phase
 All design tasks that have to be performed under consideration of the full functional and
architectural complexity of the MP-SoC platform
 Example
 Specification of the system-architecture
 Mapping of the application onto the MP-SoC platform
 Development of the hardware-dependent software layers
 High-level IP creation phase
 Design of processing elements (RISC, DSP, MCU, ASIPs)
 On-chip interconnect technologies (buses, NoC)
 Domain-specific standard I/O (PCI variants, SPIx variants, HyperTransport, I2C, FireWire, QDR,
etc.)
 Creation of well-defined ASIC IP blocks (e.g. an MPEG4 video codec)
 Not completely orthogonal to the functional phase, since the design of application specific
processing elements and communication IP indeed depends on the considered application
 Semiconductor technology and basic IP creation phase
 Covers standard cells, I/O, memories and the basic technology processes supporting them.
 More heterogeneous technologies, combining embedded DRAM, embedded Flash, mixed-signal
BiCMOS, RF, and analog
 More to do with fabrication technologies
 Represent the results of the
functional phase as a well
defined application model
as the Executable
Specification of the system
 System architecture needs
to be defined in terms of
mapping the application
model to the hardware
(Main Task)
 Embedded SW development
 Hardware-Software co-
verification task: RTL is
verified along with
embedded software
 Methodology used:
Transaction Level Modeling
(TLM)
 Engineering of integrated circuits has always employed
models on different levels of abstraction
 Model: unique, idealized description of the considered system
 Degree of abstraction characterizes the type of model used in
the respective design phase
 Goal of abstraction is to provide a description of the system,
 which is simple enough
 yet sufficiently accurate to enable the necessary investigations
 take design decisions
 proceed to the next design phase.
 Indeed, the design-flow of an embedded system can be
considered as a sequence of steps which successively
reduce the degree of abstraction in the system model
 Functionality refers to the modeling of the
system behavior
 On the highest level of abstraction, the
functionality is condensed to pure
mathematical expressions.
 Later the functionality is refined to
operators,
 Finally mapped to logic gates
 Timing model captures the temporal
properties of the system
 Degree of abstraction ranges from causality
of events to physical timing of transistors
and wires
 Data representation
 Higher level data resolution is reduced to
Tokens and Abstract Data Types (ADT)
 Lower levels employ word or bit
representations.
 The Component granularity describes the
finest resolution of the sub-blocks
 First the component resolution is restricted
to coarse-grain building blocks,
 Finally the complete embedded system is
composed of fine-grain silicon transistors.
 Creation of a system model
requires:
 Modeling language
 Well-defined execution semantics
coordinating the activation of the
individual blocks
 Model of Computation (MoC) is
composed of two parts:
 Coordination language describes
basic execution semantics with
respect to properties like
parallelism, synchronism,
reactivity and provides the
abstracted communication
mechanism
 The host language provides the
language elements for the
specification of the system
models
 Timed MoCs are characterized by the total temporal
ordering of all occurring communication events
 An example is the discrete event simulation MoC,
which defines the execution semantics for HDL
simulators
 Further examples of timed MoCs are synchronous
languages like Esterel, Lustre, or Signal, where
the events of all communication signals are
constrained to occur at identical time stamps
 Thanks to their sound mathematical foundation,
synchronous languages have gained adoption for
the specification, analysis and code-generation
of reactive control-dominated applications
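The discrete event MoC mentioned above can be sketched with a tiny kernel that always executes the pending event with the smallest time stamp. The class and function names are illustrative; real HDL simulators add delta cycles, signal update phases and much more.

```python
import heapq
import itertools

class Simulator:
    """Tiny discrete-event kernel: events are executed in time-stamp order."""
    def __init__(self):
        self.now = 0
        self._queue = []
        self._order = itertools.count()   # tie-breaker keeps equal-time events in FIFO order

    def schedule(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, next(self._order), action))

    def run(self):
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()

sim = Simulator()
log = []

def clock_edge():
    log.append((sim.now, "clk"))
    if sim.now < 20:
        sim.schedule(10, clock_edge)      # re-arm the clock every 10 time units

sim.schedule(5, lambda: log.append((sim.now, "reset released")))
sim.schedule(0, clock_edge)
sim.run()
print(log)
```

Although the reset event was scheduled first, it is executed after the clock edge at time 0, because the kernel imposes a total order by time stamp on all events.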
 Untimed MoCs are characterized by the fact that
communication events are only partially ordered
 Various untimed MoCs are popular for
the specification of both data- and control-
dominated applications
 Data-Flow MoCs are heavily employed for algorithmic
modeling and analysis of signal processing
applications
 Communicating Sequential Processes (CSP) and
Calculus of Communicating Systems (CCS) are
prominent untimed MoCs which are based on
sequential processes that communicate using a
rendezvous communication mechanism.
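A data-flow MoC can be sketched with actors that fire whenever enough tokens are waiting on their input FIFOs. The actor names, token rates and the simple run loop are assumptions made for this example. No time stamps are involved: the only ordering is the one imposed by token availability.

```python
from collections import deque

class Actor:
    """Data-flow actor: fires when every input FIFO holds enough tokens."""
    def __init__(self, name, func, inputs, outputs, consume=1):
        self.name, self.func = name, func
        self.inputs, self.outputs, self.consume = inputs, outputs, consume

    def can_fire(self):
        return all(len(q) >= self.consume for q in self.inputs)

    def fire(self):
        # Remove `consume` tokens from each input and pass them as one list per input.
        args = [[q.popleft() for _ in range(self.consume)] for q in self.inputs]
        result = self.func(*args)
        for q in self.outputs:
            q.append(result)

def run(actors):
    """Fire any enabled actor until none can fire; returns the firing order."""
    order = []
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                order.append(actor.name)
                progress = True
    return order

samples, sums, out = deque([1, 2, 3, 4]), deque(), deque()
# "pair_sum" consumes two samples per firing; "scale" consumes one sum per firing.
pair_sum = Actor("pair_sum", lambda xs: xs[0] + xs[1], [samples], [sums], consume=2)
scale = Actor("scale", lambda xs: 10 * xs[0], [sums], [out])
firing_order = run([scale, pair_sum])
print(list(out), firing_order)
```

The result does not depend on the order in which the actors are listed, since an actor can only fire once its input tokens exist.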
 The definition of a proper MoC has long been considered to
be the silver bullet for system level design and thereby for
solving the design productivity crisis
 Initially, the complete system functionality is to be created
using the ideal MoC, which provides highest modeling
efficiency, simulation speed, and smooth IP reuse
 Next, the initial specification would be automatically
verified using formal verification technology and metrics
like determinism, causality, dead-lock absence,
consistency, completeness, and fairness. The golden
system specification would then provide the foundation for
an automated path to design space exploration to take
functional and architectural design decisions
 Finally, system level synthesis would be applied to the
partitioned system specification providing an automated
path to implementation.
 Object Oriented Programming (OOP) is a powerful abstraction
mechanism
 Data and functionality are partitioned and encapsulated inside classes
 OOP-based languages: UML, C++, or Java
 Widely adopted in engineering of arbitrary SW
 Gaining importance for the specification of embedded control-plane
processing
 OOP components interact primarily by sequentially transferring
control through method calls
 Sequential nature of OOP hinders the intuitive specification,
analysis and refinement of the inherent parallel data-plane
processing tasks
 For this purpose the actor-oriented abstraction
scheme has been conceived, where parallel
objects interact by sending and receiving
messages
 Within an actor-oriented design environment,
the designer can focus on the specification and
analysis of the algorithmic behavior of the
individual tasks whereas the communication and
synchronization aspects are handled by the
underlying parallel Model of Computation
 SystemC allows Actor Oriented Programming
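The actor-oriented idea can be sketched in plain Python, with threads standing in for parallel objects and queues standing in for the message channels. This is only an analogy and does not use SystemC; the actor names and the use of None as an end-of-stream marker are assumptions made for this example.

```python
import queue
import threading

def producer(outbox):
    """Actor that sends three messages, then a stop marker."""
    for value in (1, 2, 3):
        outbox.put(value)
    outbox.put(None)            # None is used here as an end-of-stream marker

def doubler(inbox, outbox):
    """Actor that doubles every received message and forwards it."""
    while (msg := inbox.get()) is not None:
        outbox.put(2 * msg)
    outbox.put(None)

a_to_b, b_to_main = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=producer, args=(a_to_b,)),
    threading.Thread(target=doubler, args=(a_to_b, b_to_main)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

results = []
while (msg := b_to_main.get()) is not None:
    results.append(msg)
print(results)
```

Neither actor calls a method on the other. Each one describes only its own behavior, and the queues take care of communication and synchronization.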
 Actor-based design languages achieve high modularity in communication modeling by
using the Interface Method Call (IMC) principle
 IMC mechanism is realized by a set of language elements for
 Modules
 Ports
 Interfaces
 Channels.
 Processes modeling the behavior are wrapped into modules and access communication
services through ports
 Available methods are
 Declared in the interface specification
 Implemented by the channel
 Thus the access methods in an interface reflect the specialized properties of the
communication style implemented by a particular channel
 Actor-oriented design languages offer a generic Model of Computation, which in case of
SystemC is based on an event driven simulation kernel
 Channels serve as containers for communication and synchronization
 The user can extend the generic MoC by creating their own methodology-specific channel
library
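The roles of interface, channel, port and module can be sketched in plain Python. This is an analogy to the Interface Method Call principle and not SystemC code; all class names are hypothetical, and processes are reduced to ordinary methods that run to completion.

```python
from abc import ABC, abstractmethod
from collections import deque

class WriteIf(ABC):
    """Interface: declares the methods available to a writer."""
    @abstractmethod
    def write(self, value): ...

class ReadIf(ABC):
    """Interface: declares the methods available to a reader."""
    @abstractmethod
    def read(self): ...

class FifoChannel(WriteIf, ReadIf):
    """Channel: implements both interfaces and holds the communication state."""
    def __init__(self):
        self._items = deque()
    def write(self, value):
        self._items.append(value)
    def read(self):
        return self._items.popleft()

class Port:
    """Port: access point that exposes only the methods of its interface."""
    def __init__(self, interface):
        self._interface, self._channel = interface, None
    def bind(self, channel):
        if not isinstance(channel, self._interface):
            raise TypeError("channel does not implement the required interface")
        self._channel = channel
    def __getattr__(self, name):
        # Reached only for names not found on the port itself.
        if name.startswith("_") or not hasattr(self._interface, name):
            raise AttributeError(name)
        return getattr(self._channel, name)

class Producer:
    """Module: wraps a process and reaches the channel only through its port."""
    def __init__(self):
        self.out = Port(WriteIf)
    def process(self):
        for v in (10, 20):
            self.out.write(v)

class Consumer:
    """Module: reads two values through its input port."""
    def __init__(self):
        self.inp = Port(ReadIf)
    def process(self):
        return [self.inp.read(), self.inp.read()]

channel = FifoChannel()
producer, consumer = Producer(), Consumer()
producer.out.bind(channel)
consumer.inp.bind(channel)
producer.process()
received = consumer.process()
print(received)
```

Replacing FifoChannel with another channel that implements the same interfaces would change the communication style without touching either module.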
 Challenge of System Level Design
 The architecture definition and application
mapping have to be considered jointly by taking
the full functional and architectural complexity
into account
 In case of a fixed target platform, SLD is
reduced to the application mapping task,
which is also called the
partitioning of the application
 Orthogonalization of concerns with respect to all modeling attributes
generally enables a divide-and-conquer approach to System Level
Design
 Separation of interfaces and behavior according to the interface
based design paradigm fosters successive communication and structural
refinement as well as IP reuse
 High modeling efficiency and simulation speed are mandatory to
handle the high complexity of SoC designs
 Incorporation of hardware specific concepts like timing, reactivity,
parallelism, and determinism to express the impact of the platform
architecture
 Incorporation of software specific concepts like Object Oriented
Programming, Operating System (OS) encapsulation, Inter Process
Communication (IPC), process concurrency, as well as the creation,
mutual preemption, and termination of processes to enable smooth
integration of the embedded Software part.
 Support for validation and verification: first, to gain
evidence at the highest possible level of abstraction that the correct
system is being developed and all performance and cost requirements
are met (validation). Later, the validated specification should be reused
as a golden reference model for the subsequent refinement, IP
integration and implementation steps (verification).
 Seamless transition between design phases and abstraction levels
from system to gates to avoid long iteration cycles caused by gaps in
the design flow.
Question Remains - - - How to do it???
More design aspects
 HW/SW Co-simulation has been recognized as a
necessary ingredient for HW/SW Co-design.
 First HW/SW Co-simulation prototypes linked
Hardware Description Language (HDL) simulators to
an ISS (Instruction Set Simulators) executing the
Software part.
 Soon, HDL/ISS co-simulation environments
became commercially available and are still widely
employed.
 This HDL/ISS approach is severely limited by the slow
simulation speed of the HDL simulator, especially in
case of large systems with several ISSes and
significant hardware portions.
 The concept of flexible hardware abstraction levels
has been developed,
 Here accuracy can be traded against simulation
speed.
 Maximum simulation speed can be achieved
by using compiled ISS technology together
with highly abstract functional SystemC
models of the hardware part
 The original goal of HW/SW Co-design was to reach the same
degree of tool automation known from RTL synthesis, i.e. a
formalized system specification is automatically partitioned and
synthesized to the optimal target architecture
 Automated HW/SW partitioning and system synthesis have never
gained industrial relevance
 Partitioning decision metric is restricted to worst case execution time
 Other important metrics like average performance, cost, and power
dissipation are not taken into account.
 Even the worst case execution time proved to be hard to estimate in
the general case of parallel, data dependent, and interleaved software
execution
 HW/SW partitioning and automated synthesis are still not
recognized as a dominant issue
 System architects are interested in the performance impact of
a specific target architecture
 To partly automate this mapping,
 Communication Synthesis
 HW/SW Interface Synthesis
emerged as new branches of HW/SW Co-design
 Techniques for the analysis of communication requirements and synthesis
of the communication architecture
 As of today, Communication Analysis and Synthesis techniques need
further advancement to cope with emerging Network-on-Chip
architectures.
 One attempt is to instantiate the NoC library elements (routers, network
interfaces, links) from a high-level view of the SoC floorplan
 Selection of the actual library elements can be done in different ways:
 In an application-centric approach, the network topology can be generated from a
communication graph of the application
 In an architecture-centric approach, the communication architecture can be
refined from an abstract channel view via a network topology view towards a
micro-architecture view.
 So far the analysis of Network on Chip architectures is performed using
handcrafted simulation models, which are mostly based on SystemC
 The absence of standardized APIs, abstraction levels and modeling
frameworks beyond the plain SystemC language so far hinders the
creation of interoperable IP models for NoC architectures.
 Some of the current projects working on a unified modeling environment
for the exploration of NoC architectures are discussed in section 5.3.3
below.
 Here, the designer decides on the
partitioning and architecture mapping
 The realization of these decisions is
supported by automating the tedious task of
generating the required software driver
functions as well as the hardware glue logic
 Recently the technology has been ported to
SystemC
 MP-SoC platform phase is
concerned with:
 System architecture specification
 Application mapping
 Abstraction concepts on this level
have to support the joint
consideration of application and
architecture
 High level of detail inherent to
Register Transfer Level (RTL)
implementation models prohibits
the investigation and optimization
across heterogeneous
communication and processing
elements
 Significant research effort has been spent
on the definition of an
appropriate System Level Design
language.
 Today SystemC is generally
considered as the standard
language for all kinds of SLD tasks.
 SystemC was initially conceived to replace VHDL and
Verilog as a Hardware Description Language
 For this reason it naturally provides all hardware specific
concepts e.g., time, parallelism, and hierarchy
 With version 2.0 SystemC has been thoroughly revised to
become a fully elaborated actor oriented design language
 The incorporated Interface Method Call (IMC) principle
enables a clean separation of interfaces and behavior as
well as orthogonalization of further modeling attributes
 All kinds of methodology and application domain specific
Models of Computation (MoC) can be implemented on top
of the generic event-driven SystemC simulator
 SystemC 2.0 enables a smooth transition from functional
phase to the MP-SoC platform phase, e.g. hybrid
simulation of an architecture model in the context of an
algorithmic Data-Flow model
 Since SystemC is a native C++ library, it
inherently supports Object Oriented
Programming
 Version 2.1 of the language has become
an official IEEE standard (IEEE Std 1666-2005)
 Development of the Transaction Level
Modeling (TLM) kit
 Synthesizable subset of SystemC
 The characteristic property of TLM:
 Pin-level communication interface of RTL models
replaced by a set of interface methods.
 This IMC based communication mechanism is
provided by all actor-oriented specification
languages
 SystemC based TLM has demonstrated the potential
in terms of increased simulation speed and modeling
efficiency
 The basic TLM API consists of a bidirectional transport
and a set of unidirectional put and get interfaces
 The bidirectional transport has blocking
synchronization
 Implementation of the interface is allowed to call
wait(.)
 The unidirectional interfaces are available in a
blocking and a non-blocking version
 These interfaces can be seen as a foundation layer for
the creation of more advanced TLM interfaces, which
serve a specific methodology or model a specific
communication protocol
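The shape of these interfaces can be sketched in plain Python. The classes are stand-ins written for this example and are not the SystemC TLM classes; the method names transport, put, get, nb_put and nb_get are chosen to mirror the interface names in the text, and the request tuple layout is an assumption.

```python
import queue

class Fifo:
    """Unidirectional channel offering blocking and non-blocking put/get."""
    def __init__(self, depth):
        self._q = queue.Queue(maxsize=depth)
    def put(self, item):            # blocking: waits while the FIFO is full
        self._q.put(item)
    def get(self):                  # blocking: waits while the FIFO is empty
        return self._q.get()
    def nb_put(self, item):         # non-blocking: reports failure instead of waiting
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            return False
    def nb_get(self):               # non-blocking: returns (ok, item)
        try:
            return True, self._q.get_nowait()
        except queue.Empty:
            return False, None

class Memory:
    """Target offering a bidirectional transport: request in, response out."""
    def __init__(self):
        self._data = {}
    def transport(self, request):
        kind, addr, value = request
        if kind == "write":
            self._data[addr] = value
            return ("ok", None)
        return ("ok", self._data.get(addr, 0))

mem = Memory()
mem.transport(("write", 0x10, 42))
status, data = mem.transport(("read", 0x10, None))

fifo = Fifo(depth=1)
first = fifo.nb_put("a")     # succeeds, the FIFO is now full
second = fifo.nb_put("b")    # fails without blocking
ok, item = fifo.nb_get()
print(status, data, first, second, ok, item)
```

The transport call returns only when the response is available, which corresponds to the blocking synchronization described above; the non-blocking variants leave it to the caller to decide what to do when the channel is not ready.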
 The two cycle-level TLM
layers
 Bus Accurate (BA)
 Cycle Callable (CC)
 These levels are
particularly suitable to
create a cycle-accurate
prototype of the system
architecture
 The (usually cycle-
accurate) Instruction Set
Simulators (ISS) of the
programmable
architectures are
connected to cycle- and
bit-accurate models of
memories, communication
resources and peripherals
 BA and CC difference:
 BA captures a transaction
within a single method call,
 CC models provide separate
methods for every phase of
a transaction.
 The Programmer’s View
(PV) abstraction levels
address early integration
of (usually instruction
accurate) ISSes for SW
development purposes
 PV provides a bit and
address-map accurate view
of the MP-SoC architecture
context for the
programmable processing
elements
 PV is based on the
bidirectional blocking
transport API
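An address-map accurate PV platform can be sketched as a bus model that decodes the address and forwards a blocking transport call to the mapped target. The targets, base addresses and register values are made up for this example, and no timing is modeled, which matches the untimed nature of the PV level.

```python
class Ram:
    """Byte-addressable memory target."""
    def __init__(self, size):
        self._bytes = bytearray(size)
    def transport(self, is_write, offset, value=0):
        if is_write:
            self._bytes[offset] = value & 0xFF
            return 0
        return self._bytes[offset]

class Timer:
    """Peripheral with one read-only register at offset 0 (hypothetical)."""
    def __init__(self):
        self.ticks = 7
    def transport(self, is_write, offset, value=0):
        return 0 if is_write else self.ticks

class PvBus:
    """Routes each access to a target by address and returns when it has completed."""
    def __init__(self):
        self._map = []                      # entries: (base, size, target)
    def attach(self, base, size, target):
        self._map.append((base, size, target))
    def transport(self, is_write, addr, value=0):
        for base, size, target in self._map:
            if base <= addr < base + size:
                return target.transport(is_write, addr - base, value)
        raise ValueError(f"no target mapped at address {addr:#x}")

bus = PvBus()
bus.attach(0x0000_0000, 0x100, Ram(0x100))
bus.attach(0x4000_0000, 0x4, Timer())
bus.transport(True, 0x20, 0xAB)                 # write one byte to RAM
ram_value = bus.transport(False, 0x20)          # read it back
timer_value = bus.transport(False, 0x4000_0000)
print(hex(ram_value), timer_value)
```

Software running on an instruction set simulator would see the same addresses and register contents as on the real device, which is what makes this level useful for early software development.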
 The Open Core Protocol International Partnership
(OCP-IP) is getting a lot of traction throughout the
industry
 OCP-IP provides a highly configurable SoC protocol, and
its System Level Design working group has worked
on Transaction Level Modeling from the early days
 Lowest level: Transaction Layer 1 (TL1)
 provides a fully cycle accurate model of the OCP protocol
 Fully aligned with the CC abstraction level from OSCI.
 Next higher level: Transaction Layer 2 (TL2)
 Represents basically a cycle-approximate abstraction of the OCP protocol.
 The API contains a large number of OCP-specific features,
 such as thread-busy, handshake timing, or sideband signals.
 The timing is not cycle accurate, but can be annotated to a near-cycle accurate level
 Highest Level: Transaction Layer 3 (TL3)
 protocol agnostic subset of TL2
 API is limited to a concise set of primitives,
 Models timing-approximate on-chip communication
 PV TLM platforms for early SW development as well
as cycle-level TLM for HW/SW and TLM/RTL co-
verification are successfully deployed throughout the
industry
 However, both use-cases solve only parts of the
challenges during the MP-SoC design phase
 Especially the architecture definition and task
partitioning are not adequately addressed
 PV platforms simulate very fast and are well suited
for SW development
 Unfortunately they do not contain sufficient timing
information for architectural investigations
 The blocking semantics of the underlying
bidirectional transport API hinders the smooth
annotation of further timing information
 Cycle accurate models of the SoC platform are too
detailed and too slow for architecture definition and task
partitioning
 First, the effort to create such a cycle-accurate model of
the complete platform is far too high to allow for the
investigation of a large number of architecture and
application mapping alternatives
 Second, the reachable simulation speed in the order of
100k cycles per second is not sufficient for the analysis of
a large number of design parameter choices
 As a result, the exploration of broad design spaces is still a
cumbersome process in cycle-level TLM based design flows
 Cycle-level TLM communication models have architecture
specific interfaces.
 Thus, every time the designer wants to explore a new
communication architecture, the interface of the
connected functional models has to be changed
 For this reason the Design Space Exploration framework
deploys a generic synchronization interface, which
provides the same primitives as the newly standardized
OCP TL3 API
 Obviously, the TL3 API presents the best fit for this purpose
 It is compliant with the OSCI TLM standard
 Additionally, it is of reasonable complexity, and yet offers
sufficient expressiveness to meet the accuracy
requirements for design space exploration
 By deploying SystemC based Transaction Level Modeling
the framework is nicely integrated into the flourishing ESL
ecosystem.
 This method is interoperable with the PV and cycle-
accurate modeling methodologies and can benefit from the
commercial tool support, available IP models, and
established ESL design methodologies
 Component Based Design
 Founded on the assumption that the processing elements and
communication templates are available IP blocks
 Communication Based Design:
 envisions MP-SoC platform design as a composition of reusable
IP blocks
 Different from Component Based Design
 Omits the consideration of processing elements
 Is exclusively focused on the conceptualization and
implementation of the communication architecture.
 Communication Based Design can be seen as the corresponding
design paradigm to match emerging NoC architectures.
 Design Space Exploration (DSE) Environment
 The goal is to take early design decisions with respect to
system architecture and application mapping on the basis of an
abstract performance model.
 The embedded application needs to be modeled together with
the MP-SoC architecture at a high level of abstraction
Introduction to SoC
Design Space Exploration (DSE)
Methodology
 Ultimate goal is to meet the System Level Design
requirements as specified and to cope with the
full architectural complexity of emerging MP-SoC
architectures
 MP-SoC Framework follows the y-chart
principle
 Set of functional application models is
merged with a set of architecture
models in a dedicated mapping step
 The developed embodiment of the y-chart
principle is called Virtual Architecture
Mapping (VAM), which comprises:
 Well defined abstraction level above
cycle-level TLM for efficient modeling
of embedded applications
 Set of generic, parameterizable
architecture models, which capture the
notion of shared and resource limited
architectural fabrics for communication
and computation
 Rigorous definition of a timing model,
that embodies the performance of a
selected application-architecture-
mapping
 MP-SoC simulation framework featuring
a declarative mapping mechanism to
minimize turn-around times during the
iterative architecture exploration cycle
 Comprehensive set of analysis tools for
functional and performance validation
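The y-chart idea of combining an application model with an architecture model in a separate mapping step can be sketched as follows. This is not the VAM framework itself: the task workloads, the processing element speeds and the very coarse timing model (tasks on the same element run one after another, task dependencies are ignored) are assumptions made for this example.

```python
from itertools import product

# Application model: task -> abstract workload (operation count, made-up numbers).
tasks = {"fft": 400, "filter": 200, "control": 50}
# Architecture model: processing element -> operations per time unit (made-up numbers).
elements = {"dsp": 4, "risc": 1}

def makespan(mapping):
    """Completion time when tasks mapped to the same element run one after another
    and different elements run in parallel (task dependencies are ignored)."""
    load = {pe: 0.0 for pe in elements}
    for task, pe in mapping.items():
        load[pe] += tasks[task] / elements[pe]
    return max(load.values())

def explore():
    """Evaluate every task-to-element mapping and return them sorted by makespan."""
    results = []
    for choice in product(elements, repeat=len(tasks)):
        mapping = dict(zip(tasks, choice))   # the mapping is plain data, kept apart from both models
        results.append((makespan(mapping), mapping))
    return sorted(results, key=lambda r: r[0])

best_time, best_mapping = explore()[0]
print(best_time, best_mapping)
```

Because the mapping is plain data, trying another alternative changes neither the application model nor the architecture model, which is what keeps the turn-around time of each exploration iteration short.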

Soc - Intro, Design Aspects, HLS, TLM

  • 12.
     Offer largelogic capacity, exceeding several million equivalent logic gates, and include dedicated memory resources  Include special hardware circuitry that is often needed in digital systems, such as digital signal processing (DSP) blocks (with multiply and accumulate functionality) and phase-locked loops (PLLs) (or delay-locked loops (DLLs)) that support complex clocking schemes  Support a wide range of interconnection standards, such as double data rate (DDR SRAM) memory, PCI and high-speed serial protocols. 12Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 13.
    13Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 14.
     Introduction  Whatis SoC ?  SoC characteristics  Benefits and drawbacks  Solution  Major SoC Applications  Summary 14Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 15.
    Top Level Design UnitBlock Design Integration and Synthesis Trial Netlists System Level Verification Timing Convergence & Verification Fabrication DVT DVT Prep 6 12 12 4 14 ?? 5 8 Time in Weeks Time to Mask order48 61 Unit Block Verification ASIC Typical Design Steps • Typical ASIC design can take up to two years to complete 15Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 16.
    Top Level Design UnitBlock Design Integration and Synthesis Trial Netlists System Level Verification Timing Convergence & Verification Fabrication DVT DVT Prep 4 14 5 4 Time in Weeks Time to Mask order24 33 Unit Block Verification 4 2 • With increasing Complexity of IC’s and decreasing Geometry, IC Vendor steps of Placement, Layout and Fabrication are unlikely to be greatly reduced • In fact there is a greater risk that Timing Convergence steps will involve more iteration. • Need to reduce time before Vendor Steps. • Need to consider Layout issues up-front. SoC Typical Design Steps 16Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 17.
     Design reuseis facilitated if “standard” internal connection buses are used .  All cores connect to the bus via a standard interface .  Any-to-any connections easy but …  Not all connections are necessary .  Global clocking scheme .  Power consumption .  Standardization is being addressed by the Virtual Socket Interface Alliance (VSIA) 17Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 18.
    • AMBA (AdvancedMicrocontroller Bus Architecture) is a collection of buses from ARM for satisfying a range of different criteria. • APB (Advanced Peripheral Bus): simple strobed- access bus with minimal interface complexity. Suitable for hosting peripherals. • ASB (Advanced System Bus): a multimaster synchronous system bus. • AHB (Advanced High Performance Bus): a high- throughput synchronous system backbone. Burst transfers and split transactions. 18Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 19.
    • One solutionto the design productivity gap is to make ASIC designs more standardized by reusing segments of previously manufactured chips. • These segments are known as “blocks”, “macros”, “cores” or “cells”. • The blocks can either be developed in- house or licensed from an IP company. • Cores are the basic building blocks . 19Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 20.
    • Soft Macro –Reusable synthesizable RTL or netlist of generic library elements – User of the core is responsible for the implementation and layout • Firm Macro – Structurally and topologically optimized for performance and area through floor planning and placement – Exist as synthesized code or as a netlist of generic library elements • Hard Macro – Reusable blocks optimized for performance, power, size and mapped to a specific process technology – Exist as fully placed and routed netlist and as a fixed layout such as in GDSII format . 20Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 21.
    Reusability portability flexibility Predictability, performance, timeto market Soft core Firm core Hard core 21Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 22.
    • Locating therequired cores and associated contract discussions can be a lengthy process – Identification of IP vendors – Evaluation criteria – Comparative evaluation exercise – Choice of core – Contract negotiations • Reuse restrictions • Costs: license, royalty, tool costs – Core integration, simulation and verification 22Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 23.
    23Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 24.
     MPSoC isa system-on-chip that contains multiple instruction-set processors (CPUs).  The typical MPSoC is a heterogeneous multiprocessor: there may be several different types of processing elements (PEs), the memory system may be heterogeneously distributed around the machine, and the interconnection network between the PEs and the memory may also be heterogeneous.  MPSoCs often require large amounts of memory. The device may have embedded memory on-chip as well as relying on off-chip commodity memory. 24Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 25.
     These chips have: •one (several) processors • large amounts of memory • bus-based architectures • peripherals • coprocessors • and I/O channels 25Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 26.
     Introduction  Whatis SoC ?  SoC characteristics  Benefits and drawbacks  Solution  Major SoC Applications  Summary 26Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 27.
    • There areseveral benefits in integrating a large digital system into a single integrated circuit . • These include – Lower cost per gate . – Lower power consumption . – Faster circuit operation . – More reliable implementation . – Smaller physical size . – Greater design security . 27Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 28.
    • The principledrawbacks of SoC design are associated with the design pressures imposed on today’s engineers , such as : – Time-to-market demands . – Exponential fabrication cost . – Increased system complexity . – Increased verification requirements . 28Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 29.
     Why doesit take longer to design SOCs compared to traditional ASICs?  We must examine factors influencing the degree of difficulty and Turn Around Time (TAT) (the time taken from gate-level netlist to metal mask-ready stage) for designing ASICs and SOCs.  For an ASIC, the following factors influence TAT: • Frequency of the design • Number of clock domains • Number of gates • Density • Number of blocks and sub-blocks  The key factor that influences TAT for SOCs is system integration (integrating different silicon IPs on the same IC). 29Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 30.
     Introduction  Whatis SoC ?  SoC characteristics  Benefits and drawbacks  Solution  Major SoC Applications  Summary 30Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 31.
    • Overcome complexityand verification issues by designing Intellectual Property (IP) to be re- usable . • Done on such a scale that a new industry has been developed. • Design activity is split into two groups: – IP Authors – producers . – IP Integrators – consumers . • IP Authors produce fully verified IP libraries – Thus making overall verification task more manageable • IP Integrators select, evaluate, integrate IP from multiple vendors – IP integrated onto Integration Platform designed with specific application in mind 31Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 32.
    32Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 33.
    IP cores areclassified into three distinct categories:  Hard IP Cores  Firm IP Cores  Soft IP Cores 33Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 34.
    Hard IP coresconsist of hard layouts using particular physical design libraries and are deliverid in masked-level designed blocks (GDSII format). The integration of hard IP cores is quite simple, but hard cores are technology dependent and provide minimum flexibility and portability in reconfiguration and integration. 34Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 35.
    Soft IP coresare delivered as RTL VHDL/Verilog code to provide functional descriptions of IPs. These cores offer maximum flexibility and reconfigurability to match the requirements of a specific design application, but they must be synthesized, optimized, and verified by their user before integration into designs. 35Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 36.
    Firm IP coresbring the best of both worlds and balance the high performance and optimization properties of hard IPs with the flexibility of soft IPs.These cores are delivered in form of targeted netlists to specific physical libraries after going through synthesis without performing the physical layout. 36Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 37.
    Resusability portability flexibility Predictability, performance, timeto market Soft core Firm core Hard core 37Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 38.
    38Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 39.
     Introduction  Whatis SoC ?  SoC characteristics  Benefits and drawbacks  Solution  Major SoC Applications  Summary 39Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 40.
    eS/W: Current applicationcomplexity  Set-top box: >1 million lines of code  Digital audio processing: >1 million lines of code  Recordable DVD: Over 100 person-years effort  Hard-disk drive: Over 100 person-years effort In multimedia systems  S/W cost (licenses) 6X larger than H/W chip cost  eS/W uses 50% to 80% of design resources  eS/W now an essential part of SoC products 40Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 41.
     Speech SignalProcessing .  Image and Video Signal Processing .  Information Technologies  PC interface (USB, PCI,PCI-Express, IDE,..etc) Computer peripheries (printer control, LCD monitor controller, DVD controller,.etc) .  Data Communication  Wireline Communication: 10/100 Based-T, xDSL, Gigabit Ethernet,.. Etc  Wireless communication: BlueTooth, WLAN, 2G/3G/4G, WiMax, UWB, …,etc 41Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 42.
    • Consumer devices, •Networking, • Communications, and • other segments of the electronics industry. microprocessor, media processor, GPS controllers, cellular phones, GSM phones, smart pager ASICs, digital television, video games, PC-on-a-chip 42Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 43.
    43Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 44.
    44Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 45.
    45Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd.
  • 46.
    Systems on chipare everywhere Technology advances enable increasingly more complex designs Central Question: how to exploit deep-submicron technologies efficiently? 46Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 47.
     Introduction  Whatis SoC ?  SoC characteristics  Benefits and drawbacks  Solution  Major SoC Applications  Summary 47Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 48.
     Technological advancesmean that complete systems can now be implemented on a single chip .  The benefits that this brings are significant in terms of speed , area and power .  The drawbacks are that these systems are extremely complex requiring amounts of verification .  The solution is to design and verify re- useable IP . 48Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 49.
    Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd. 49
  • 50.
    Delivered by: Subhash Iyer, ProgramHead, Soft Polynomials (I) Pvt. Ltd., Nagpur (CDAC ATC) Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 1
  • 51.
    Introduction to SoCDesign Aspects 2Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd.
  • 52.
     At eachlevel of circuit abstraction, the circuit is equivalent and performs the same target operation, but its structural components (and hence the component’s granularity) are different, and the design issues may be different Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 3
  • 53.
     Embedded applicationsin multimedia, wireless communications or networking domain were implemented on Printed Circuit Boards (PCBs).  Composed of discrete Integrated Circuits (ICs)  General Purpose Processors  Digital Signal Processors  Application Specific Integrated Circuits  Memories  Further peripherals.  Communication between discrete processing elements and memories is realized by shared bus architectures (like PCi Express) Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 4
  • 54.
     The transitionis from board level integration towards System-on- Chip (SoC) implementations of embedded applications.  Today multiple heterogeneous processing elements and memories can be integrated on a single chip  Increased performance  Reduced cost  Improved energy efficiency  This trend originates from tremendous increase in features as well as the multitude of co-existing standards.  Resulting functional complexity clearly promotes Software enabled solutions to achieve the required flexibility and cope with the demanding time-to-market conditions.  However, stringent energy efficiency constraints of mobile applications and cost sensitive consumer devices prohibit the use of general purpose processors.  Tight cost and performance requirements of versatile embedded systems lead to application specific heterogeneous multi- processor architectures Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 5
  • 55.
    Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd. 6  Classical vertical partitioning approach to HW/SW Codesign, where the performance critical parts are implemented as dedicated HW blocks and the rest is executed in SW, is no longer applicable.  Instead HW/SW Co-design can be seen as:  Multi-dimensional horizontal mapping problem of an application running on a heterogeneous multiprocessor platform.  During the mapping process,  Exploit application inherent parallelism to achieve performance at reasonable cost.  For the computationally intensive portions of typical embedded applications the extraction of Task Level Parallelism (TLP) is mostly straight forward:  The partitioning into a set of loosely coupled functional blocks can be naturally derived from the algorithmic block diagram
  • 56.
     Two majoraspects  Processing : A set of processing elements has to be provided for the efficient execution of the functional tasks.  Communication mapping: The inter-task data exchange has to be mapped to a communication architecture.  Only a joint consideration of architectural choices in both areas bears the opportunity for near optimal quality of results.  Recent architectural advances offer a huge design space with enormous potential for optimization Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 7
  • 57.
     Bus paradigmas inherited from the PCB era constitutes the major power and performance bottleneck.  Chip-wide communication is envisioned to be handled by full-scale Network-on-Chip (NoC) architectures.  Network-on-Chip architectures  Resolve the physical issues  Address the functional aspects of on-chip communication. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 8
  • 58.
     So far,the dynamic priority based arbitration scheme of shared busses creates a mutual dependency between all components connected to the bus.  Due to this lack of traffic management capabilities every change in the traffic requirements of the application requires a re-design of the bus architecture.  Instead, NoC architectures take advantage of sophisticated networking algorithms to provide elaborated traffic-management capabilities.  By that, the ad-hoc communication mapping is replaced with a disciplined allocation of the required communication services and the on-chip network takes care to provide the required resources.  From the system architecture perspective, this separation of the offered communication services from the architectural resources can be considered as a virtualization of the actual communication architecture.  This virtualization effectively decouples the mapping problem for communication and computation.  The price to pay for the physical and functional benefits of NoC based communication is a significant penalty in terms of chip area as well as transfer latency.Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 9
  • 59.
     Programmable processing elementsachieve significant gains with respect to performance and computational efficiency by:  tailoring instruction set  micro architecture to the respective set of tasks  Examples are innovative architectures exploiting  Instruction Level Parallelism (ILP)  Data Level Parallelism (DLP)  Despite the increased computational performance, the effective performance is often constricted by the communication architecture, since memory accesses latency does not keep pace with the processing power. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 10
  • 60.
     General purposeprocessors resolve the memory access bottleneck by using sophisticated cache and memory hierarchies.  This is generally not applicable for embedded applications due to the poor memory locality of stream driven and packet based data processing.  Instead, processor architectures are equipped with hardware supported Multi-Threading (HW-MT) to perform task switches with virtually no performance overhead.  By that, the application inherent TLP is exploited with the purpose of hiding memory latency, which effectively leads to a significant increase in the processor utilization.  This technique is already widely employed in the network processor domain but recently finds its way into advanced multimedia and signal processing platforms.  In the light of the latency issue caused by NoC architectures, the importance of memory hiding techniques is likely to increase in the future. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 11
  • 61.
     Taking theabove considerations together, future SoCs can be considered as  NoC enabled multi-processor architectures.  On-chip communication backbone connects a large number of heterogeneous processing clusters and global storage elements.  Individual processing clusters consist of one or few application specific programmable kernels together with tightly coupled instruction and data memories as well as local peripherals. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 12
  • 62.
     To copewith the resulting design complexity:  Achieve virtualization of the architectural resources,  They can be allocated by the system architect in a deterministic way.  This virtualization is provided by  NoC approach for communication part  SW and HW operating systems for the control and data processing respectively.  Divide-and-conquer oriented design paradigm  Enables individual optimization of the architectural elements  The price for these benefits  A penalty in terms of chip area,  Generally considered to be of constantly decreasing importance. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 13
  • 63.
     HW/SW Co-designof a given embedded application is defined to  Architect a heterogeneous MP-SoC platform  Allocate the architectural resources for the execution of the application.  Architecture virtualization resolves the mutual dependencies in the mapping process  Trade-offs in the design space still require a joint consideration of application and architecture as well as communication Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 14
  • 64.
     For example: Latency of a more complex on-chip network can be compensated by either:  introducing memory hierarchy  employing hardware multi-threaded processor kernels.  Obviously, the resulting design space is virtually infinite  Architecting and the mapping phase cannot be considered independently without sacrificing quality of results. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 15
  • 65.
     What isneeded is:  A system level design methodology  Corresponding tool supported modeling framework  Transaction-Level Modeling (TLM)  Advocated by the SystemC language  The system level design paradigm  Already incorporated into state-of-the-art Electronic System Level (ESL) tools Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 16
  • 66.
     TLM greatlyimproves  modeling efficiency  simulation speed  Abstracts from  Low-level communication details of the Register Transfer Level (RTL),  To complete transaction  Is usually employed in a byte and cycle accurate fashion  We will look more at packet-level TLM paradigm  Cycle-level TLM is still too detailed to explore large design spaces. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 17
  • 67.
     Since communicationbecomes the driving design paradigm for MP-SoC  Exploration framework is based on a sophisticated, communication centric timing model:  Generic synchronization interface  Defines a concise set of communication primitives,  Follows the Open Core Open Core Protocol (OCP)  Not biased towards any specific communication architecture.  Additionally the primitives incorporate timing-annotation to achieve reasonable timing accuracy at the highly abstract packet-level TLM layer  The communication timing model captures the impact on performance of the interconnection architecture.  This communication timing model supports the full spectrum of available and proposed communication architectures ranging from today’s shared busses to the emerging NoC paradigm. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 18
  • 68.
     Implemented bymeans of a versatile modeling framework for architecture exploration and hardware/software partitioning  Key advantages:  Modeling efficiency  Higher simulation speed  A declarative specification mechanism for better design space exploration Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 19
  • 69.
     TLM isa method used for SoC Design  To specify at a higher level of abstraction  Involves Communication and Computation Architectures  Unified Timing Model aims to standardize the TLM approach Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 20
  • 70.
    Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd. 21
  • 71.
    Need to knowwhy before what & how!!! Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 22
  • 72.
     Networking Domain Multimedia Domain  Wireless Communications Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 23
  • 73.
     Constitutes implementation of networkingstandards  IEEE, ITU, ETSI, etc work out communication standards  The purpose of these standards to achieve a high degree of interoperability  ISO/OSI reference model has been providing a common terminology Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 24
  • 74.
     Networking layerstandards in the middle of the ISO/OSI stack address a multitude of higher layer application standards as well as lower physical/link layer standards  Major implementation challenge and effort is of the networking layer  Layer three multi-service access switches are considered as one of the potential killer applications for MP-SoC platforms, since they combine the physical wire speed throughput requirements with flexibility constraints imposed by the individual treatment of different service classes and application characteristics.  Today’s de facto networking layer standard is given by the rather simplistic Internet Protocol (IP).  Lower level layers are nowadays built in as ready made blocks  Physical & link layer data rates of core network equipment are imposing demanding performance requirements  Higher application layers are only present in the terminal devices,  So the relatively low to medium throughput requirements allow for a software implementation of the flexible and control dominated functionality. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 25
  • 75.
     Processing ofall kinds of media data  Pictures  Audio  Video decoding  Video pixel processing  2D/3D graphics  Standards enable the exchange of media data as well as device inter-operability  MOPS: Mega Operations Per second Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 26
  • 76.
     Advances inprocessing capabilities and multimedia algorithms together with increased user expectations fuels a constant proliferation of new multimedia standards  Digital audio decoding (AC3, OGG, MP3),  Video decoding (MPEG2, MEPEG4, H.263, H.264, DivX, quicktime)  3D graphic processing (DirectX 9)  Apart from the multitude and dynamics of multimedia standards, a flexible implementation platform is also mandatory to meet demanding cost constraints of converging consumer electronics devices such as the Advanced Set-Top Box (ASTB).  Here the processing and communication fabrics have to be shared among the multitude of supported multimedia applications to limit implementation cost. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 27
  • 77.
     Wireless communicationapplications aggressively use digital signal processing to maximize bandwidth efficiency  Again, a multitude of standards exists  Each marks a local optimum in  implementation cost  Mobility  power dissipation  performance bandwidth efficiency  Multimedia and wireless communication domains are converging into a new generation of Personal Digital Assistant (PDA) or SmartPhone devices  PDAs have started to support a huge variety of travel and fun related applications with much higher processing requirements, like e.g. localization, navigation, travel assistant, video camera, digital camera, picture editing, MP3 player or games  Additionally, this kind of portable, multimedia enabled PDA devices are obliged to support multiple communication standards, both cable (USB, FireWire) and wireless (3G, WLAN). Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 28
  • 78.
     Summary ofcommon trends:  New features and value added services: lead to exponentially increasing processing performance and communication requirements.  Standards become more dynamic and sophisticated and are introduced more rapidly: calls for high flexibility of the SoC implementation to meet the resulting time-in-market as well as time-in-market requirements.  For mobile applications and cost sensitive consumer electronic devices: energy efficiency becomes the prevailing cost factor  Heterogeneous Multi-Processor SoC (MP-SoC) platforms are generally believed to meet the above mentioned conflicting performance, flexibility and energy efficiency requirements of demanding embedded applications  Hence, in the course of an MP-SoC platform design the partitioning of a specific application is a task of major importance Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 29
  • 79.
     Main PartitioningPrinciple  Control dominated domain  Data dominated domain  This first order partitioning has major influence on both the target processing and communication elements as well as on the appropriate design methodology. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 30
  • 80.
    Created by SubhashIyer for Soft Polynomials (I) Pvt. Ltd. 31
  • 81.
     Examples Created bySubhash Iyer for Soft Polynomials (I) Pvt. Ltd. 32
  • 82.
     Control-plane processingis characterized by:  Moderate performance requirements,  Huge amounts of functionality  Calling for maximum flexibility  Developed using an  Integrated Design Environment (IDE) which is  Architecture agnostic  Software centric  Software engineering techniques  Object Oriented Programming (OOP) using  Unified Modeling Language (UML)  C++  Java Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 33
  • 83.
     To increasethe reuse of the control plane Software (across multiple MP-SoC platform generations):  Hardware dependant Software (HdS) portions are wrapped into:  stack of middleware  Real Time Operating System (RTOS)  device driver layers  Parallelism in Control Plane Processing:  Instruction Level Parallelism (ILP)  Extracted by a VLIW compiler  Or a superscalar processor architecture  Helps gain performance  Task Level Parallelism  Generally not possible due to huge amount of functionality Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 34
  • 84.
     Data-plane processingis characterized by:  Computationally intensive data manipulations  Performance at high data rates  Demand for high processing  Demand for high communication performance.  Rapidly evolving standards in all application domains impose increasing flexibility constraints. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 35
  • 85.
     Need toreach performance requirements of networking, multimedia and wireless communications applications  Requires aggressively exploiting abundant inherent parallelism available in data-plane processing tasks because:  Functionality can be straightforwardly partitioned into a set of loosely coupled tasks with well predictable or even cyclo- stationary execution timing  A well confined data set is associated with a single activation of an individual task.  Data sets associated with successive activations of an individual tasks are mostly independent.  These spatial and temporal properties with respect to second order task partitioning and data dependency can already be identified during the algorithm development stage and lead to an identification of coarse grain TLP.  This application inherent TLP enables the concurrent and parallel execution on MP-SoC platforms. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 36
  • 86.
    More about SoC design concepts next!
  • 87.
    The main aspects of SoC architectural elements
  • 88.
     Macroscopic metrics for the classification and evaluation of architectural elements:
       Cost
       Performance
       Power Dissipation
       Computational Efficiency
       Flexibility
  • 89.
     Cost of an embedded architecture is separated into:
       Non-Recurring Engineering (NRE) cost for the initial design
       Recurring chip fabrication cost
     NRE cost is caused by:
       Design effort for HW
       SW development
       Fabrication of the initial mask set
     Typical NRE cost for a 90 nm SoC:
       10-100 Million USD design effort
       1 Million USD per mask set
     Fabrication cost is determined by:
       Silicon die area
       Packaging
       Number of pins
       Power dissipation requirements
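
    The split into one-time NRE cost and recurring fabrication cost can be illustrated with a small amortization sketch. The per-chip fabrication cost and the production volumes below are hypothetical values chosen for illustration; only the order of magnitude of the NRE figures comes from the slide.

```python
def unit_cost(nre_usd, recurring_usd_per_chip, volume):
    """Per-chip cost: NRE amortized over the production volume plus the recurring cost."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_usd / volume + recurring_usd_per_chip

# Hypothetical figures: 20 M USD design effort plus a 1 M USD mask set,
# and an assumed recurring fabrication cost of 5 USD per chip.
NRE_USD = 20e6 + 1e6
for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10} chips: {unit_cost(NRE_USD, 5.0, volume):.2f} USD per chip")
```

    Under these assumptions the NRE share dominates at low volume and becomes small at high volume, which is why NRE cost matters most for low-volume products.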
  • 90.
     Performance of both computational and communication architectures is classified into:
       Latency
         Absolute time passing between the start and the completion of a task
       Throughput
         Number of accomplished tasks per unit of time
     Communication throughput is measured in bits per second (bps).
     Throughput of programmable processing elements is measured in Million Instructions Per Second (MIPS)
       The MIPS measurement is not very accurate
  • 91.
     Power dissipation:
       Measured in watts
       Denotes the energy per unit of time required to operate an embedded system
       Is an architecture metric of growing importance
         Battery lifetime of mobile devices depends directly on the energy consumption.
         Packaging cost depends on the heat dissipation properties, which in turn depend on the power consumption.
       Striving for low power and energy consumption constitutes the key driver for architecture differentiation of embedded SoC platforms
  • 92.
     Computational efficiency:
       Derived from performance and power consumption
       Characterizes the efficiency of a given architectural element with a single value
       Computational efficiency of programmable architectures is predominantly measured in MIPS/Watt.
       Alternatively measured in energy consumption per task (since the MIPS measurement is not very accurate)
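
    As a rough illustration of the two efficiency measures, the sketch below computes MIPS/Watt and energy per task for two processing elements. Both elements and all of their numbers are hypothetical; they are not taken from the slides.

```python
def mips_per_watt(mips, watts):
    """Computational efficiency as throughput divided by power."""
    return mips / watts

def energy_per_task_joules(watts, task_seconds):
    """Energy spent on one task: power multiplied by the task's execution time."""
    return watts * task_seconds

# Hypothetical processing elements: (MIPS, power in W, time for one task in s)
elements = {
    "general purpose core": (2000.0, 4.0, 1e-3),
    "domain specific core": (600.0, 0.3, 2e-3),
}
for name, (mips, watts, seconds) in elements.items():
    efficiency = mips_per_watt(mips, watts)
    energy_mj = energy_per_task_joules(watts, seconds) * 1e3
    print(f"{name}: {efficiency:.0f} MIPS/W, {energy_mj:.1f} mJ per task")
```

    In this made-up case the slower core needs twice as long for the task but still spends less energy on it, which is what the energy-per-task measure captures.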
  • 93.
     Flexibility:
       Related to the effort needed to change the functionality of a given architectural element
       In contrast to the previous metrics, flexibility can hardly be measured in an accurate way.
       Nonetheless, in the context of rapidly evolving functionality and standards of embedded applications, architectural flexibility is of major importance to achieve both a shorter time-to-market and a longer time-in-market
  • 94.
     A processing element (PE) provides the computational resource to execute a given portion of the application
       A dedicated hardware implementation yields the best performance
       Programmable PEs are controlled by an instruction stream in a highly flexible way
     The rather poor performance of programmable PEs has long fueled computer architecture research towards parallelizing the execution of instructions
     Early efforts in parallel computer architectures are classified according to the deployment of control- and data-level parallelism:
       SISD
       SIMD
       MIMD
       MISD
  • 95.
     SISD: Single Instruction Single Data
       Traditional von Neumann kind of computer architectures
       Sequentially execute a single instruction stream on a single processing resource
     SIMD: Single Instruction Multiple Data
       Vector processing machines
       Perform a single instruction on multiple data items in parallel
       Used in architectures for embedded DSP and graphics applications
       Exploit inherent data-level parallelism (DLP)
     MIMD: Multiple Instruction Multiple Data
       Traditional homogeneous multi-processor type of architectures
       Employed in scientific supercomputers
     MISD: Multiple Instruction Single Data
       Rarely encountered class of architectures
       Exploit temporal ILP by:
         Setting pipeline stages
         Executing several instructions simultaneously
  • 96.
     Superpipelining:  Usesdeep execution pipelines to increase the clock frequency  Superscalarity  Employs parallel functional units and complex dispatcher architectures to dynamically extract Instruction Level Parallelism (ILP)  Very Large InstructionWord (VLIW)  Execute several statically scheduled instructions on parallel functional units,  Hence the effort for ILP extraction is moved into the compiler  Hardware Multi-Threading (HW-MT)  Such architectures are able to concurrently pursue two or more threads of control by providing separate register resources for each thread context  Domain Specific (DS) Instruction Set  Tailors the programmable PE to a specific application domain  Provide specialized functional units.  DS processor examples are Digital Signal Processors (DSPs) employed in multimedia and wireless communications, or Network Processing Units (NPUs) for networking applications Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 47
  • 97.
     The applicability of the performance improvement techniques listed above depends on the considered set of target applications.
     Superpipelining and superscalarity are heavily used in high performance General Purpose Processor (GPP) architectures to increase the single-thread performance of arbitrary applications, at the vast expense of silicon area and power dissipation.
     Embedded applications, on the other hand, are severely energy and cost constrained, but still have significant performance and flexibility requirements.
     The most promising approach to jointly optimize flexibility and performance is to exploit coarse-grain TLP instead of ILP and to map the loosely coupled tasks to individually optimized PEs.
     Embedded PEs of this kind mostly rely on the more power-aware performance optimization techniques, like VLIW, multi-threading and a domain specific or even application specific instruction set.
  • 98.
     MIMD control parallelism plays an important role in embedded SoC architectures
     Parallel execution on specialized PEs offers:
       A chance to improve application performance
       Without sacrificing power efficiency
  • 99.
     Homogeneous multi-processing:
       Refers to the multiple instantiation of identical PEs
       Corresponds to a single-chip implementation of the MIMD principle
       Homogeneous multi-processing of general purpose embedded microcontrollers
         Achieves the performance scaling required for the control-plane processing portion of embedded applications
       Also found for data-plane processing in domain specific MP-SoC platforms, where the identical instruction set of the PEs is tailored to a certain application domain
  • 100.
     Heterogeneous multi-processing:
       Employs multiple different PEs
       Each PE is individually tailored to a certain task or task set
       Allows dedicated optimization
       Applicable to data-plane processing, as it allows for a manual and static task allocation
       The high degree of specialization in heterogeneous multi-processing further optimizes computational efficiency for a well defined set of target applications, at the expense of generality
  • 101.
     Parallel execution
       Requires multiple computational resources
       More than one task can be active at the same point in time.
     Concurrent execution
       Interleaved processing of several tasks on a single resource
       At any time only one task can be active
  • 102.
     The benefit of concurrent execution is depicted in the figure
       2 tasks are mapped to a single processing element
       Both tasks are divided into 2 processing portions
       These are separated by a communication request
       After Δt_delay the processing of the first portion is finished and the task is blocked for Δt_response until the request is accomplished.
       Instead of wasting the processor resource during this period, the processor context is swapped to the second task by a scheduler.
       The utilization of the processor is increased and the request latency is hidden
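
    The latency hiding described above can be reproduced with a small timing model. This is only a sketch under simplifying assumptions that are not in the slides: context switches cost nothing, and the time values are arbitrary units chosen for illustration.

```python
def makespan(tasks, interleave):
    """Total time to finish all tasks on one processing element.

    tasks: list of (first_portion, response_wait, second_portion) time values.
    """
    if not interleave:
        # Blocking execution: the PE idles during every response wait.
        return sum(p1 + wait + p2 for p1, wait, p2 in tasks)
    t = 0
    pending = []
    # Run all first portions back to back; each request is issued when its portion ends.
    for p1, wait, p2 in tasks:
        t += p1
        pending.append((t + wait, p2))
    # Run the second portions in order of response arrival, idling only if none is ready.
    for ready_at, p2 in sorted(pending):
        t = max(t, ready_at) + p2
    return t

tasks = [(3, 4, 2), (3, 4, 2)]           # hypothetical Δt_delay = 3, Δt_response = 4
busy = sum(p1 + p2 for p1, _, p2 in tasks)
for interleave in (False, True):
    total = makespan(tasks, interleave)
    print(f"interleave={interleave}: makespan {total}, utilization {busy / total:.0%}")
```

    With these numbers the blocking schedule takes 18 time units and the interleaved schedule takes 12, so the same amount of work is done with a higher processor utilization.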
  • 103.
  • 104.
    The main aspects of SoC on-chip communication elements
  • 105.
     The basic cost, performance, power, and flexibility metrics apply.
     Additionally, Quality of Service (QoS) metrics known from the networking application domain are of increasing importance to manage complex on-chip traffic
     The scalability of the communication architecture gains growing attention
  • 106.
     The bus-based on-chip communication paradigm is derived from the Printed Circuit Board (PCB) domain.
     Examples:
       VME (Versa Module Eurocard bus)
       PCI (Peripheral Component Interconnect)
     Advantages:
       Easy programming model
       High flexibility
       Abundant availability of Intellectual Property (IP)
     Suited for small and medium scale embedded systems where a small number of blocks exchange moderate amounts of data.
  • 107.
     Buses implement a master-slave communication scheme
       Active initiators along with passive target modules are hooked to a shared communication medium
     Typical masters:
       Processors
       DMA controllers
       Autonomous ASIC blocks
     Typical slaves:
       Memories
       Co-processors
       Other peripherals
     Other components:
       Arbitration units: grant access to the communication medium to one of the competing master modules
       Decoder units: activate the target module based on the actual address and the address map, which maps the target modules into the bus address space
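
    The role of the decoder unit can be sketched as a lookup in an address map. The address ranges and target names below are hypothetical and serve only to show the principle.

```python
# Hypothetical address map: (base address, last address, target module)
ADDRESS_MAP = [
    (0x0000_0000, 0x0FFF_FFFF, "memory"),
    (0x1000_0000, 0x1000_FFFF, "coprocessor"),
    (0x2000_0000, 0x2000_0FFF, "uart"),
]

def decode(address):
    """Return the target module whose address range contains the address, else None."""
    for base, last, target in ADDRESS_MAP:
        if base <= address <= last:
            return target
    return None  # no target mapped at this address

print(decode(0x1000_0004))  # falls into the coprocessor range
print(decode(0x3000_0000))  # unmapped address
```

    In hardware the same decision is made combinationally from the upper address bits; the loop here only models the mapping.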
  • 108.
     Bandwidth  Isthe premier performance metric  Denotes the maximum transfer capacity of the bus  Available bandwidth is measured in bits per second  Corresponds to the number of parallel data wires divided by the bus clock period Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 59
  • 109.
     Pipelining:  Wellknown technique to improve the communication throughput  Clock frequency is limited by the critical path  Inserting an additional pipeline stage into the critical path allows a higher clock frequency  Yields a higher communication bandwidth  Since the address decoder is usually integral part of the critical path, bus transactions in high performance buses are executed in separate address and data stages Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 60
  • 110.
     Burst modes:
       Improve communication throughput for the linear access of subsequent addresses by a single master
       The address counter is incremented automatically
       The next data item is transferred with every cycle without renewed arbitration
  • 111.
     Unidirectional data links
       Distinguish on-chip buses from most on-board buses
       The latter are based on tristate data wires to maximize the utilization of expensive on-board wires
  • 112.
     Hierarchy  Commonbus systems separate high performance from low performance communication  Two buses with different speed characteristics Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 63
  • 113.
     Multilayer bus architectures
       Provide dedicated point-to-point connections between distinct initiators and targets to eliminate bandwidth bottlenecks
       The required de-multiplexer at the initiator side is called the input stage; the respective multiplexer at the target side is called the output stage
  • 114.
     Crossbar bus architectures:
       Provide multiple parallel resources between initiators and targets
       Significantly improve the traffic throughput
       The degree of parallelism may vary from partial crossbar to full crossbar architectures, where the latter provides an individual resource for each connected target
  • 115.
     Arbitration:  Canbe based on various algorithms,  Simple round robin  Fixed, Configurable or dynamic priority schemes  Static or Dynamic Time Division Multiple Access (TDMA).  Even more advanced algorithms are known to further improve the quality of service. Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 66
  • 116.
     Locking of a bus by a single master
       Is a necessary feature to support read-modify-write kinds of semaphore operations.
       This feature is required by most microcontroller architectures that run operating systems
     Split transaction buses
       Allow the master to issue multiple requests without waiting for a response, i.e. request and response are separated
     Out-of-order execution
       Improves the bus throughput by reordering the sequence of responses, depending on the availability of the slave component
       This feature requires advanced state machines in the master modules to cope with the non-deterministic sequence of responses
  • 117.
  • 118.
     Physical issues
       Buses are implemented using a standard cell based semi-custom implementation flow
       Transmission wires are not physically optimized, which leads to timing closure issues and unreliable communication links.
       Examples of physical effects are crosstalk noise, electromagnetic interference, and radiation-induced charge injection
     Synchronous design
       Most current bus architectures require all connected modules to be in a single clock domain.
       Due to the parasitic capacitances of long bus wires, strong driver transistors are necessary to achieve timing closure
       This leads to increased power dissipation
       Future SoC designs will follow the Globally Asynchronous Locally Synchronous (GALS) paradigm
       Chip-wide wires will span multiple clock domains, which disqualifies bus architectures as the future chip-level transport mechanism
  • 119.
     Traffic management
       Due to their rather simple arbitration mechanisms, shared buses provide only rudimentary traffic management support.
       Since the communication pattern depends heavily on the spatial and temporal execution of the application tasks, meeting the individual QoS requirements of the respective tasks, like throughput, jitter, or ordering, is very challenging.
       This also causes the poor scalability of bus-based communication infrastructures, since every change in the traffic profile of one part of the application and every additional component influences the other parts and requires renewed balancing of the bus architecture.
     Interoperability
       Although simple standard peripherals, like DMA, IRC, or memories, are available for the respective bus systems, it is a tedious and error-prone task to adapt complex IP blocks to a specific bus architecture.
       So far, efforts to create standard bus interfaces have not been successful
  • 120.
     Alternative on-chip communication concepts
       The Network-on-Chip (NoC) design paradigm was formed to cope with the limitations of shared bus architectures
       It aims to replace the current ad hoc wiring of IP blocks with a disciplined approach where full-scale on-chip networks provide communication services according to the ISO/OSI reference model
       Problems in on-chip communication like signal integrity issues, link reliability, or Quality of Service (QoS) are resolved separately on the respective OSI layer
  • 121.
     The four lower layers of the ISO/OSI reference model are of interest
     Physical Layer
       Deals with the electrical aspects of the data transmission
       E.g. signal voltages, clock recovery, and pulse shape
     Data Link Layer
       Provides a reliable data transfer over the physical link.
       Error detection by means of block codes and error correction mechanisms like:
         Automatic Repeat Request (ARQ)
         Forward Error Correction (FEC)
     Network Layer
       Implements the arbitration algorithms, buffering strategies and flow-control mechanisms
       So, the network layer has a dominant impact on the performance and functional behavior of the network.
     Transport Layer
       Transport layer protocols establish and maintain end-to-end connections.
       The transport layer manages rate-based flow control, performs packet segmentation and reassembly, and ensures message ordering
       This abstraction hides the topology of the network and the implementation of the links that make up the network
  • 122.
  • 123.
     The challenge in the development of Network-on-Chip architectures is to combine the know-how from both the networking and the VLSI domain.
     The users of on-chip networks also have to understand basic networking principles:
       First, the system architect has to specify design-time parameters of the selected NoC architecture, like topology, buffer sizes, and arbitration algorithm.
       Later, the platform programmer has to configure runtime parameters, like priorities, routing tables, and buffer management thresholds, to take advantage of the capabilities
  • 124.
     The transport layer is the first to provide services which are independent of the implementation of the network
       Enables the platform programmer to develop embedded software independently from the interconnect architecture
       A key ingredient in tackling the challenge of decoupling computation from communication
       Interaction with the network becomes deterministic, rather than prognostic or reactive as in today's bus-based communication architectures
     For complex multi-hop networks it is difficult to provide uniform Quality of Service (QoS) guarantees, like lower bounds on bandwidth or packet ordering, for the complete on-chip traffic
     To combine high resource utilization with the high QoS requirements of certain traffic types, researchers in the field of computer networks distinguish between guaranteed service and best-effort service classes
  • 125.
     Guaranteed services
       Require resource reservation for worst-case scenarios
       Can be expensive, as guaranteeing the throughput for a stream of data implies reserving bandwidth for the peak throughput, even when its average is much lower.
       So, resources are often underutilized
     Best-effort services
       Do not reserve any resources, and hence provide no guarantees.
       Utilize resources well, as they are typically designed for average-case scenarios instead of worst-case scenarios.
       Are also easy to configure
       Require no resource reservation
       Main disadvantage: unpredictability of the effective performance
  • 126.
     The network layer is implemented by the routing nodes of the NoC.
     Router-based network implementations are classified by:
       Switching Mode
       Routing Mode
       Queuing
       Congestion Control
  • 127.
     Switching mode:
       Circuit switching
         Connections are set up by establishing a conceptual physical path from a source to a destination.
         Links can be shared between two connections only at different points in time, by using a time-division multiplexing (TDM) scheme
       Packet switching
         Data is divided into packets, and every packet is composed of a header and the payload.
         The header contains information that is used by the router to switch the packet to the appropriate output port
  • 128.
     Routing mode: applies to packet-switched networks and defines the way packets are transmitted and buffered between network nodes
       Store-and-forward
         An incoming packet is received and stored entirely before it is forwarded to the next node.
       Wormhole routing
         An incoming packet is forwarded as soon as the packet header is evaluated, without waiting for the complete packet to arrive.
         In case the next hop is blocked, the packet tail potentially blocks other resources
       Virtual cut-through
         An incoming packet is forwarded as soon as the next router guarantees that the complete packet will be accepted.
         In case the next hop is blocked, the packet tail is stored in a local buffer
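
    The difference between store-and-forward and the two cut-through style modes shows up in the latency of an unblocked packet. The model below is a simplified sketch: it assumes one flit (link-width unit of a packet) per cycle on every link, no contention, and no router processing delay. These assumptions and the example sizes are not from the slides.

```python
def store_and_forward_cycles(hops, packet_flits):
    """Each hop must receive the whole packet before it starts forwarding it."""
    return hops * packet_flits

def cut_through_cycles(hops, packet_flits, header_flits=1):
    """Only the header is delayed per hop; the rest of the packet follows in a pipeline."""
    return hops * header_flits + (packet_flits - header_flits)

# Assumed example: a 16-flit packet with a 1-flit header crossing 4 links.
print("store-and-forward:", store_and_forward_cycles(4, 16), "cycles")
print("cut-through:      ", cut_through_cycles(4, 16), "cycles")
```

    Without blocking, wormhole routing and virtual cut-through behave the same in this model; they differ only in where the packet tail waits when the next hop is blocked.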
  • 129.
     Queuing: buffering strategies can be distinguished by the location of the buffers inside the router.
     In the following, N denotes the number of bi-directional router ports.
     Input queuing:
       A router has a single input queue for every incoming link.
       Suffers from the so-called head-of-line blocking problem, i.e. the router utilization saturates at about 59%
       Weak link utilization
     Output queuing:
       There are N output queues for every outgoing link, resulting in N² queues.
       Yields optimal performance
       The costly N²-fold storage and wiring effort prohibits the implementation for a large number of ports
     Virtual output queuing (VOQ):
       Combines the advantages of input queuing and output queuing
       Avoids the head-of-line blocking problem
       Each input port maintains a separate queue for each output port
       The key factor in achieving high performance using VOQ switches is the scheduling algorithm
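
    Head-of-line blocking and its avoidance by virtual output queuing can be shown with a one-cycle model of a router. This is a sketch only: the greedy scheduler in `voq_cycle` is a stand-in chosen for brevity, not a scheduling algorithm named in the slides, and both function names are made up for this example.

```python
from collections import deque

def input_queued_cycle(queues, busy_outputs):
    """FIFO input queuing: only the packet at the head of each input queue may leave."""
    taken, sent = set(busy_outputs), []
    for port, queue in enumerate(queues):
        if queue and queue[0] not in taken:
            taken.add(queue[0])
            sent.append((port, queue.popleft()))
    return sent

def voq_cycle(voqs, busy_outputs):
    """Virtual output queuing: voqs[port][output] counts packets waiting for that output."""
    taken, sent = set(busy_outputs), []
    for port, per_output in enumerate(voqs):
        for output, waiting in enumerate(per_output):
            if waiting > 0 and output not in taken:
                per_output[output] -= 1
                taken.add(output)
                sent.append((port, output))
                break  # each input sends at most one packet per cycle
    return sent

# Input 0 holds a packet for output 1 (currently busy) followed by one for output 2.
print(input_queued_cycle([deque([1, 2])], busy_outputs={1}))  # the head blocks the queue
print(voq_cycle([[0, 1, 1]], busy_outputs={1}))               # output 2 is still served
```

    With a single FIFO the packet for the free output 2 is stuck behind the blocked head; with one queue per output it can be sent in the same cycle.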
  • 130.
     Congestion control: packet-switched networks without mechanisms for bandwidth reservation may run into resource contention and subsequent buffer overflow.
     Several solutions prevent packets from entering until contention is reduced:
       Packet discarding: simply drops packets in case of buffer overflow
       Credit-based flow control: packet loss is prevented in a deterministic way, either by signaling congestion via separate wires (back-pressure) or by having the receiver regularly inform the sender about the available buffer space (window).
       Rate-based flow control: the sender gradually adjusts the traffic generation rate in response to flow-control messages from the receiver. Rate-based flow control has to be implemented by the transport layer and potentially suffers from instability due to long control loops
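
    Credit-based flow control in its window form can be sketched as a counter of free receiver buffer slots that the sender must check before transmitting. The class and method names are hypothetical, and the credit return is modeled as instantaneous, which a real link would not guarantee.

```python
from collections import deque

class CreditLink:
    """Sender-side view of a link whose receiver has a fixed number of buffer slots."""

    def __init__(self, receiver_buffer_slots):
        self.credits = receiver_buffer_slots  # free slots the sender knows about
        self.buffer = deque()                 # packets waiting at the receiver

    def send(self, packet):
        """Transmit only if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False  # no packet is dropped, it simply is not sent yet
        self.credits -= 1
        self.buffer.append(packet)
        return True

    def consume(self):
        """Receiver drains one packet and returns a credit to the sender."""
        if not self.buffer:
            return None
        self.credits += 1
        return self.buffer.popleft()

link = CreditLink(receiver_buffer_slots=2)
print([link.send(p) for p in ("a", "b", "c")])  # the third send has no credit
print(link.consume(), link.send("c"))           # a freed slot lets "c" through
```

    The buffer can never overflow because the number of packets in flight is bounded by the number of credits, which is the deterministic loss prevention mentioned above.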
  • 131.
  • 132.
     Architectural trends
       Set the stage for the discussion of appropriate system level design methodologies
     Processing elements
       Requirements for performance, power efficiency and flexibility
       SIMD, VLIW, super-pipelining, and hardware multi-threading exploit application-inherent instruction-, data-, and task-level parallelism
     Communication: bus architectures vs. Network-on-Chip
     Virtualization of architectural resources enables "divide-and-conquer"
       Embedded control-plane processing tasks are executed in the user space of the Real Time Operating System (RTOS)
       Embedded data-plane processing tasks are executed on HW multi-threaded processing elements
       Global communication of control- and data-plane processing elements is performed by elaborate on-chip networks
  • 133.
  • 134.
    Delivered by: Subhash Iyer, Program Head, Soft Polynomials (I) Pvt. Ltd., Nagpur (CDAC ATC)
  • 135.
    High Level Synthesis
    Low Power Design
  • 136.
     At each level of circuit abstraction, the circuit is equivalent and performs the same target operation, but its structural components (and hence the components' granularity) are different, and the design issues may be different
  • 137.
  • 138.
     System level: highest level of circuit abstraction
       The system is specified as processes and tasks
       A mix of hardware and software
       Concerned with the overall system structure and information flow
       Computer systems are described as an interconnected set of processors, memories and switches
     Behavioral level, algorithmic level or high level
       Also called the instruction set level.
       The focus is on the computations performed by an individual processor, i.e. the way it maps sequences of inputs to sequences of outputs
     Architecture, microarchitecture, RTL
       The system is viewed as a set of interconnected storage elements and functional blocks.
       The behavior of the system is described as a series of data transfers and transformations between the storage elements
       The microarchitectural-level representation of the chip resources, such as adders and subtractors, is determined along with decisions such as single-cycle, multicycle, pipelined or superscalar implementation
  • 139.
     Logic level
       The system is described as a network of gates and flip-flops
       Behavior is specified by logic equations
       The circuit is represented in the form of a netlist, at which level the logic realizations of functional blocks are determined
     Circuit or transistor level
       The circuit is a netlist of transistors
       Decisions such as how and what types of transistors will be used (complementary CMOS, pass transistors, etc.) are the main issues
     Physical or layout level
       The system is specified in terms of the individual transistors of which it is composed
       The behavior of the system can be described in terms of the network equations
       Lowest level of circuit abstraction
       The chip is a sequence of layers (masks), each of which is composed of polygons.
       It is this level that is transferred to the manufacturing process
  • 140.
     Design automation terminology:
       Optimization
       Synthesis
       Analysis
     In circuit analysis, the behavior or characteristics of a circuit are studied
     The task of synthesis is to take the specification of the behavior required of a system, together with a set of constraints and goals to be satisfied, and to find a structure that implements the behavior while satisfying the goals and constraints
     Behavior, structure and physical design: the 3 domains in which hardware is described
       "Behavior"
         Refers to the ways in which the system or its components interact with their environment (mapping from inputs to outputs)
         The interest is in what a design does, not in how it is built
       "Structure"
         Refers to the set of interconnected components that constitute the system (described by a netlist)
         The focus is on constraints, such as area, cost and delay.
       "Physical" design
         Mapping of the structure onto the technology
         Ignores what the design is supposed to do and binds its structure in space or to silicon
  • 141.
     The automatic design process of VLSI circuits is called synthesis
  • 142.
     The system-synthesis process partitions the tasks into hardware, software and their communications
     The high-level synthesis process is the translation from a behavioral description to its equivalent structural description
     Logic synthesis is the process of mapping the design at the RTL to a gate-level representation that is suitable as input to physical design
  • 143.
     Physical design then addresses aspects of chip implementation:
       Floor planning
       Placement
       Routing
       Extraction
       Performance analysis
     The output of physical design is the handoff ("tapeout") to manufacturing
       A generalized data stream (GDSII) stream file
     Verification of correctness
       Design rules
       Layout versus schematic
       Constraints (timing, power, reliability, etc.)
  • 144.
     During each phase of the synthesis process, the functional equivalence of two consecutive phases is to be checked to ensure that they are functionally the same
     A power and timing analysis study can be done by using compact models at the transistor level
     At the physical level, more accurate power and timing analysis is possible through the extraction of accurate parasitics
  • 145.
     High-level synthesis is the translation process from a behavioral description to a structural description
  • 146.
     High-level synthesis (HLS) is analogous to "compilation", which translates a high-level language program in C/C++ to an assembly language program
     HLS is also known as behavioral-level synthesis or algorithmic-level synthesis.
     Constraints to be considered in HLS are:
       Area
       Performance
       Power consumption
       Reliability
       Testability
       Cost
     HLS allows a design engineer to make decisions at an early stage of the design cycle, thus ensuring a correct design.
     Typical steps involved are scheduling, binding, allocation, etc.
  • 147.
     Advantages:  Continuousand reliable design flow  From system-level abstraction to RTL abstraction automatically without manual handling  Automatic translations from high-level specifications in the form of C or SystemC to RTL description of the circuit in the form of VHDL or Verilog.  Shorter design cycle  More automation: faster designs, lesser cost  Fewer errors  Synthesis process can be verified easily, so the chances of errors will be smaller.  Correct design decisions at the higher levels of circuit abstraction can ensure that the errors are not propagated to the lower levels, which are too detailed and costly to correct  Easy and flexible to search the design space  Synthesis system can produce several designs in a short time  So, the designer has more flexibility to choose the proper design considering different trade-offs of power, leakage, area and delay.  Balanced degree of freedom for power optimization  Power and performance optimization can be performed at any level of circuit abstraction  As the level of abstraction goes lower, the complexity of the circuit increases  Additionally, the degrees of freedom, and thus power reduction opportunities decrease  Hence, high level or behavioral level is an attractive level and provides a balanced degree of freedom for design space exploration.  Documenting the design process  Automated system can track design decisions and their effects  Design debugging and continuation by third parties can be easily done  Useful for macrocell-based design and the sale of designs as intellectual property cores  Availability of circuit technology to more people  Design expertise is moved into synthesis systems  It becomes easier for a non-expert to produce a chip that eets a given set of specifications  Cost of manpower required reduces Created by Subhash Iyer for Soft Polynomials (I) Pvt. Ltd. 14
  • 148.
     The high-level synthesis process takes a system description in the form of a hardware description language (HDL) as input and generates an optimal RTL description by:
       Compilation
       Transformation
       Scheduling
       Allocation
       Binding
     Other steps
       Power optimization
       Leakage optimization
       Register optimization
       Interconnect optimization
       Take place in synthesis either sequentially or along with the fundamental steps
     No fixed sequence for performing the various high-level synthesis tasks
       They are independent of each other
       Yet, these tasks should be performed simultaneously for effective optimization
  • 149.
      The behavior of a system to be synthesized is usually specified at the algorithmic level using a high-level programming language like C/C++ or a hardware description language (HDL) such as VHDL or Verilog.
      The behavior of the system is then compiled into internal representations, which are usually data flow graphs (DFGs) and control flow graphs (CFGs).
      Each behavioral specification is transformed into a unique graphical representation.
      The DFG is a directed graph that represents data movement, whereas the CFG is a directed graph that indicates the sequence of operations.
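As a rough illustration of the compilation step described above, the sketch below turns a few straight-line assignments into a small data flow graph. It is a toy, not how any particular HLS tool works: the function name `build_dfg`, the node names (`op0`, `op1`, ...) and the restriction to binary arithmetic are assumptions made for this example, and Python's `ast` module stands in for a real C or HDL front end.

```python
import ast


def build_dfg(source):
    """Build a toy data flow graph from straight-line assignments.

    Returns (nodes, edges): nodes maps an operation id to its operator
    name, and each edge (producer, consumer) shows a value moving from
    a primary input or an earlier operation into an operation.
    """
    nodes, edges = {}, []
    produced_by = {}  # variable name -> id of the operation that defines it

    def visit(expr):
        if isinstance(expr, ast.Name):
            # A variable is either an earlier result or a primary input.
            return produced_by.get(expr.id, expr.id)
        if isinstance(expr, ast.Constant):
            return repr(expr.value)
        if isinstance(expr, ast.BinOp):
            left, right = visit(expr.left), visit(expr.right)
            op_id = f"op{len(nodes)}"
            nodes[op_id] = type(expr.op).__name__
            edges.extend([(left, op_id), (right, op_id)])
            return op_id
        raise ValueError("only names, constants and binary operators are supported")

    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign) or not isinstance(stmt.targets[0], ast.Name):
            raise ValueError("only simple assignments are supported")
        produced_by[stmt.targets[0].id] = visit(stmt.value)
    return nodes, edges


# Hypothetical input: two assignments, the second consuming the first.
nodes, edges = build_dfg("t = a + b\ny = t * c")
```

Here `edges` records that the result of the addition (`op0`) feeds the multiplication (`op1`). A CFG would be needed in addition once the description contains branches or loops, which this sketch does not handle.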
  • 150.
      In the transformation step, the initial DFG is transformed so that the resultant DFG is more suitable for scheduling and allocation.
      These transformations include compiler-like optimizations such as dead-code elimination, common sub-expression elimination, loop unrolling, constant propagation and code motion.
      In addition, some hardware-specific transformations, such as minimization of syntactic variances and retiming, may be applied to take advantage of the associativity and commutativity of certain operations.
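To make one of these transformations concrete, here is a minimal sketch of common sub-expression elimination that also exploits commutativity. The tuple format `(dest, operator, src1, src2)`, the function name and the single-assignment assumption are choices made for this illustration only.

```python
def eliminate_common_subexpressions(ops):
    """Remove operations that recompute an already available value.

    ops is a list of (dest, operator, src1, src2) in execution order.
    Assumes every destination is assigned exactly once.
    Returns (remaining_ops, alias), where alias maps each removed
    destination to the destination that already holds the same value.
    """
    available = {}  # (operator, src1, src2) -> dest holding that value
    alias = {}
    remaining = []
    for dest, operator, a, b in ops:
        # Use the surviving name for operands that were eliminated earlier.
        a, b = alias.get(a, a), alias.get(b, b)
        # For commutative operators, order the operands canonically so that
        # a + b and b + a are recognised as the same expression.
        if operator in ("+", "*") and a > b:
            a, b = b, a
        key = (operator, a, b)
        if key in available:
            alias[dest] = available[key]
        else:
            available[key] = dest
            remaining.append((dest, operator, a, b))
    return remaining, alias


# Hypothetical input: t2 recomputes t1 with swapped operands.
ops = [("t1", "+", "a", "b"), ("t2", "+", "b", "a"), ("y", "*", "t1", "t2")]
remaining, alias = eliminate_common_subexpressions(ops)
```

In this example one adder operation disappears from the DFG, so the later scheduling and allocation steps have less work to place.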
  • 151.
      Scheduling is the process of partitioning the set of arithmetic and logical operations in the DFG into groups so that the operations in the same group can be executed concurrently, while taking into consideration possible trade-offs between the total execution cost and hardware cost.
      A group of concurrent computations to be executed simultaneously is referred to as a control step.
      The total number of control steps needed to execute all operations in the DFG, the minimum number of functional units of each type to be used in the design, and the lifetimes of the variables generated during the computation of operations are determined in the scheduling step.
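A simple way to see how operations end up in control steps is as-soon-as-possible (ASAP) scheduling, sketched below. ASAP is not named on the slide; it is used here only as a well-known baseline, under the assumptions that every operation takes one control step and that resources are unlimited. The operation names are hypothetical.

```python
def asap_schedule(predecessors):
    """Assign each operation to the earliest possible control step.

    predecessors maps an operation to the operations whose results it
    needs. The graph must be acyclic. Steps are numbered from 1 and
    every operation is assumed to take exactly one step.
    """
    step = {}

    def visit(op):
        if op not in step:
            # An operation runs one step after its latest predecessor.
            step[op] = 1 + max((visit(p) for p in predecessors[op]), default=0)
        return step[op]

    for op in predecessors:
        visit(op)
    return step


# Hypothetical DFG: three multiplications, one addition, one subtraction.
dfg = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "s1": ["a1"]}
steps = asap_schedule(dfg)
total_control_steps = max(steps.values())
```

Because ASAP ignores hardware cost, all three multiplications land in the first control step. A resource-constrained scheduler would trade extra control steps for fewer multipliers, which is the execution cost versus hardware cost trade-off mentioned above.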
  • 152.
      Selection is the process of choosing resources from the library, which involves trade-offs according to different features like delay, area, power and leakage.
      Resource allocation is the process of determining the number of functional units of each type for performing operations, memory units (registers) for storing data values and interconnects for data transportation.
      Often, the selection and allocation processes are a single task.
      Allocation is further divided into sub-tasks, such as functional unit allocation, memory unit allocation and interconnect allocation.
      Resource allocation and binding may share resources so that the same hardware can be used to execute different operations or so that the same register can be used to store more than one variable.
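For functional unit allocation, a lower bound on the number of units of each type follows directly from a schedule: it is the largest number of operations of that type that share a control step. The sketch below computes this bound; the schedule, the operation types and the function name are made up for the example.

```python
from collections import Counter


def functional_units_needed(schedule, op_type):
    """Return the minimum number of functional units per type.

    schedule maps an operation to its control step; op_type maps an
    operation to the kind of unit it needs. Operations in different
    control steps can share a unit, operations in the same step cannot.
    """
    usage = Counter((step, op_type[op]) for op, step in schedule.items())
    needed = {}
    for (_, kind), count in usage.items():
        needed[kind] = max(needed.get(kind, 0), count)
    return needed


# Hypothetical schedule: three multiplications in step 1, then add, then subtract.
schedule = {"m1": 1, "m2": 1, "m3": 1, "a1": 2, "s1": 3}
op_type = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "s1": "sub"}
units = functional_units_needed(schedule, op_type)
```

Moving `m3` to step 2 in this example would reduce the multiplier count from three to two without lengthening the schedule, which shows why scheduling and allocation are usually considered together.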
  • 153.
      Binding or assignment is the process of assigning operations to functional units, variables to memory units and data transfers to interconnections.
      Binding is further divided into several sub-tasks, such as functional unit binding, memory unit binding and interconnect binding.
      Functional unit binding involves the mapping of operations in the behavioral description onto a set of selected functional units.
      Memory unit binding maps data carriers (constants, variables, arrays) in the behavioral description onto storage elements (read-only memories, registers, memory units) in the data path.
      The interconnect binding task maps every data transfer in the behavior onto a set of interconnection units for data routing.
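Memory unit binding can be illustrated with variable lifetimes: two variables may share a register when their lifetimes do not overlap. The sketch below uses a greedy interval-packing heuristic in the spirit of the classic left-edge algorithm. This choice of algorithm, the function name and the lifetimes are assumptions for illustration, not something the slides prescribe.

```python
def bind_registers(lifetimes):
    """Bind variables to registers so that non-overlapping lifetimes share.

    lifetimes maps a variable to (first_step, last_step), inclusive.
    Variables are processed in order of their first step and placed in
    the first register that is already free at that step.
    """
    busy_until = []  # busy_until[r] = last control step in which register r is occupied
    binding = {}
    for var, (start, end) in sorted(lifetimes.items(), key=lambda item: item[1][0]):
        for reg, last in enumerate(busy_until):
            if last < start:  # register is free before this variable is born
                busy_until[reg] = end
                binding[var] = reg
                break
        else:
            busy_until.append(end)  # no free register: allocate a new one
            binding[var] = len(busy_until) - 1
    return binding


# Hypothetical lifetimes taken from some schedule.
lifetimes = {"a": (1, 2), "b": (1, 3), "c": (3, 4), "d": (4, 5)}
binding = bind_registers(lifetimes)
register_count = len(set(binding.values()))
```

Four variables fit into two registers here because `c` reuses the register of `a`, and `d` reuses the register of `b`.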
  • 154.
      In the output generation phase, the design output is generated.
      The output should be in a form such that logic-level synthesis tools can optimize the combinational logic and layout synthesis tools can design the chip geometry.
      The generated output is generally in a low-level HDL, such as structural VHDL.
  • 156.
      Data path synthesis
      Control synthesis
         The controller is typically a finite state machine that is either microcoded or hardwired
  • 157.
      HLS is important for several reasons:
         Reduction of design cycle time
         Rapid design space exploration at the higher level of abstraction
         Wrong decisions are not propagated to lower levels of design abstraction
      HLS involves several important steps, such as:
         Scheduling
         Allocation
         Binding
      Several graph-theoretical algorithms are available that can perform optimization while performing these tasks.
      Two types of synthesis:
         Data path synthesis
         Control synthesis
      There are existing tools that perform high-level synthesis explicitly, and some tools perform the behavioral-to-RTL compilation as an intermediate process.
  • 160.
    Introduction to SoC Design Methodology
  • 161.
      Design flow of integrated circuits
         Application phase
         Implementation phase
         Both are decoupled
      Application to implementation
         A specification document written by:
            Application team
            System architecture specialist
         Ad-hoc and informal approach
  • 162.
      Problems
         Ambiguity of the informal specification document leads to misinterpretations and implementation errors
         Lack of reliable performance information before the implementation often causes over- or under-provisioning of processing and communication resources
         Quality of results mainly depends on the intuition and experience of the system architect
         Manual creation of the verification environment requires significant effort and is another potential source of inconsistencies with the original design intent
  • 163.
      Electronic System Level (ESL)
         The application is considered jointly with the system architecture to find a feasible and cost-effective application-to-architecture mapping
         The declared goal of ESL design is to increase engineering productivity and quality of results during the specification of the MP-SoC platform architecture and the application mapping
  • 164.
      Platform-based design is a new design paradigm to cope with the:
         complexity
         economics
        of the emerging billion-transistor System-on-Chip era.
      Architecture-centric definition
         We define platform-based design as the creation of a stable microprocessor-based architecture that can be rapidly extended, customized for a range of applications, and delivered to customers for quick deployment
      Design-process-based definition
         The general definition of a platform is an abstraction layer in the design flow that facilitates a number of possible refinements into a subsequent abstraction layer in the design flow
  • 165.
      Multiple, almost orthogonal phases
      Functional phase
         Performed by application specialists
         Completely agnostic to architectural considerations
         Includes:
            Embedded SW development of the control-plane portion
            Data-plane algorithm development
         The latter is carried out using highly application-domain-specific tools and methodologies
      MP-SoC platform phase
         All design tasks that have to be performed under consideration of the full functional and architectural complexity of the MP-SoC platform
         Examples:
            Specification of the system architecture
            Mapping of the application onto the MP-SoC platform
            Development of the hardware-dependent software layers
      High-level IP creation phase
         Design of processing elements (RISC, DSP, MCU, ASIPs)
         On-chip interconnect technologies (buses, NoC)
         Domain-specific standard I/O (PCI variants, SPIx variants, HyperTransport, I2C, FireWire, QDR, etc.)
         Creation of well-defined ASIC IP blocks (e.g. an MPEG4 video codec)
         Not completely orthogonal to the functional phase, since the design of application-specific processing elements and communication IP depends on the considered application
      Semiconductor technology and basic IP creation phase
         Covers standard cells, I/O, memories and the basic technology processes supporting them
         More heterogeneous technologies, combining embedded DRAM, embedded Flash, mixed-signal BiCMOS, RF, and analog
         More to do with fabrication technologies
  • 166.
      Represent the results of the functional phase as a well-defined application model, which serves as the executable specification of the system
      The system architecture needs to be defined in terms of mapping the application model to the hardware (main task)
      Embedded SW development
      Hardware-software co-verification task: RTL is verified along with the embedded software
      Methodology used: Transaction Level Modeling (TLM)
  • 167.
      Engineering of integrated circuits has always employed models on different levels of abstraction
      Model: a unique, idealized description of the considered system
      The degree of abstraction characterizes the type of model used in the respective design phase
      The goal of abstraction is to provide a description of the system which is simple enough, yet sufficiently accurate, to:
         enable the necessary investigations
         take design decisions
         proceed to the next design phase
      Indeed, the design flow of an embedded system can be considered as a sequence of steps which successively reduce the degree of abstraction in the system model
  • 168.
      Functionality refers to the modeling of the system behavior
         On the highest level of abstraction, the functionality is condensed to pure mathematical expressions
         Later, the functionality is refined to operators
         Finally, it is mapped to logic gates
      The timing model captures the temporal properties of the system
         The degree of abstraction ranges from causality of events to physical timing of transistors and wires
      Data representation
         At higher levels, data resolution is reduced to tokens and Abstract Data Types (ADTs)
         Lower levels employ word or bit representations
      The component granularity describes the finest resolution of the sub-blocks
         First, the component resolution is restricted to coarse-grain building blocks
         Finally, the complete embedded system is composed of fine-grain silicon transistors
  • 169.
      Creation of a system model requires:
         A modeling language
         Well-defined execution semantics coordinating the activation of the individual blocks
      A Model of Computation (MoC) is composed of two parts:
         The coordination language describes the basic execution semantics with respect to properties like parallelism, synchronism and reactivity, and provides the abstracted communication mechanism
         The host language provides the language elements for the specification of the system models
  • 170.
      Timed MoCs are characterized by the total temporal ordering of all occurring communication events
      An example is the discrete event simulation MoC, which defines the execution semantics for HDL simulators
      Further examples of timed MoCs are synchronous languages like Esterel, Lustre, or Signal, where the events of all communication signals are constrained to occur at identical time stamps
      Thanks to their sound mathematical foundation, synchronous languages have gained adoption for the specification, analysis and code generation of reactive, control-dominated applications
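The total ordering of events in a discrete event MoC can be sketched with a time-ordered event queue. The kernel below is a deliberately minimal illustration and not the scheduler of any real HDL simulator: the class name, the sequence counter used to break ties between events with the same time stamp, and the absence of delta cycles are all simplifying assumptions.

```python
import heapq


class DiscreteEventKernel:
    """Minimal event-driven kernel: events execute in time-stamp order."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._sequence = 0  # insertion counter, breaks ties between equal time stamps

    def schedule(self, delay, action):
        """Schedule action to run delay time units after the current time."""
        heapq.heappush(self._queue, (self.now + delay, self._sequence, action))
        self._sequence += 1

    def run(self):
        """Pop events in (time, insertion) order until the queue is empty."""
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()


# Hypothetical usage: three events, two of them at the same time stamp.
kernel = DiscreteEventKernel()
log = []
kernel.schedule(5, lambda: log.append(("b", kernel.now)))
kernel.schedule(2, lambda: log.append(("a", kernel.now)))
kernel.schedule(2, lambda: log.append(("c", kernel.now)))
kernel.run()
```

Every event receives a unique position in the execution order, even when two events share a time stamp, which is what distinguishes this timed model from the partially ordered untimed MoCs.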
  • 171.
      Untimed MoCs are characterized by the fact that communication events are only partially ordered
      Various untimed MoCs are popular for the specification of both data-dominated and control-dominated applications
      Data-flow MoCs are heavily employed for algorithmic modeling and analysis of signal processing applications
      Communicating Sequential Processes (CSP) and the Calculus of Communicating Systems (CCS) are prominent untimed MoCs which are based on sequential processes that communicate using a rendezvous communication mechanism
  • 172.
      The definition of a proper MoC has long been considered to be the silver bullet for system level design, and thereby for solving the design productivity crisis
      Initially, the complete system functionality is to be created using the ideal MoC, which provides the highest modeling efficiency, simulation speed, and smooth IP reuse
      Next, the initial specification would be automatically verified using formal verification technology and metrics like determinism, causality, deadlock absence, consistency, completeness, and fairness. The golden system specification would then provide the foundation for an automated path to design space exploration to take functional and architectural design decisions
      Finally, system level synthesis would be applied to the partitioned system specification, providing an automated path to implementation
  • 173.
      Object Oriented Programming (OOP) is a powerful abstraction mechanism
         Data and functionality are partitioned and encapsulated inside classes
         OOP-based languages: UML, C++, or Java
         Widely adopted in the engineering of arbitrary SW
         Gaining importance for the specification of embedded control-plane processing
      OOP components interact primarily by sequentially transferring control through method calls
      The sequential nature of OOP hinders the intuitive specification, analysis and refinement of the inherently parallel data-plane processing tasks
  • 174.
      For this purpose the actor-oriented abstraction scheme has been conceived, where parallel objects interact by sending and receiving messages
      Within an actor-oriented design environment, the designer can focus on the specification and analysis of the algorithmic behavior of the individual tasks, whereas the communication and synchronization aspects are handled by the underlying parallel Model of Computation
      SystemC allows actor-oriented programming
  • 175.
      Actor-based design languages achieve high modularity in communication modeling by using the Interface Method Call (IMC) principle
      The IMC mechanism is realized by a set of language elements for:
         Modules
         Ports
         Interfaces
         Channels
      Processes modeling the behavior are wrapped into modules and access communication services through ports
      Available methods are:
         Declared in the interface specification
         Implemented by the channel
      Thus the access methods in an interface reflect the specialized properties of the communication style implemented by a particular channel
      Actor-oriented design languages offer a generic Model of Computation, which in the case of SystemC is based on an event-driven simulation kernel
      Channels serve as containers for communication and synchronization
      The user can extend the generic MoC by creating their own methodology-specific channel library
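The roles of modules, ports, interfaces and channels described above can be mimicked in a few lines of plain Python. This is only an analogy to show the separation of interface and behavior: it is not SystemC, there is no simulation kernel, and all class names (`WriteInterface`, `ReadInterface`, `FifoChannel`, `Port`, `Producer`, `Consumer`) are invented for the example.

```python
from abc import ABC, abstractmethod
from collections import deque


class WriteInterface(ABC):
    """Interface: only declares which methods a channel offers."""
    @abstractmethod
    def write(self, value): ...


class ReadInterface(ABC):
    @abstractmethod
    def read(self): ...


class FifoChannel(WriteInterface, ReadInterface):
    """Channel: implements the interface methods and holds the data."""
    def __init__(self):
        self._items = deque()

    def write(self, value):
        self._items.append(value)

    def read(self):
        return self._items.popleft()


class Port:
    """Port: typed by an interface and bound to a channel that implements it."""
    def __init__(self, interface):
        self.interface = interface
        self.channel = None

    def bind(self, channel):
        if not isinstance(channel, self.interface):
            raise TypeError("channel does not implement the required interface")
        self.channel = channel


class Producer:
    """Module: its process reaches the channel only through the port."""
    def __init__(self):
        self.out = Port(WriteInterface)

    def process(self, values):
        for value in values:
            self.out.channel.write(value)


class Consumer:
    def __init__(self):
        self.inp = Port(ReadInterface)

    def process(self, count):
        return [self.inp.channel.read() for _ in range(count)]


# Hypothetical usage: both modules are bound to the same channel instance.
channel = FifoChannel()
producer, consumer = Producer(), Consumer()
producer.out.bind(channel)
consumer.inp.bind(channel)
producer.process([1, 2, 3])
received = consumer.process(3)
```

Swapping `FifoChannel` for another channel that implements the same interfaces would change the communication style without touching `Producer` or `Consumer`, which is the modularity benefit the IMC principle aims for.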
  • 176.
      Challenge of System Level Design (SLD)
         The architecture definition and application mapping have to be considered jointly, taking the full functional and architectural complexity into account
         In the case of a fixed target platform, SLD is reduced to the application mapping task, which is also called the partitioning of the application
  • 177.
      Orthogonalization of concerns with respect to all modeling attributes generally enables a divide-and-conquer approach to System Level Design
      Separation of interfaces and behavior according to the interface-based design paradigm fosters successive communication and structural refinement as well as IP reuse
      High modeling efficiency and simulation speed are mandatory to handle the high complexity of SoC designs
      Incorporation of hardware-specific concepts like timing, reactivity, parallelism, and determinism to express the impact of the platform architecture
      Incorporation of software-specific concepts like Object Oriented Programming, Operating System (OS) encapsulation, Inter Process Communication (IPC), process concurrency, as well as the creation, mutual preemption, and termination of processes, to enable smooth integration of the embedded software part
      Support for validation and verification: first, to gain evidence on the highest possible level of abstraction that the correct system is being developed and that all performance and cost requirements are met (validation). Later, the validated specification should be reused as a golden reference model for the subsequent refinement, IP integration and implementation steps (verification)
      Seamless transition between design phases and abstraction levels, from system to gates, to avoid long iteration cycles caused by gaps in the design flow
  • 178.
    The question remains: how to do it?
  • 179.
    More design aspects
  • 180.
      HW/SW co-simulation has been recognized as a necessary ingredient for HW/SW co-design.
      The first HW/SW co-simulation prototypes linked Hardware Description Language (HDL) simulators to an Instruction Set Simulator (ISS) executing the software part.
      Soon, HDL/ISS co-simulation environments became commercially available, and they are still widely employed.
      This HDL/ISS approach is severely limited by the slow simulation speed of the HDL simulator, especially in the case of large systems with several ISSes and significant hardware portions.
      In response, the concept of flexible hardware abstraction levels has been developed.
         Here, accuracy can be traded against simulation speed.
  • 181.
      Maximum simulation speed can be achieved by using compiled ISS technology together with highly abstract functional SystemC models of the hardware part
  • 182.
      The original goal of HW/SW co-design was to reach the same degree of tool automation known from RTL synthesis, i.e. a formalized system specification is automatically partitioned and synthesized to the optimal target architecture
      Automated HW/SW partitioning and system synthesis have never gained industrial relevance
         The partitioning decision metric is restricted to worst-case execution time
         Other important metrics like average performance, cost, and power dissipation are not taken into account
         Even the worst-case execution time proved to be hard to estimate in the general case of parallel, data-dependent, and interleaved software execution
      HW/SW partitioning and automated synthesis are still not recognized as a dominant issue
         System architects are instead interested in the impact of a specific target architecture on performance
      To partly automate this mapping,
         Communication Synthesis
         HW/SW Interface Synthesis
        emerged as new branches of HW/SW co-design
  • 183.
      Communication synthesis: techniques for the analysis of communication requirements and synthesis of the communication architecture
      As of today, communication analysis and synthesis techniques need further advancement to cope with emerging Network-on-Chip (NoC) architectures.
      One attempt is to instantiate the NoC library elements (routers, network interfaces, links) from a high-level view of the SoC floorplan
      Selection of the actual library elements can be done in different ways:
         In an application-centric approach, the network topology can be generated from a communication graph of the application
         In an architecture-centric approach, the communication architecture can be refined from an abstract channel view via a network topology view towards a micro-architecture view
      So far the analysis of Network-on-Chip architectures is performed using handcrafted simulation models, which are mostly based on SystemC
      The absence of standardized APIs, abstraction levels and modeling frameworks beyond the plain SystemC language so far hinders the creation of interoperable IP models for NoC architectures.
      Some of the current projects working on a unified modeling environment for the exploration of NoC architectures are discussed in section 5.3.3 below.
  • 184.
      HW/SW interface synthesis: here, the designer decides on the partitioning and architecture mapping
      The realization of these decisions is supported by automating the tedious task of generating the required software driver functions as well as the hardware glue logic
      Recently the technology has been ported to SystemC
  • 186.
      The MP-SoC platform phase is concerned with:
         System architecture specification
         Application mapping
      Abstraction concepts on this level have to support the joint consideration of application and architecture
      The high level of detail inherent to Register Transfer Level (RTL) implementation models prohibits investigation and optimization across heterogeneous communication and processing elements
      Significant research effort has been spent on the definition of an appropriate System Level Design language.
      Today SystemC is generally considered the standard language for all kinds of SLD tasks.
  • 187.
      SystemC was initially conceived to replace VHDL and Verilog as a Hardware Description Language
         For this reason it naturally provides all hardware-specific concepts, e.g. time, parallelism, and hierarchy
      With version 2.0, SystemC was thoroughly revised to become a fully elaborated actor-oriented design language
      The incorporated Interface Method Call (IMC) principle enables a clean separation of interfaces and behavior as well as orthogonalization of further modeling attributes
      All kinds of methodology-specific and application-domain-specific Models of Computation (MoC) can be implemented on top of the generic event-driven SystemC simulator
      SystemC 2.0 enables a smooth transition from the functional phase to the MP-SoC platform phase, e.g. hybrid simulation of an architecture model in the context of an algorithmic data-flow model
  • 188.
      Since SystemC is a native C++ library, it inherently supports Object Oriented Programming
      Version 2.1 of the language has become an official IEEE standard (IEEE 1666)
      Development of the Transaction Level Modeling (TLM) kit
      Synthesizable subset of SystemC
  • 190.
      The characteristic property of TLM:
         The pin-level communication interface of RTL models is replaced by a set of interface methods.
         This IMC-based communication mechanism is provided by all actor-oriented specification languages
  • 191.
      SystemC-based TLM has demonstrated its potential in terms of increased simulation speed and modeling efficiency
      The basic TLM API consists of a bidirectional transport interface and a set of unidirectional put and get interfaces
      The bidirectional transport has blocking synchronization
         The implementation of the interface is allowed to call wait(...)
      The unidirectional interfaces are available in a blocking and a non-blocking version
      These interfaces can be seen as a foundation layer for the creation of more advanced TLM interfaces, which serve a specific methodology or model a specific communication protocol
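The difference between the bidirectional transport and the unidirectional put/get interfaces can be sketched in plain Python. The method names below (`transport`, `put`, `get`, `nb_put`, `nb_get`) loosely follow the wording of the text but are an illustration only; they are not the C++ signatures of the SystemC TLM library, and the `Memory` and `TransactionFifo` classes and the request tuple format are assumptions made for this example.

```python
import queue


class Memory:
    """Target with a bidirectional, blocking transport method."""

    def __init__(self):
        self._cells = {}

    def transport(self, request):
        """Take a request and return only once the response is complete."""
        command, address, data = request
        if command == "write":
            self._cells[address] = data
            return ("ok", None)
        return ("ok", self._cells.get(address, 0))


class TransactionFifo:
    """Unidirectional put/get interfaces, blocking and non-blocking."""

    def __init__(self, depth=1):
        self._queue = queue.Queue(maxsize=depth)

    def put(self, transaction):
        self._queue.put(transaction)   # blocks while the FIFO is full

    def get(self):
        return self._queue.get()       # blocks while the FIFO is empty

    def nb_put(self, transaction):
        """Non-blocking put: report failure instead of waiting."""
        try:
            self._queue.put_nowait(transaction)
            return True
        except queue.Full:
            return False

    def nb_get(self):
        """Non-blocking get: return (success, transaction)."""
        try:
            return True, self._queue.get_nowait()
        except queue.Empty:
            return False, None


# Hypothetical usage.
memory = Memory()
write_response = memory.transport(("write", 0x10, 7))
read_response = memory.transport(("read", 0x10, None))

fifo = TransactionFifo(depth=1)
first_put = fifo.nb_put("txn-1")    # succeeds, the FIFO was empty
second_put = fifo.nb_put("txn-2")   # fails, the FIFO is full
```

With `transport`, request and response travel in one call, so the caller is suspended for the whole transaction. The put/get pair decouples the two directions, and the non-blocking variants let the caller decide what to do when the channel is not ready.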
  • 192.
      The two cycle-level TLM layers:
         Bus Accurate (BA)
         Cycle Callable (CC)
      These levels are particularly suitable for creating a cycle-accurate prototype of the system architecture
      The (usually cycle-accurate) Instruction Set Simulators (ISS) of the programmable architectures are connected to cycle-accurate and bit-accurate models of memories, communication resources and peripherals
  • 193.
      Difference between BA and CC:
         BA captures a transaction within a single method call
         CC models provide separate methods for every phase of a transaction
      The Programmer's View (PV) abstraction level addresses early integration of (usually instruction-accurate) ISSes for SW development purposes
      PV provides a bit-accurate and address-map-accurate view of the MP-SoC architecture context for the programmable processing elements
      PV is based on the bidirectional blocking transport API
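The BA versus CC distinction can be made tangible with two toy memory models. In the first, a read is one method call; in the second, the same read is split into phases that the caller drives one by one. The phase names (`request`, `clock`, `response`) and the one-cycle latency are hypothetical choices for this sketch and do not come from any standard.

```python
class BusAccurateMemory:
    """BA style: the whole transaction is captured in a single call."""

    def __init__(self, cells):
        self.cells = cells

    def read(self, address):
        return self.cells[address]


class CycleCallableMemory:
    """CC style: one method per transaction phase, driven cycle by cycle."""

    def __init__(self, cells):
        self.cells = cells
        self._pending_address = None
        self._data = None

    def request(self, address):
        """Address phase: latch the address, no data is available yet."""
        self._pending_address = address

    def clock(self):
        """Advance one cycle: a pending request produces its data."""
        if self._pending_address is not None:
            self._data = self.cells[self._pending_address]
            self._pending_address = None

    def response(self):
        """Data phase: return the data if ready, otherwise None."""
        data, self._data = self._data, None
        return data


# Hypothetical usage with a single memory cell.
cells = {4: 9}
ba_value = BusAccurateMemory(cells).read(4)

cc = CycleCallableMemory(cells)
cc.request(4)
too_early = cc.response()   # the data phase has not happened yet
cc.clock()
cc_value = cc.response()
```

Both models return the same value, but only the CC model lets the caller observe what happens in each cycle, at the price of more method calls per transaction.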
  • 194.
      The Open Core Protocol International Partnership (OCP-IP) is getting a lot of traction throughout the industry
      OCP-IP provides a highly configurable SoC protocol, and its System Level Design working group has worked on Transaction Level Modeling from the early days
  • 195.
      Lowest level: Transaction Layer 1 (TL1)
         Provides a fully cycle-accurate model of the OCP protocol
         Fully aligned with the CC abstraction level from OSCI
      Next higher level: Transaction Layer 2 (TL2)
         Represents basically a cycle-approximate abstraction of the OCP protocol
         The API contains a large number of OCP-specific features, e.g. thread-busy, handshake timing, or sideband signals
         The timing is not cycle accurate, but can be annotated to a near-cycle-accurate level
      Highest level: Transaction Layer 3 (TL3)
         Protocol-agnostic subset of TL2
         The API is limited to a concise set of primitives
         Models timing-approximate on-chip communication
  • 196.
      PV TLM platforms for early SW development, as well as cycle-level TLM for HW/SW and TLM/RTL co-verification, are successfully deployed throughout the industry
      However, both use cases solve only parts of the challenges during the MP-SoC design phase
         In particular, the architecture definition and task partitioning are not adequately addressed
      PV platforms simulate very fast and are well suited for SW development
         Unfortunately, they do not contain sufficient timing information for architectural investigations
         The blocking semantics of the underlying bidirectional transport API hinders the smooth annotation of further timing information
  • 197.
      Cycle-accurate models of the SoC platform are too detailed and too slow for architecture definition and task partitioning
         First, the effort to create such a cycle-accurate model of the complete platform is far too high to allow for the investigation of a large number of architecture and application mapping alternatives
         Second, the reachable simulation speed, in the order of 100k cycles per second, is not sufficient for the analysis of a large number of design parameter choices
      As a result, the exploration of broad design spaces is still a cumbersome process in cycle-level TLM-based design flows
      Cycle-level TLM communication models have architecture-specific interfaces.
         Thus, every time the designer wants to explore a new communication architecture, the interface of the connected functional models has to be changed
  • 198.
      For this reason the Design Space Exploration framework deploys a generic synchronization interface, which provides the same primitives as the newly standardized OCP TL3 API
      Obviously, the TL3 API presents the best fit for this purpose
         It is compliant with the OSCI TLM standard
         Additionally, it is of reasonable complexity, and yet offers sufficient expressiveness to meet the accuracy requirements for design space exploration
      By deploying SystemC-based Transaction Level Modeling, the framework is nicely integrated into the flourishing ESL ecosystem.
      This method is interoperable with the PV and cycle-accurate modeling methodologies and can benefit from the commercial tool support, available IP models, and established ESL design methodologies
  • 199.
      Component Based Design
         Founded on the assumption that the processing elements and communication templates are available IP blocks
      Communication Based Design:
         Envisions MP-SoC platform design as a composition of reusable IP blocks
         Different from Component Based Design
         Omits the consideration of processing elements
         Is exclusively focused on the conceptualization and implementation of the communication architecture
         Communication Based Design can be seen as the corresponding design paradigm to match emerging NoC architectures
      Design Space Exploration (DSE) Environment
         The goal is to take early design decisions with respect to system architecture and application mapping on the basis of an abstract performance model
         The embedded application needs to be modeled together with the MP-SoC architecture at a high level of abstraction
  • 200.
    Introduction to SoC Design Space Exploration (DSE) Methodology
  • 201.
      The ultimate goal is to meet the System Level Design requirements as specified and to cope with the full architectural complexity of emerging MP-SoC architectures
  • 202.
      The MP-SoC framework follows the Y-chart principle
         A set of functional application models is merged with a set of architecture models in a dedicated mapping step
      The developed embodiment of the Y-chart principle is called Virtual Architecture Mapping (VAM), which comprises:
         A well-defined abstraction level above cycle-level TLM for efficient modeling of embedded applications
         A set of generic, parameterizable architecture models, which capture the notion of shared and resource-limited architectural fabrics for communication and computation
         A rigorous definition of a timing model that embodies the performance of a selected application-to-architecture mapping
         An MP-SoC simulation framework featuring a declarative mapping mechanism to minimize turn-around times during the iterative architecture exploration cycle
         A comprehensive set of analysis tools for functional and performance validation
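The Y-chart idea of keeping application, architecture and mapping separate can be shown with a toy performance estimate. Everything in the sketch below is hypothetical: the task names, the cycle counts, the per-resource speed-up factors and the simple additive load model are invented for illustration and are much cruder than the timing model of a real exploration framework such as the VAM approach described above.

```python
def evaluate_mapping(task_cycles, mapping, speedup):
    """Estimate the load per resource for one application-to-architecture mapping.

    task_cycles: application model, task -> cycles on a reference processor.
    mapping:     declarative mapping, task -> resource name.
    speedup:     architecture model, resource -> speed relative to the reference.
    Tasks mapped to the same resource are assumed to run one after another.
    """
    load = {}
    for task, resource in mapping.items():
        load[resource] = load.get(resource, 0) + task_cycles[task] / speedup[resource]
    return load


# Hypothetical application and architecture models.
task_cycles = {"fft": 4000, "viterbi": 6000, "ctrl": 1000}
speedup = {"dsp": 4, "risc": 1}

# Two candidate mappings: only this dictionary changes between experiments.
mapping_a = {"fft": "dsp", "viterbi": "dsp", "ctrl": "risc"}
mapping_b = {"fft": "risc", "viterbi": "dsp", "ctrl": "risc"}

load_a = evaluate_mapping(task_cycles, mapping_a, speedup)
load_b = evaluate_mapping(task_cycles, mapping_b, speedup)
bottleneck_a = max(load_a.values())
bottleneck_b = max(load_b.values())
```

Because the mapping is plain data, trying another alternative does not require touching the application or architecture models, which is the reason a declarative mapping mechanism shortens the exploration loop.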