2. Introduction to Computer Architecture
Embedded computing system:
any device that includes a programmable computer
but is not itself a general-purpose computer.
Take advantage of application characteristics to
optimize the design:
don’t need all the additional features of a general-purpose computer.
6. Microprocessor varieties
Microcontroller: includes I/O devices, on-board memory.
Digital signal processor (DSP): microprocessor optimized
for digital signal processing.
Typical embedded word sizes: 8-bit, 16-bit, 32-bit.
7. Characteristics of embedded systems
Sophisticated functionality.
Real-time operation.
Low manufacturing cost.
Low power.
Designed to tight deadlines by small teams.
8. Functional complexity
Often have to run sophisticated algorithms or multiple
algorithms.
Cell phone, laser printer.
Often provide sophisticated user interfaces.
9. Real-time operation
Must finish operations by deadlines.
Hard real time: missing deadline causes
failure.
Soft real time: missing deadline results in
degraded performance.
Many systems are multi-rate: must handle
operations at widely varying rates.
10. Non-functional requirements
Many embedded systems are mass-market items that
must have low manufacturing costs.
Limited memory, microprocessor power, etc.
Power consumption is critical in battery-powered
devices.
Excessive power consumption increases system cost even
in wall-powered devices.
11. Why use microprocessors?
Alternatives: field-programmable gate arrays (FPGAs),
custom logic, etc.
Microprocessors are often very efficient:
can use same logic to perform many different functions.
Microprocessors simplify the design of families of
products.
12. The performance paradox
Microprocessors use much more logic to
implement a function than does custom logic.
But microprocessors are often at least as fast:
heavily pipelined;
large design teams;
aggressive VLSI technology.
13. Platforms
Embedded computing platform: hardware
architecture + associated software.
Many platforms are multiprocessors.
Examples:
Single-chip multiprocessors for cell phone baseband.
Automotive network + processors.
14. The physics of software
Computing is a physical act.
Software doesn’t do anything without hardware.
Executing software consumes energy,
requires time.
To understand the dynamics of software
(time, energy), we need to characterize the
platform on which the software runs.
15. Characterizing performance
We need to analyze the system at several
levels of abstraction to understand
performance:
CPU.
Platform.
Program.
Task.
Multiprocessor.
16. Challenges in embedded
system design
How much hardware do we need?
How big is the CPU? Memory?
How do we meet our deadlines?
Faster hardware or cleverer software?
How do we minimize power?
Turn off unnecessary logic? Reduce memory
accesses?
17. Design goals
Performance.
Overall speed, deadlines.
Functionality and user interface.
Manufacturing cost.
Power consumption.
Other requirements (physical size, etc.)
19. Top-down vs. bottom-up
Top-down design:
start from most abstract description;
work to most detailed.
Bottom-up design:
work from small components to big system.
Real design uses both techniques.
20. Requirements
Plain language description of what the user
wants and expects to get.
May be developed in several ways:
talking directly to customers;
talking to marketing representatives;
providing prototypes to users for comment.
21. Functional vs. non-functional
requirements
Functional requirements:
output as a function of input.
Non-functional requirements:
time required to compute output;
size, weight, etc.;
power consumption;
reliability;
etc.
22. UML
Object-oriented design.
Unified Modeling Language (UML).
Object-oriented (OO) design: A generalization of object-
oriented programming.
Object = state + methods.
State provides each object with its own identity.
Methods provide an abstract interface to the object.
23. UML object
[Figure: UML object d1 of class Display. Attributes: pixels (array[] of pixels), elements, menu_items. Comment on the diagram: pixels is a 2-D array. Callouts mark the object name, class name, attributes, and comment.]
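As a rough illustration of "object = state + methods", the Display object above could be sketched in Python as follows. The constructor arguments, the list types chosen for elements and menu_items, and the set_pixel method are assumptions made for the example; the slide only names the attributes.

```python
class Display:
    """Class from the UML example: state (attributes) plus methods."""

    def __init__(self, width, height):
        # pixels is a 2-D array, as the comment on the diagram notes
        self.pixels = [[0] * width for _ in range(height)]
        self.elements = []      # assumed to be a list
        self.menu_items = []    # assumed to be a list

    def set_pixel(self, x, y, value):
        # hypothetical method: an abstract interface to the pixel state
        self.pixels[y][x] = value


d1 = Display(width=4, height=2)   # object name d1, class name Display
d1.set_pixel(1, 0, 255)
```

Callers change the pixel state only through the method, which is the "abstract interface" the previous slide refers to.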
25. Pipeline
A pipeline is a set of data processing elements connected in series:
the output of one element is the input of the next one.
The elements of a pipeline are often executed in parallel or in
time-sliced fashion.
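A software analogy may help here. The sketch below chains three Python generators so that each stage consumes the output of the previous one. The stage names (source, scale, offset) are made up for the example, and Python interleaves the stages one item at a time, which corresponds to the time-sliced case rather than true parallel execution.

```python
def source(values):
    # first stage: emits the raw items
    for v in values:
        yield v


def scale(stream, factor):
    # second stage: input is the output of the previous stage
    for v in stream:
        yield v * factor


def offset(stream, amount):
    # third stage
    for v in stream:
        yield v + amount


# Connect the elements in series.
pipeline = offset(scale(source([1, 2, 3]), factor=10), amount=1)
result = list(pipeline)
```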
27. von Neumann architecture
Memory holds data, instructions.
Central processing unit (CPU) fetches instructions from
memory.
Separate CPU and memory distinguishes programmable
computer.
CPU registers help out: program counter (PC), instruction
register (IR), general-purpose registers, etc.
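The fetch-execute idea can be sketched as a toy machine in Python. The instruction names and tuple encoding below are invented for the example and are not ARM; the point is only that one memory holds both instructions and data, and that the CPU uses a PC, an IR, and general-purpose registers.

```python
def run(memory):
    """Fetch-execute loop over a single memory holding instructions and data."""
    pc = 0                 # program counter
    regs = [0] * 4         # general-purpose registers r0..r3
    while True:
        ir = memory[pc]    # fetch the next instruction into the IR
        pc += 1
        op = ir[0]
        if op == "LOAD":       # ("LOAD", rd, addr): rd = memory[addr]
            regs[ir[1]] = memory[ir[2]]
        elif op == "ADD":      # ("ADD", rd, rn, rm): rd = rn + rm
            regs[ir[1]] = regs[ir[2]] + regs[ir[3]]
        elif op == "STORE":    # ("STORE", rs, addr): memory[addr] = rs
            memory[ir[2]] = regs[ir[1]]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown instruction: {ir!r}")


memory = [
    ("LOAD", 0, 5),     # address 0
    ("LOAD", 1, 6),     # address 1
    ("ADD", 2, 0, 1),   # address 2
    ("STORE", 2, 7),    # address 3
    ("HALT",),          # address 4
    20,                 # address 5: data
    22,                 # address 6: data
    0,                  # address 7: result is stored here
]
regs = run(memory)
```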
30. von Neumann vs. Harvard
Harvard can’t use self-modifying code.
Harvard allows two simultaneous memory
fetches.
Most DSPs use Harvard architecture for
streaming data:
greater memory bandwidth;
more predictable bandwidth.
31. RISC vs. CISC
Complex instruction set computer (CISC):
many addressing modes;
many operations.
Reduced instruction set computer (RISC):
load/store;
pipelinable instructions.
34. Assembly language
One-to-one with instructions (more or less).
Basic features:
One instruction per line.
Labels provide names for addresses (usually in first column).
Instructions often start in later columns.
Comments run to end of line.
35. ARM assembly language example
label1 ADR r4,c
LDR r0,[r4] ; a comment
ADR r4,d
LDR r1,[r4]
SUB r0,r0,r1 ; comment
Pseudo-ops
Some assembler directives don’t correspond directly to instructions:
Define current address.
Reserve storage.
Constants.
37. Endianness
Relationship between bit and byte/word ordering defines
endianness:
little-endian: bit 31 | byte 3 | byte 2 | byte 1 | byte 0 | bit 0
big-endian: bit 0 | byte 0 | byte 1 | byte 2 | byte 3 | bit 31
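A quick way to see the two byte orders is to serialize one 32-bit word both ways. The sketch below uses Python's standard struct module; the value 0x0A0B0C0D is an arbitrary example word.

```python
import struct

word = 0x0A0B0C0D   # byte 3 = 0x0A (most significant), byte 0 = 0x0D

# "<I" packs an unsigned 32-bit value little-endian, ">I" big-endian.
little = struct.pack("<I", word)   # least significant byte at lowest address
big = struct.pack(">I", word)      # most significant byte at lowest address

# Reading little-endian bytes as if they were big-endian scrambles the word.
misread = struct.unpack(">I", little)[0]
```

The same four bytes appear in both encodings, only in opposite order, which is why data exchanged between machines of different endianness must be byte-swapped.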
38. ARM data types
Word is 32 bits long.
Word can be divided into four 8-bit bytes.
ARM addresses can be 32 bits long.
Address refers to byte.
Address 4 starts at byte 4.
Can be configured at power-up as either little-
or big-endian mode.
40. ARM versions
ARM architecture has been extended over several
versions.
We will concentrate on ARM7.
41. ARM status bits
Arithmetic, logical, and shifting operations can set
the CPSR bits (data instructions do so when the S suffix is used, e.g. ADDS):
N (negative), Z (zero), C (carry), V (overflow).
Examples:
-1 + 1 = 0: NZCV = 0110.
0 - 1 = -1: NZCV = 1000.
15 + 10 = 25: NZCV = 0000.
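The flag values can be checked with a small model of 32-bit flag-setting addition and subtraction. This is a sketch, not an ARM emulator: the function names adds and subs are chosen to echo the ARM mnemonics, and the model assumes the ARM convention that C after a subtraction means "no borrow".

```python
MASK = 0xFFFFFFFF  # 32-bit word


def flags(result, carry, overflow):
    n = (result >> 31) & 1          # N: sign bit of the result
    z = int(result == 0)            # Z: result is zero
    return f"{n}{z}{int(carry)}{int(overflow)}"


def adds(a, b):
    a &= MASK
    b &= MASK
    total = a + b
    result = total & MASK
    carry = total > MASK            # C: carry out of bit 31
    # V: both operands differ in sign from the result
    overflow = (((a ^ result) & (b ^ result)) >> 31) & 1
    return result, flags(result, carry, overflow)


def subs(a, b):
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    carry = a >= b                  # C: set when no borrow occurs
    # V: operands differ in sign and the result's sign differs from a
    overflow = (((a ^ b) & (a ^ result)) >> 31) & 1
    return result, flags(result, carry, overflow)
```

For example, adds(-1, 1) yields NZCV 0110, subs(0, 1) yields 1000, adds(15, 10) yields 0000, and adds(0x7FFFFFFF, 1) yields 1001 (signed overflow).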
42. ARM data instructions
Basic format:
ADD r0,r1,r2
Computes r1+r2, stores in r0.
Immediate operand:
ADD r0,r1,#2
Computes r1+2, stores in r0.
43. ARM data instructions
ADD, ADC : add (w. carry)
SUB, SBC : subtract (w. carry)
RSB, RSC : reverse subtract (w. carry)
MUL, MLA : multiply (and accumulate)
AND, ORR, EOR : bitwise AND, OR, exclusive OR
BIC : bit clear
LSL, LSR : logical shift left/right
ASL, ASR : arithmetic shift left/right
ROR : rotate right
RRX : rotate right extended with C
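The shift, rotate, and bit-clear operations are easy to confuse, so here is a sketch of their effect on a 32-bit word. The Python function names mirror the mnemonics but are otherwise illustrative, and shift amounts are assumed to be in the range 0 to 31.

```python
MASK = 0xFFFFFFFF  # 32-bit word


def lsl(x, n):
    # logical shift left: zeros enter from the right
    return (x << n) & MASK


def lsr(x, n):
    # logical shift right: zeros enter from the left
    return (x & MASK) >> n


def asr(x, n):
    # arithmetic shift right: the sign bit is replicated
    x &= MASK
    if x & 0x80000000:
        x -= 1 << 32        # reinterpret as a negative number
    return (x >> n) & MASK


def ror(x, n):
    # rotate right: bits shifted out on the right re-enter on the left
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK


def bic(x, m):
    # bit clear: clear every bit of x that is set in m
    return x & ~m & MASK
```

Comparing lsr(0x80000000, 4) with asr(0x80000000, 4) shows the difference between the logical and arithmetic right shifts: the first gives 0x08000000, the second 0xF8000000.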
46. Example: C assignments
C:
x = (a + b) - c;
Assembler:
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
ADR r4,b ; get address for b, reusing r4
LDR r1,[r4] ; get value of b
ADD r3,r0,r1 ; compute a+b
ADR r4,c ; get address for c
LDR r2,[r4] ; get value of c
47. C assignment, cont’d.
SUB r3,r3,r2 ; complete computation of x
ADR r4,x ; get address for x
STR r3,[r4] ; store value of x
48. Example: C assignment
C:
y = a*(b+c);
Assembler:
ADR r4,b ; get address for b
LDR r0,[r4] ; get value of b
ADR r4,c ; get address for c
LDR r1,[r4] ; get value of c
ADD r2,r0,r1 ; compute partial result
ADR r4,a ; get address for a
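LDR r0,[r4] ; get value of a
MUL r2,r0,r2 ; compute final value for y
ADR r4,y ; get address for y
STR r2,[r4] ; store value of y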
51. DMA
Direct memory access (DMA) performs data
transfers without executing instructions.
CPU sets up transfer.
DMA engine fetches, writes.
DMA controller is a separate unit.
54. System bus configurations
Multiple busses allow parallelism:
Slow devices on one bus.
Fast devices on separate bus.
A bridge connects two busses.
[Figure: two buses joined by a bridge. Labels: CPU, memory, high-speed device, bridge, slow device (twice).]
60. CUDA
• ‘Compute Unified Device Architecture’
– Hardware and software architecture for issuing and managing computations on GPU
• Massively parallel architecture
– over 8000 threads is common
• C for CUDA (C++ for CUDA)
– C/C++ language with some additions and restrictions
• Enables GPGPU – ‘General Purpose Computing on GPUs’
61. GPU: a multithreaded coprocessor
SM: streaming multiprocessor
32 SPs per SM (or 16, 48 or more)
Fast local ‘shared memory’ (shared between SPs): 16 KiB (or 64 KiB)
SP: scalar processor (‘CUDA core’)
Executes one thread
[Figure: an SM containing a 4 x 4 grid of SPs and shared memory, connected to global memory (on device).]
62. GDDR memory
512 MiB - 6 GiB
GPU: SMs
30 SMs on GT200,
14 SMs on Fermi
For example, GTX 470:
14 SMs x 32 cores
= 448 cores on a GPU
63. Parallelization and memory
• Parallelization
– Decomposition to threads
• Memory
– shared memory, global memory
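Decomposition to threads usually means giving each thread one element of the data to work on. The sketch below emulates this in plain Python: the index computation mirrors the usual CUDA idiom blockIdx.x * blockDim.x + threadIdx.x, while the launch helper and the vector_add kernel are hypothetical names for the example, and the "threads" run one after another rather than in parallel.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair, sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    i = block_idx * block_dim + thread_idx   # global index of this thread
    if i < len(out):                         # guard: last block may be partial
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n

block_dim = 4                                  # threads per block (assumed)
grid_dim = (n + block_dim - 1) // block_dim    # enough blocks to cover n
launch(vector_add, grid_dim, block_dim, a, b, out)
```

Each thread touches only its own element, so the threads have no interdependencies and need no synchronization.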
64. Important Things To Keep In Mind
Avoid divergent branches
Threads that execute together on a single SM (a warp)
must be executing the same code
Code that branches heavily and
unpredictably will execute slowly
Threads should be independent as
much as possible
Synchronization and communication
can be done efficiently only for
threads of a single multiprocessor
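The cost of a divergent branch can be illustrated with a toy model. It assumes that a group of threads forced to execute the same code runs every branch path that at least one of its threads takes, one path after the other. The function name and the step counts are invented for the example; this is a simplification, not a measurement of real hardware.

```python
def lockstep_cost(conditions, then_cost, else_cost):
    """Steps needed by a group of threads that must execute the same code."""
    cost = 0
    if any(conditions):         # at least one thread takes the 'then' path
        cost += then_cost
    if not all(conditions):     # at least one thread takes the 'else' path
        cost += else_cost
    return cost


# All 32 threads agree: only one path is executed.
uniform = lockstep_cost([True] * 32, then_cost=10, else_cost=10)

# Threads alternate: both paths are executed, one after the other.
divergent = lockstep_cost([i % 2 == 0 for i in range(32)],
                          then_cost=10, else_cost=10)
```

In this model the divergent group takes twice as many steps as the uniform one, even though each individual thread only needs one of the two paths.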
65. Summary
Parallelization
Decomposition to threads
Memory
shared memory, global memory
Enormous processing power
Avoid divergence
Thread communication
Synchronization, no interdependencies
Editor's Notes
Word size: the number of bits processed by the computer’s CPU in one go.
Mass-market: a market for goods that are produced in large quantities.
A non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. Non-functional requirements are contrasted with functional requirements, which define specific behavior or functions.
VLSI = very large-scale integration, the process of integrating hundreds of thousands of components on a single silicon chip.
Abstraction = a vague idea; an abstract expression.
Levels of abstraction.
Plain = simple.
Object-oriented design describes a number of interacting objects rather than one single large (monolithic) block.
Data sets that arrive continuously and periodically are called streaming data.
Bandwidth is the amount of data that can be transmitted in a fixed amount of time. For digital devices, the bandwidth is usually expressed in bits per second (bps) or bytes per second.
The set of registers available for use by programs is called the programming model, also known as the programmer model.
Overflow (V): arithmetic overflow has occurred in an operation, indicating that the signed two's-complement result would not fit in the number of bits used for the operation (the ALU width). Some architectures may be configured to automatically generate an exception on an operation resulting in overflow.