2. History of the quartz electronic watch
• "The watch, almost more than the steam engine, was the real protagonist
of the Industrial Revolution" (Lewis Mumford, American social philosopher)
• In December 2007, we celebrated the 40th anniversary of the first
electronic watch, a Swiss quartz watch named Beta, developed by the
Centre Electronique Horloger (CEH)
• The first quartz watch was a Swiss wristwatch presented in 1967
• It was exactly 20 years after the invention of the transistor
• In Switzerland, the competencies in low-power electronics come directly
from the watch industry
3. Research in 1962-67: Time Base
• Development of a quartz resonator, a very risky project
• The main problem was miniaturizing such a resonator at 8 kHz
1967: Beta 2
[Figure: on the left, the electromechanical part; on the right, the printed circuit with the IC and the quartz]
4. Beta 21 from OMEGA
First quartz watch: "OMEGA Constellation" Electroquartz, f 8192,
vibrating motor 256 Hz, caliber 1301 (Beta 21)
Joint project of 20 Swiss watchmakers
Analog display, 1972, width: 37 mm, steel, ref.: A-32121
Price: EUR 980
5. CEH Research Projects after 1967
• Digital adjustment of the quartz oscillator, called Beta 3 and 4, for which
some pulses were removed to achieve exactly 32’768 Hz
• Quartz oscillators, static and dynamic frequency dividers designed as
asynchronous speed-independent circuits
• New quartz, such as the ZT, as well as new displays composed of LED for
analog displays
• ROM, RAM and EEPROM memories and the first RISC-like watch
microprocessors, before the name “RISC” was introduced in 1980 by
Berkeley
[Figure: single bipolar integrated circuit, called ODC-04, feature size 6 µm,
containing about 110 components]
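The digital adjustment listed on this slide (removing pulses so that the divider chain sees exactly 32’768 Hz on average) can be sketched as follows. This is only an illustration of the principle, not the Beta 3/4 circuit: the helper names, the 60-second correction window and the example frequency are assumptions.

```python
NOMINAL_HZ = 32768  # target frequency quoted on the slide

def pulses_to_remove(actual_hz, window_s=60):
    """Pulses to suppress per correction window (hypothetical helper).

    Assumes the quartz runs slightly fast, so the average rate can only
    be corrected downwards by dropping pulses.
    """
    excess = (actual_hz - NOMINAL_HZ) * window_s
    if excess < 0:
        raise ValueError("oscillator is slow; pulse removal cannot correct it")
    return round(excess)

def average_hz(actual_hz, window_s=60):
    """Average frequency seen by the divider after pulse removal."""
    kept = actual_hz * window_s - pulses_to_remove(actual_hz, window_s)
    return kept / window_s
```

With a quartz running at 32768.5 Hz, for instance, 30 pulses are dropped every 60 seconds and the divider sees 32768 Hz on average.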
7. Copyright 2007 CSEM | Titre | Auteur | Page 6
A Swiss History: Watch Microcontrollers
• A watch circuit contained about 2’000 MOS transistors at the time
• In 1971, 1st microprocessor (4004)
• In 1974, the question arose: microprocessors for watches too?
• In 1976, conference in Switzerland on
the “arrival of microprocessors”!
• In 1978, uP working group with
CEH, Uni Ne, EPFL, watchmakers
• Goal: to find a uP architecture specific to
electronic watches, mainly with very low power
consumption
Gate Matrix Logic
CSEM 1985
8. Binary Decision Machine (BDM)
• BDD (Binary Decision Diagram)
• EPFL Research of
Prof. D. Mange, LSL
• Two instructions
IF and DO
• One can add:
CALL and RETURN
• Karnaugh Table either in hardware or in software
• In software: BDD, executed by a BDM
• Very simple uP Architecture
Karnaugh table of z(a, b, c):
c \ ab   00  01  11  10
0         1   1   0   1
1         0   1   1   0
[Figure: binary decision diagram for z, with test nodes on a, b and c]
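A binary decision machine with only IF and DO instructions can be sketched in a few lines. The program below encodes one possible decision diagram for the Karnaugh table on this slide, read with the first row as c = 0 (an assumption, since the figure is flattened). The tuple encoding, the addresses and the function name are illustrative choices, not the EPFL or CEH format.

```python
# IF var, addr0, addr1 : go to addr0 if the input is 0, else to addr1
# DO value             : output z = value and stop
PROGRAM = [
    ("IF", "c", 1, 3),  # 0: test c
    ("IF", "a", 4, 2),  # 1: c = 0, test a
    ("IF", "b", 4, 5),  # 2: c = 0 and a = 1, test b
    ("IF", "b", 5, 4),  # 3: c = 1, test b
    ("DO", 1),          # 4: z = 1
    ("DO", 0),          # 5: z = 0
]

def run_bdm(program, inputs):
    """Execute IF/DO instructions until a DO produces the output."""
    pc = 0
    while True:
        instr = program[pc]
        if instr[0] == "IF":
            _, var, addr0, addr1 = instr
            pc = addr1 if inputs[var] else addr0
        else:
            return instr[1]

# Karnaugh table as read from the slide: KARNAUGH[c][ab]
KARNAUGH = {
    0: {"00": 1, "01": 1, "11": 0, "10": 1},
    1: {"00": 0, "01": 1, "11": 1, "10": 0},
}
```

Each evaluation executes at most three IF instructions and one DO, which is why such a machine needs neither an ALU nor a wide instruction set for pure decision logic.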
9. Binary Decision Machine
• Instruction format: a single but very long word (similar to RISC, but earlier)
• This was already the case for the first mainframe computers of the fifties
• RISC today is simply re-discovering old technology
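[Figure: BDM block diagram: ROM addressed by the PC (with +1 incrementer, stack and address multiplexer), test multiplexer on the inputs, hardware blocks (HW) driven by the very long instruction word]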
10. Number of clock cycles: benchmark
• An analysis showed that these watch processors executed about 100 clock
cycles for a watch application (each second, to increment the seconds,
minutes, hours, …), while a conventional microprocessor such as the
Intel 8048 needed about 2'000 clock cycles for the same task.
• So in energy: 70 times more efficient than an 8048 uP
• Watch uP: single instruction word, from 12 bits to 18 bits
• Instruction sets of 6 to 20 instructions (true RISC!!)
• About 20'000 MOS transistors
• In the 6 micron technology of the time, this was about 50 mm2 of silicon,
a very big circuit!
11. Architecture
• It was a BDM and one datapath
• First uP: failure
• Layout too difficult!
• Too complex
• Sagrada Familia
[Figure: block diagram of the first uP: sequencer (4 PCs, +1 incrementer, test multiplexer, stack pointer with PUSH/POP), ROM 512*16, instruction register IR15:0, RAM 30*8, ALU on an 8-bit bus (BUS 7:0), input and output ports, clock phases Ø1, Ø3 and Ø4]
12. The 1st microprocessor: COMBO (1983)
• 800 instructions
• 16-bit instructions
• 7-bit data
• 20K MOS
• 40 mm2
• 6 microns
• 1.5 Volt
• 0.4 µA at 16 kHz
13. Comparison
Year  Micro  Technology  Feature size (µm)  Nb MOS    Address
1971  4004   P-MOS       8                  2’300     4K
1972  8008   P-MOS       8                  3’500     16K
1974  8080   N-MOS       6                  5’000     64K
1976  8085   N-MOS       4                  6’000     64K
1978  8086   N-MOS       3                  29’000    1M
1982  80286  N-MOS       2.3                130’000   16M
1985  80386  CMOS        2                  275’000   4096M
At CSEM:
1985: 2nd uP in 4 µm with 20 instructions of 17 bits, 24K MOS, 20 mm2, 0.4 µA at 1.5 Volt.
1987: uP ETA, 35'000 MOS
1990: other uP with 100'000 MOS.
14. Competition
• Competitors (AMI, Eurosil, Hewlett Packard, Intel, Mitsubishi, National, RCA
and Sharp) designed watch microprocessors that generally consumed more
than 4 or 5 µA, so 10 to 100 times more than the CEH watch microprocessors
• Electronic digital watches appeared around 1975.
• In 1977, the price of a digital watch dropped from more than $100 to $10.
• For Christmas 1976, TI sold LED watches with 5 functions for $9.95.
• Profits disappeared, as in the calculator market, and only 3 very big
companies remained in this market: Casio, Seiko and Texas Instruments. TI
cut its prices to push out Casio and Seiko.
• 20 years later, Intel President Gordon Moore still had an old Microma watch
made by Intel (“my $30 million watch”, he said) to remember this
lesson.
• But Seiko and Casio were better able than TI to cut their prices, and it was
TI that was forced to leave this market.
15. Watch microprocessor PUNCH (1990-1993)
• Swiss watchmakers decided
to design a common watch uP
intended to be the heart of all Swiss
watches
• The architecture chosen is a
multitask machine
• This makes it possible to define several
independent tasks and to execute
them in pseudo-parallelism
• As soon as a task has to start, it is
started immediately
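[Figure: tasks of a watch application: motor control; counters S, M, H, D (seconds, minutes, hours, days); counters 1/100, 1/10, S, M; crown state machine; modes state machine; 1 Hz and 100 Hz time bases]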
16. Hardware Scheduler
[Figure: parallel tasks in a watch application (TIME, MODES, MOTOR) dispatched by a hardware scheduler to the microprocessor as tasks 1 to 4]
• Estimation: about 20% fewer executed instructions
• 103 assembly instructions (18 bits)
• 8-bit data
• The uP core contains 11'000 MOS
• A complete microcontroller with its memories contains about 150'000 MOS
• 800 MIPS/watt
17. Tasks execution
• The originality of the PUNCH lies in tasks that are executed
in pseudo-parallelism: one instruction of task 1 is executed,
then one of task 2, etc., and back to task 1.
• From 1 to 4 tasks can be defined, so it is also possible
to have a conventional monotask uP.
[Figure: principle of the multitask architecture, with tasks 1 to 3 starting at different times on a single processor: with one task, its instructions are executed continuously; with 2 tasks, their instructions are executed alternately; the same scheme as above applies with 3 tasks]
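The pseudo-parallel execution described on this slide (one instruction of task 1, then one of task 2, and so on) can be sketched as a round-robin over per-task program counters. This is a simplified model: the function name and the representation of tasks as lists of instruction labels are assumptions, and delayed task starts and the hardware scheduler's event handling are left out.

```python
MAX_TASKS = 4  # the slide states that 1 to 4 tasks can be defined

def interleave(tasks):
    """Return the order in which instructions of the given tasks execute.

    Each task is a list of instruction labels and has its own program
    counter, as in the quadrupled-PC architecture of the PUNCH.
    """
    if not 1 <= len(tasks) <= MAX_TASKS:
        raise ValueError("between 1 and 4 tasks are supported")
    pcs = [0] * len(tasks)  # one program counter per task
    order = []
    while any(pc < len(task) for pc, task in zip(pcs, tasks)):
        for i, task in enumerate(tasks):
            if pcs[i] < len(task):  # skip tasks that have finished
                order.append(task[pcs[i]])
                pcs[i] += 1
    return order
```

With a single task the loop degenerates into plain sequential execution, which corresponds to the monotask mode mentioned above.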
18. Architecture of the multitask PUNCH
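[Figure: PUNCH block diagram: ROM of N x 18 bits addressed by four program counters (Pc 0 to Pc 3); instruction register; process control (scheduler, stack pointers, router); event bank with I/O communication and external events; datapath with ALU, WR, four accumulators (Ac0 to Ac3) and four index registers (Ix0 to Ix3)]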
Three registers have to be quadrupled:
- the PC (program counter)
- the AC (accumulator)
- the IX (index register)
The resulting cost is reasonable.
In monotask mode, the 4 PCs are used as a stack.
19. Test Chip
• Used in wrist watches
• Other applications
• Belongs to the watchmakers,
so it is difficult to license
to other customers
• We think we can do
even better at reducing
power consumption: CoolRISC
20. Punch-based Tissot Two Timer
• It is my watch
21. 8-bit CoolRISC Microprocessor
• RISC instructions (single 22-bit word)
• Load/store architecture
• Bank of 16 registers (multitasking is not possible: 4*16 registers would be too many)
• More than 100 instructions
• Hardware stack, but also Branch & Link (call-return in software)
• 3-stage pipeline, CPI = 1 (clocks per instruction)
• Gated-clock technique (unused blocks are not clocked)
• Synthesized with Synopsys (I.P. core in VHDL, then logic synthesis)
• CoolRISC core: 20’000 MOS
22. CoolRISC 816 Instruction Set
Instruction formats (bits 21 to 0):

op<3> cc<3> addr<16>
  JUMP addr;     PC0 <-- addr
  JCC addr;      if cc then PC0 <-- addr

op<6> addr<16>
  CALL addr;     PCn <-- PCn-1, PC1 <-- PC0+1, PC0 <-- addr
  CALLS addr;    IP <-- PC0+1, PC0 <-- addr

op<9> followed by thirteen 1 bits
  CALL IP;       PCn <-- PCn-1, PC1 <-- PC0+1, PC0 <-- IP
  CALLS IP;      IP <-- PC0+1, PC0 <-- IP
  RET;           PCn-1 <-- PCn
  RETI;          PCn-1 <-- PCn
  PUSH;          PCn <-- PCn-1, PC1 <-- IP, PC0 <-- PC0+1
  POP;           IP <-- PC1, PCn-1 <-- PCn, PC0 <-- PC0+1

op<3> cc<3> followed by sixteen 1 bits
  JUMP IP;       PC0 <-- IP
  JCC IP;        if cc then PC0 <-- IP
  RETS;          PC0 <-- IP

op<6> alu<4> reg<4> data<8>
  ALU reg, #data;           reg <-- reg alu-op data

op<5> alu<5> reg<4> addr<8>
  ALU reg, addr;            reg <-- reg alu-op data-mem(addr)

op<5> alu<5> reg2<4> reg1<4> regr<4>
  ALU regr, reg1, reg2;     regr <-- reg2 alu-op reg1

op<3> alu<5> reg<4> IX offset<8>
  ALU reg, (IX, offset);    three variants:
                            reg <-- reg alu-op data-mem(IX+offset)
                            reg <-- reg alu-op data-mem(IX), IX <-- IX + offset
                            reg <-- reg alu-op data-mem(IX-offset), IX <-- IX - offset

op<5> alu<5> reg2<4> IX followed by six 1 bits
  ALU reg, (IX, R3);        reg <-- reg alu-op data-mem(IX+R3)

op<16> div<4>
  FREQ div;
  HALT;
  NOP;

ALU operations:
  MOVE    load/store
  CMOVE   conditional move
  SHL     shift left
  SHLC    shift left with carry
  SHR     shift right
  SHRC    shift right with carry
  CPL     complement
  INC     increment
  INCC    increment with carry
  DEC     decrement
  DECC    decrement with carry
  AND     logical and
  OR      logical or
  XOR     logical xor
  ADD     addition
  ADDC    addition with carry
  SUBD    op1 - op2
  SUBDC   op1 - op2 with carry
  SUBS    op2 - op1
  SUBSC   op2 - op1 with carry
  MUL     multiply
  MULA    2's-complement multiply
  MSHL    multiple shift left
  MSHR    multiple shift right
  MSHRA   multiple shift right, 2's complement
  CMP     compare
  CMPA    2's-complement compare
  TSTB    bit test
  SETB    bit set
  CLRB    bit reset
  INVB    bit invert

Registers:
  16-bit (bits 15 to 0): PC; IP (high, low); IX0, IX1, IX2, IX3 (each high, low)
  8-bit (bits 7 to 0): ACC, R0, R1, R2, R3, status
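The instruction formats on this slide are fixed bit fields inside a 22-bit word, so encoding and decoding reduce to shifts and masks. The sketch below packs two of the formats. The field widths come from the slide; the opcode and condition-code values, the helper names and the assumption that fields are laid out from bit 21 downwards in the order listed are illustrative only.

```python
WORD_BITS = 22  # single 22-bit instruction word

def pack(fields):
    """Pack (value, width) pairs into one word, most significant field first."""
    word = 0
    used = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
        used += width
    if used != WORD_BITS:
        raise ValueError(f"fields cover {used} bits, expected {WORD_BITS}")
    return word

def unpack(word, widths):
    """Split a word back into its fields, given the widths MSB first."""
    values = []
    shift = sum(widths)
    for width in widths:
        shift -= width
        values.append((word >> shift) & ((1 << width) - 1))
    return values

# JCC addr: op<3> cc<3> addr<16> (op and cc values are hypothetical)
jcc = pack([(0b001, 3), (0b010, 3), (0x1234, 16)])

# ALU reg, data: op<6> alu<4> reg<4> data<8> (all code values hypothetical)
alu_imm = pack([(0b100000, 6), (0b0011, 4), (0b0001, 4), (0x7F, 8)])
```

Because every format fills the same single word, the fetch stage never needs to know the instruction type to find the next instruction, which is one reason a fixed-width encoding suits a short pipeline.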
23. CoolRISC Pipeline
[Figure: pipeline timing. Branch instructions: fetch & branch in 1 clock cycle. Arithmetic instructions: fetch in 1 clock cycle, then execute, then store result]
- 3-stage pipeline
- no load delay
- no branch delay
24. Branch Instruction executed in one pipeline stage
[Figure: fetch & branch within 1 clock cycle, following the fetch and alu stages of the previous instruction; the branch condition is available]
Critical path:
- ROM precharge
- ROM read
- Branch decode
- Address multiplexer
However, at 20 MHz the clock cycle time is 50 ns,
which is long enough to execute all of this.
With CPI = 1 at 20 MHz, one has 20 MIPS, which is very good.
In 0.18 um, about 100 MHz is possible,
consequently 100 MIPS.
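The figures on this slide follow from two simple relations: the cycle time is the inverse of the clock frequency, and the throughput in MIPS is the frequency in MHz divided by the CPI. A small sketch, with function names chosen here for illustration:

```python
def cycle_time_ns(f_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / f_mhz

def mips(f_mhz, cpi=1.0):
    """Millions of instructions per second at a given clock and CPI."""
    return f_mhz / cpi
```

At 20 MHz this gives a 50 ns cycle and 20 MIPS with CPI = 1; the same clock with a CPI of 4 would deliver only 5 MIPS.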
25. Bypass in the CoolRISC pipeline
[Figure: pipeline stages of a one-word RISC instruction (labelled 18 bit in the figure). Arithmetic instructions: Fetch, Dec, RAM, ALU, Write, with a bypass between ALU and RAM. Branch instructions: Fetch & Branch in one clock, once the condition code is ready]
26. CoolRISC 816
[Figure: CoolRisc 816 core block diagram. First pipeline stage: PC <16> (PC0, PC1, PC 2..9, +1 incrementer, multiplexer), branch unit, CALL to interrupt address, program ROM of max 64 K instructions (RomAddr <16>, RomInstr <22>), instruction registers IR1 <22> and IR2 <22>. Control units for the 2nd and 3rd stages, with gated clocks. Datapath: ABus <8>, BBus <8>, SBus <8>, ALU <8> with CY and Z flags, multiplier (8 MSB, 8 LSB), ACC, REG0 to REG3, RAM index 0 to 3 (H and L), ROM index (H and L), status register. Data memory interface: RAM, ROM (data) and peripherals, max 64K bytes (RamAddr <16>, DataIn <8>, DataOut <8>, ReadNWrite, ChipSelect)]
27. Microphotography of the CoolRISC
• Technology 1 µm, Nov. 1995
• In 0.5 µm, about 3000 MIPS/watt
at 3.0 Volts (with memories)
compared to 100 MIPS/watt for an
Intel C51
• In 0.25 µm, only the core (20’000
MOS):
• TSMC 0.25 µm, 2.5 Volt, 60
MIPS
• Power: 1.05 V, 10 µW per
MHz, 100’000 MIPS/watt
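The 100’000 MIPS/watt figure can be checked from the power per MHz: with CPI = 1, one MHz of clock delivers one MIPS, so 10 µW per MHz corresponds to 1 MIPS per 10 µW. A minimal sketch (the function name is an illustrative choice):

```python
def mips_per_watt(uw_per_mhz, cpi=1):
    """Energy efficiency from power per MHz (in microwatts) and CPI.

    One MHz delivers 1/cpi MIPS and costs uw_per_mhz microwatts,
    and one watt is 1e6 microwatts.
    """
    return 1e6 / (uw_per_mhz * cpi)
```

For the 0.25 µm core at 1.05 V this gives 1e6 / 10 = 100’000 MIPS/watt, matching the slide.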
28. CPI for some microprocessors
For a given routine: shifting out 8-bit data & clock (synchronous)
Microcontroller  Instr. in code  Bits in code  Executed instr.  Executed clocks  CPI
ST62xx           12              152           60               2704             45
COP800           12              120           60               2000             33
8048             8               112           35               1125             32
Z86Cxx           8               168           35               692              20
68HC05           11              160           59               226 *            4 *
PIC16C5x         11              132           59               300              5
Punch            12              216           74               296              4
CoolRisc 81      12              192           74               74               1
CoolRisc 88      10              180           58               58               1
CoolRisc 816     10              220           58               58               1
* referred to the internal E frequency, which is 2 times slower
than the oscillator frequency
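The CPI column of this slide is simply the number of executed clocks divided by the number of executed instructions, rounded to an integer. The sketch below recomputes it from the slide's figures; the dictionary layout and function name are illustrative.

```python
# name: (executed instructions, executed clocks, CPI as published)
SHIFT_OUT_ROUTINE = {
    "ST62xx": (60, 2704, 45),
    "COP800": (60, 2000, 33),
    "8048": (35, 1125, 32),
    "Z86Cxx": (35, 692, 20),
    "68HC05": (59, 226, 4),   # clocks counted at the internal E frequency
    "PIC16C5x": (59, 300, 5),
    "Punch": (74, 296, 4),
    "CoolRisc 81": (74, 74, 1),
    "CoolRisc 88": (58, 58, 1),
    "CoolRisc 816": (58, 58, 1),
}

def cpi(executed_instr, executed_clocks):
    """Average clocks per instruction for one run of the routine."""
    return executed_clocks / executed_instr

computed = {
    name: round(cpi(instr, clocks))
    for name, (instr, clocks, _) in SHIFT_OUT_ROUTINE.items()
}
```

The total clock count, not the instruction count, is what sets the energy per routine, which is why the CoolRisc cores come out ahead even when they execute about as many instructions as the others.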
29. Number of executed instructions and clock cycles
                         CoolRisc 88                         PIC 16C5x
Routine                  Code instr.  Exec. instr.  Clocks   Code instr.  Exec. instr.  Clocks
8-bit multiply linear    30           30            30       35           37            148
8-bit multiply looped    14           56            56       16           71            284
16-bit multiply linear   127          127           128      240          233           932
16-bit multiply looped   31           170           170      33           333           1332
16-bit division linear   194          162           162      243          180           760
16-bit division looped   36           213           213      27           227           1108
30. Wisenet Chip uses the CoolRISC
31. Conclusion about Watch Microprocessors
• Electronic watches had a huge impact on the development of microelectronics in
Switzerland
• One can say the same of the development of microprocessors in
Switzerland.
• This history shows quite well that the first Swiss microcontrollers were
designed for electronic watches before being used for other applications
requiring low power consumption
• What is the largest unused computation power in the world? The answer of
D. Lando, Lucent Technologies: all the electronic watches in the world!
33. Digital design
• CSEM has a long history of designing low-power processors
• CoolRISC, licensed by Semtech, Swatch group, TI, ...
• Watch processors: PUNCH (1993), µPUS, Combo (1982), ...
• Powerful new processors with ultra low power consumption
• 2005: Macgic, a 16/24-bit DSP (4 MAC)
• 2006: icyflex1 , a flexible processor for DSP/control applications
• 2009: icyflex2, a smaller processor for control applications
• 2009: icyflex4, a scalable processor for DSP/control applications
Macgic and icyflex are registered trademarks of CSEM
CSEM DSP/MCU jan 2009 | C. Piguet | Page
34. Digital design – low-power processors
• customizable (in VHDL)
• configurable (at run-time)
• Macgic 16/24-bit DSP
• complex datapath (quad MAC)
• very high parallelism (1.5k cycles for a 256 FFT)
• assembler, debugger
• 170 uW/MHz at 1.0 V in 180 nm
• 150’000 equiv NAND gates, 2.1 mm2 in 180 nm
• icyflex1 flexible 32-bit processor
• includes DSP functions (dual MAC)
• high parallelism (ex: 2.6k cycles for a 256 FFT)
• C compiler (gcc), debugger (gdb),...
• 120 uW/MHz at 1.0 V in 180 nm
• 110’000 equiv NAND gates, 1.6 mm2 in 180 nm
35. Ongoing processor development
• icyflex2 :
• 50% less area than icyflex1 (removed DSP characteristics)
• Higher frequency (longer pipeline)
• Lower power consumption for control type applications
• Optimized for C compiler
• icyflex4 :
• Scalable architecture for much higher throughput
• Higher frequency (longer pipeline)
• Lower power consumption for DSP/control type applications
36. Processor positioning
[Figure: processor positioning, from control to DSP applications, against computing power (1 MUL, 2 MAC, 4 MAC, … 36 MAC): icyflex2, icyflex1, Macgic and icyflex4]
37. Processor roadmap
[Figure: roadmap from 2005 to 2010 showing development (dev), first silicon (1st Si) and first production (1st prod) milestones for Macgic (first production: Abilis), icyflex1, icyflex2 and icyflex4]
38. icyflex processor family overview
Name      MAC (MUL)  C compiler  Processor  Pipeline length  Instruction width  Status
Macgic    4          -           DSP        3                32                 Prod.
icyflex1  2          Yes         DSP/MCU    3                32                 Prod.
icyflex2  (1)        Yes         MCU        5                32                 Dev.
icyflex4  4 + 4*N    Yes         DSP/MCU    5-7              64                 Dev.
39. MACGIC: Mobile TV Chip (DVB-T/H) by Abilis
40. This chip contains three MACGIC cores
• Abilis's goal: to become the world's leading supplier of
semiconductor solutions for multimode digital TV receivers
and broadband wireless connectivity for mobile terminals
myTV
41. World's first single-die DVB demodulator
• Abilis Systems (Kudelski group), Switzerland
• World's first single-die programmable DVB-T/H demodulator, Aug 2007
• Unique Software Defined Radio architecture
• Manufactured by IBM, a world leader in 90 nm RF-CMOS technology
• Ultra-low power DSP technology by CSEM
• Multi-band silicon tuner
• World’s smallest DVB-T/H receiver: 5 x 5 mm
• Performance
• Dynamic Echo Handling for best indoors/mobile reception
• Adaptive demodulation
• Meets MBRAI, exceeds NorDig 1.0.3
• Up to -100 dBm sensitivity (8k, 8MHz, QPSK, ½)
42. World's first single-die DVB demodulator (cont’d)
[Figure: chip block diagram. Signal chain: RF tuner (RF receivers for DVB-T/H and S-band), A/D conversion (reconfigurable AD/DA converters), channel estimation, correction and decoding (programmable OFDM engine with three Macgic DSPs, hardware accelerators and Cordic; DVB-T/H, WiFi, WiMAX, other), link layer (MCU subsystem with RISC core), delivering the encoded MPEG stream to the host]
44. Digital design – icyflex1 architecture
Optimized for minimal power consumption:
• 32-bit instructions
• 3-stage pipeline
• Load/store RISC architecture
• Configurable instructions
45. High level of parallelism with 32-bit instruction words
Parallelism between datapath and load/store:
• Datapath: up to 10 simple operations in parallel (2×MUL, 2×ACC, …)
• Load/store: up to 6 operations in parallel (2 loads of 2 data words in parallel, store, address generation, …)
Total: up to 16 operations executed in parallel in a single 32-bit instruction
46. icyflex instruction set and addressing modes
• Hardware loop and repeat instructions
• Standard: MUL, ADD, MAC, CMP, MAX, AND,….
• SIMD (Single Instruction Multiple Data): ADD2, MUL2, MAC2, …
• e.g. 2 independent fixed-point MAC in parallel
• Instructions/addressing modes to support C compiler
• Configurable instructions
• Addressing modes for DSP type processing and for a C compiler
• A large variety of addressing modes:
• Ranging from the basic addressing modes: indirect, 1, offset, modulo
• To very complex addressing modes (configurable):
– for instance: an <= (an + om + 8 × OFFA ) % mp
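The complex addressing mode quoted above, an <= (an + om + 8 × OFFA) % mp, can be modelled directly. In this sketch the symbol names come from the slide, but their interpretation (an as the current address, om as an offset register, OFFA as an immediate scaled by 8, mp as the modulo size) and the function name are assumptions.

```python
def next_address(an, om, offa, mp):
    """Post-update of an address register with offset and modulo.

    Implements an <= (an + om + 8 * OFFA) % mp as written on the slide.
    """
    if mp <= 0:
        raise ValueError("modulo size must be positive")
    return (an + om + 8 * offa) % mp

# Walking a 16-entry circular buffer one element at a time (OFFA = 0):
addresses = []
an = 14
for _ in range(4):
    addresses.append(an)
    an = next_address(an, om=1, offa=0, mp=16)
```

The wrap-around happens inside the address update, so a filter loop over a circular buffer needs no explicit bounds test.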
47. Performance: benchmarks of the icyflex processor
Algorithm on the icyflex processor Clock Cycles
Sum of a vector of N values ~ N/2
Addition or multiplication of 2 vectors of N values ~ N
Norm/mean/standard deviation/clipping of a vector ~ N/2
Minimum/maximum of a vector ~ N/2
Multiplication of 2 matrices of N×M values ~ (N × M) × (N/2+2)
Matrix transposition ~ N × M × (5/8)
FIR filter/convolution ~ ½ per tap
FIR filter/convolution, complex data ~ 2 per tap
IIR filter (biquad) ~ 2 per tap
Complex FFT of N= 64 values ~ 440
Complex FFT of N=256 values ~ 2.6 k
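The approximate cycle counts in this table can be turned into small estimators for sizing an application. The formulas are those of the table; the function names are illustrative, and the reading of "per tap" as cycles per tap and per output sample is an assumption.

```python
def vector_sum_cycles(n):
    """Sum of a vector of n values: about n/2 cycles."""
    return n / 2

def vector_op_cycles(n):
    """Addition or multiplication of 2 vectors of n values: about n cycles."""
    return n

def matmul_cycles(n, m):
    """Multiplication of 2 matrices of n x m values: (n*m) * (n/2 + 2)."""
    return (n * m) * (n / 2 + 2)

def transpose_cycles(n, m):
    """Matrix transposition: about n * m * 5/8 cycles."""
    return n * m * 5 / 8

def fir_cycles(taps, samples=1, complex_data=False):
    """FIR filter: about 1/2 cycle per tap (2 per tap for complex data)."""
    per_tap = 2 if complex_data else 0.5
    return per_tap * taps * samples
```

For example, a 64-tap real FIR filter costs about 32 cycles per output sample under these assumptions.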
48. Performance: Comparison with other DSPs
Company / Processor    FIR filter (clock cycles per tap)    Complex FFT 256 points (clock cycles)
CSEM / Macgic Audio-I ~1/4 1.5 k
CSEM / icyflex ~1/2 2.6 k
Analog Devices / Blackfin BF531 ~1/2 3.2 k
Texas Instruments / TMS320VC5501 ~1/2 5.5 k
Philips / CoolFlux DSP ~1/2 5.5 k
Analog Devices / ADSP2191M ~1 7.4 k
Motorola / M56F8323 ~1 12 k
MicroChip / dsPIC30 ~1 ~19 k
Texas Instruments / MSP430F14x ~28 ~53 k
CoolRISC 8-bit - ~60 k
MicroChip / PIC18F4220 ~160 3.2 M
49. Processors designs optimized for energy efficiency
Features Starcore Macgic icyflex1 CoolFlux
Bits per Instruction 128-bit 32-bit 32-bit 32-bit
Data Word width 16-bit 24-bit 32-bit 24-bit
Number of MAC 4 4 2 2
Memory Transfer 8 8 4 2
Operations per cycle 32 32 16 8
Number of equivalent NAND gates 600k 150k 115k 45k
Clock cycles for FFT 256 ** 1'614 1'410 * 2’600 * 5’500
Average Power per MHz @ 1V * 350 µW 170 µW *115 µW * 75 µW
Power per MHz @ 1V for FFT * 600 µW 300 µW *200 µW * 130 µW
Normalized energy for FFT @ 1V 2.3 1 1.2 1.7
**single precision *estimated
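The "Normalized energy for FFT" row can be reproduced from the two rows above it: a power per MHz in µW is an energy per clock cycle in pJ, so the FFT energy is the cycle count multiplied by the power per MHz during the FFT. The sketch below uses the table's figures; the dictionary layout and function name are illustrative.

```python
# name: (clock cycles for a 256-point FFT, power per MHz for FFT at 1 V, in uW)
FFT256 = {
    "Starcore": (1614, 600),
    "Macgic": (1410, 300),
    "icyflex1": (2600, 200),
    "CoolFlux": (5500, 130),
}

def normalized_energy(table, reference="Macgic"):
    """FFT energy of each processor relative to the reference processor."""
    # uW/MHz equals pJ per cycle, so cycles * uW/MHz gives energy in pJ
    energy_pj = {name: cycles * power for name, (cycles, power) in table.items()}
    ref = energy_pj[reference]
    return {name: round(e / ref, 1) for name, e in energy_pj.items()}
```

Macgic, for instance, spends 1410 × 300 pJ = 423 nJ on the FFT, and the other columns are expressed relative to that value.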
50. Silicon area of the processor core
Processor               Equiv NAND gates  0.18 µm    0.13 µm    0.09 µm
icyflex 32-bit (*)      110’000           1.75 mm2   0.70 mm2   0.34 mm2
Macgic Audio-I 24-bit   150’000           2.1 mm2    0.85 mm2   0.41 mm2
Silicon area is dominated by memories in most applications,
or by analog / RF blocks in very deep submicron processes.
* using CSEM’s thick-gate standard cell library
51. Software development tools
• GNU C compiler (gcc)
• software implementation of IEEE floating-point standard
• icyflex instruction parallelism supported by latest releases of gcc
• successful pass of whole GNU test suite for all optimization levels
• GNU assembler / linker (binutils)
• BFD / ELF32 object file format
• Binary, SREC, IHEX memory image file formats
• icyflex instruction set simulator (ISS), written in C++
• Phase-accurate, pipelined
• Wrappers to SystemC, VHDL (Modelsim), Matlab/Simulink
• GNU debugger (gdb)
• Mode 1: instruction set simulator of the icyflex core
• Mode 2: On-Chip Debug (OCD) through a JTAG interface
• Eclipse integrated development environment
• CDT C/C++ IDE plug-in
• icyflex plug-in
• Libraries of general and DSP subroutines, optimized for a minimal number of instructions
54. icyflex2 : a trimmed down processor for control apps
[Figure: processor block diagram. Data Move Unit: GP registers r0 to r7; X AGU and Y AGU with px, mx, cx and py, my, cy registers. Data Processing Unit: accumulate datapath & registers, micro-operation datapath & registers, coprocessor registers; 2 ALU, 2 multipliers. Program Sequencing Unit: PC, branch, flag, exception, instruction exec/xfer, HW loop, HW stack. Host and Debug Unit (host side and core side): step, host register access, debug engine, config/status, P, X and Y breakpoints]
59. icyfirst: first integration of icyflex1 in Nov 2006
• This integration targets very low
leakage for applications requiring
limited processing power
• Standard cell library: thick gates for
leakage reduction by 800x
• Measured speed: 400 kHz @ 1.1 V
• Avg dyn power: 140 µW/MHz @ 1.1 V
• Core: 115k eq. gates, 1.75 mm2
• Peripherals: 110k eq. gates, 1.5 mm2
• Memory: 10.7 mm2
• TSMC 180 nm generic CMOS
[Figure: chip photograph; labelled block: SRAM, 2 kiWords of 32 bits]
60. icyfirst: first integration of icyflex1 in Nov 2006 (cont’d)
• icyflex1 core
• 128 KiBytes SRAM
• Clock generator
• Voltage regulator
• POR, watchdog, timers
• Request controller
• DMA and bus controllers
• 2 x 16 bit GPIO
• 2 x I2C
• 2 x SPI
• 2 x I2S
• JTAG controller
icyflex1
61. icycam: a System-on-Chip for vision applications
• icyflex1 runs at up to 50 MHz
• QVGA CMOS pixel array (320 x 240)
• 14 um pixel pitch
• logarithmic encoding of luminance
• close to 7 decades of intra-scene dynamic
range encoded on 10 bits
• graphical coprocessor
• SRAM: 128 KiBytes
• DMA, SPI, PPI, GPIO, UART, SDRAM, JTAG
• Tower Semiconductor, 180 nm, CIS
62. icycom: a System-on-Chip for RF applications
• icyflex1 runs at up to 3.2 MHz
• RF: 865 ~ 915 MHz, FSK (incl. MSK, GFSK), 4FSK, OOK, OQPSK
• TX: 10 dBm
• RX: -105 dBm at 200 kb/s (BER = 10^-3)
• Power management
• Power supplies for external devices
• Low power modes: multiple standby modes
• 10 bit ADC
• SRAM: 64 KiBytes (with MBIST)
• DMA, RTC, Timers, Watchdog, I2C, SPI, I2S, GPIO, UART, JTAG
• TSMC, 180 nm, generic
63. icycom: a System-on-Chip for RF applications (cont’d)
[Figure: icycom chip block diagram: icyflex1, program & data memory, RF, A/D, interfaces, IO and power management (in: 1.0 to 1.8 V or 2.2 to 3.6 V; out: Vin, 2.7 V, 1.2 V to Vin - 0.1 V), with an IO supply for an external component and an external EEPROM]