The problem with this parallel processing of the interleaver is that it requires random access to the memory locations storing the interleaved addresses. However, achieving random access to multiple memory locations in parallel is difficult and inefficient in hardware implementations. It is better to generate the interleaved addresses sequentially rather than requiring parallel random access.
Lecture summary: architectures for baseband signal processing of wireless communications systems
1. Frank Kienle
Architectures for baseband signal processing
of wireless communications systems
2. Communications vs. VLSI constraints

Communications constraints:
§ Minimize transmit power
§ Minimize redundancy (bandwidth)
§ Quality of service guaranteed

Implementation (VLSI) constraints:
§ Minimize processing power
§ Minimize chip area / costs
§ Quality of service guaranteed

[Figure: the desired decoder balances transmit power, redundancy (bandwidth) and quality of service on the communications side against chip area/costs, processing power and throughput/latency on the VLSI side.]
3. Exercise

Current high-end handsets have a power consumption of 1 Watt, e.g. for a WCDMA voice call. The battery has an electric charge of 1400 mAh. The baseband processor operates at 1.5 V, f_cyc = 300 MHz, and has to process 20 GOP/s in active voice call mode.
§ What is the lifetime of the battery assuming an active voice call?
§ What is the average energy per operation?
§ On average, how many operations are processed per cycle?
§ Is it possible to use a vector DSP core, which needs 20 pJ per operation, to process this task?
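A worked sketch of the four questions. The battery voltage is not given on the slide, so a nominal Li-ion cell voltage of 3.7 V is assumed here; the other numbers come straight from the exercise.

```python
# Exercise sketch. ASSUMPTION: nominal Li-ion battery voltage of 3.7 V
# (the slide does not state it).
P_handset = 1.0      # W, total handset power consumption
Q_battery = 1.4      # Ah (1400 mAh)
V_battery = 3.7      # V, assumed nominal cell voltage
f_cyc     = 300e6    # Hz, baseband processor clock
ops_rate  = 20e9     # operations per second (20 GOP/s)

lifetime_h    = Q_battery * V_battery / P_handset   # battery energy / drawn power
energy_per_op = P_handset / ops_rate                # J per operation
ops_per_cycle = ops_rate / f_cyc

# Vector DSP at 20 pJ/op: power it would need for 20 GOP/s
P_dsp = 20e-12 * ops_rate

print(f"battery lifetime : {lifetime_h:.2f} h")       # ~5.2 h
print(f"energy/operation : {energy_per_op*1e12:.0f} pJ")  # 50 pJ
print(f"ops per cycle    : {ops_per_cycle:.1f}")      # ~66.7
print(f"vector DSP power : {P_dsp:.2f} W")            # 0.40 W
```

Note the 0.4 W fits the 1 W budget in terms of power, but delivering ~67 operations per cycle requires a correspondingly wide vector datapath.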
4. Conclusion: Input LLRs

§ The highest priority for hardware design is to fulfill the specification under realistic conditions.
§ Proper input quantization is essential to avoid communications performance degradation.
§ An implementation of an 'optimal' algorithm can lead to entirely different results in a possible hardware realization.
§ Robustness of an algorithm is key for a successful hardware integration.
§ For baseband processing: look for algorithms which are SNR independent!
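The input-quantization point can be sketched as a uniform fixed-point LLR quantizer; the 6-bit width and the clipping range below are illustrative assumptions, not values from the lecture.

```python
def quantize_llr(llr, bits=6, clip=8.0):
    """Uniformly quantize a soft LLR value to a signed fixed-point grid.

    `bits` and `clip` are illustrative assumptions: the range [-clip, clip]
    is mapped onto 2**bits levels. Too coarse a grid (or too small a clip
    range) degrades decoder performance; too fine a grid wastes memory and
    datapath width.
    """
    step = 2.0 * clip / (2 ** bits)    # quantization step size (here 0.25)
    q = round(llr / step)              # nearest grid point
    qmax = 2 ** (bits - 1) - 1         # saturate to the signed range
    q = max(-qmax - 1, min(qmax, q))
    return q * step                    # reconstructed value

print(quantize_llr(3.17))    # -> 3.25 (snapped to the 0.25-wide grid)
print(quantize_llr(100.0))   # -> 7.75 (saturated at the clip level)
```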
5. Maximum Likelihood decoding

The maximum likelihood (ML) estimation has an entire sequence as its result:
x̂ = argmax over all codewords x of P(y | x)
We use a decoding algorithm that solves the maximum likelihood criterion whenever possible:
§ Convolutional codes → solved by the Viterbi algorithm
§ Small block codes → solved by brute force, testing all codewords
However, many codes have too large a code space to solve the ML criterion directly.
Solution: divide-and-conquer methods.
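The brute-force case can be sketched directly: enumerate every codeword and keep the one with maximum correlation to the received soft values (equivalent to minimum Euclidean distance for BPSK over AWGN). The (7,4) Hamming generator matrix below is a standard textbook example, not a code from the slides.

```python
from itertools import product

# ASSUMPTION: a (7,4) Hamming code in systematic form, used only to make the
# brute-force ML search concrete.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]

def encode(info):
    # codeword bit j = (info . column j of G) mod 2
    return [sum(u * g for u, g in zip(info, col)) % 2 for col in zip(*G)]

def ml_decode(soft):
    """soft[i] > 0 favours bit 0, soft[i] < 0 favours bit 1 (LLR convention)."""
    best, best_metric = None, float("-inf")
    for info in product([0, 1], repeat=4):       # test all 2^4 codewords
        cw = encode(list(info))
        # BPSK mapping 0 -> +1, 1 -> -1; correlate with the received values
        metric = sum(l * (1 - 2 * b) for l, b in zip(soft, cw))
        if metric > best_metric:
            best, best_metric = cw, metric
    return best

# Noisy observation of the all-zero codeword, one weakly flipped position:
rx = [0.9, 1.1, -0.4, 0.8, 1.2, 0.7, 1.0]
print(ml_decode(rx))   # -> [0, 0, 0, 0, 0, 0, 0]
```

The search is exponential in the number of information bits, which is exactly why this only works for small block codes.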
6. Symbol-by-symbol MAP

Divide and conquer splits the problem into multiple sub-problems which can be solved independently; the overall solution is approached by an iterative exchange of the sub-solutions.
The best basis for this iterative problem solving is to determine a confidence estimate of each variable (bit):
→ symbol-by-symbol maximum a posteriori (MAP) criterion.
§ Turbo decoders split the overall problem into two parts.
§ LDPC decoders split the overall problem into M parts.
7. Building Blocks

Arithmetic units:
§ Adder, multiplier, MAC, shifter, comparator, ALU, etc.
§ There is a clear order of complexity: avoid e.g. divisions if possible.
Memory blocks for the storage of data:
§ Register files, shift registers, FIFOs, RAMs, ROMs, DRAMs.
§ SRAM: one data word can be accessed per clock cycle.
§ Beware of access conflicts!
Interconnection units:
§ Switches, buses, arbiters, networks-on-chip.
§ The barrel shifter structure is used in LDPC and turbo decoders.
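A barrel shifter cyclically rotates a vector of P values; in hardware it is built from log2(P) multiplexer stages, each conditionally rotating by a power of two. The sketch below models that staged structure in Python, assuming P is a power of two.

```python
def barrel_shift(vec, shift):
    """Cyclically rotate a vector of P values by `shift` positions.

    Models the staged structure of a hardware barrel shifter (as used as the
    switching network in LDPC and turbo decoders): log2(P) multiplexer
    stages, stage s conditionally rotating by 2**s. Assumes P is a power of 2.
    """
    P = len(vec)
    shift %= P
    out = list(vec)
    stage = 0
    while (1 << stage) < P:            # log2(P) mux stages
        if shift & (1 << stage):       # this stage rotates by 2**stage
            k = 1 << stage
            out = out[k:] + out[:k]
        stage += 1
    return out

print(barrel_shift([0, 1, 2, 3, 4, 5, 6, 7], 3))  # -> [3, 4, 5, 6, 7, 0, 1, 2]
```

The staged form is what makes the network cheap: P·log2(P) multiplexers instead of a full P x P crossbar.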
8. Memory Hierarchy

SRAM memory can be generated in nearly any shape (VLSI):
§ A memory block can be composed of multiple smaller memories
§ This changes the area and the average power

Shape (WordDepth x WordWidth) | Avg. power, write operation, all data and address pins switching (uW/MHz) | Area (mm2) | Comment
4096 x 8 | 7.0857 | 0.043824 | single 4096-word block (reference)
4 x (1024 x 8) | 5.229 per active block | 0.01481 single, 0.0592 all | larger, but less average power
1024 x 32 | 12.8716 | 0.039683 | smaller and less average power, if the access pattern makes it possible
9. Memories (SRAM): first summary

1. Often we can trade off area vs. power just by changing the memory hierarchy.
2. However, the application determines the access pattern and thus constrains the memory hierarchy.
Access pattern: the sequence in time and space (address) of reading/writing multiple data.
Example of a 'difficult' access pattern:
§ Read, in one clock cycle, 100 words, each from a different (random) address.
10. Problem to parallelize random interleavers

[Figure: 16 data values A..P stored across 4 memory banks; bank b holds addresses b, b+4, b+8, b+12. With parallel processing (P = 4), one value per bank is read per clock cycle. After interleaving, the values needed within one clock cycle can lie in the same bank.]

Parallel interleaver table (address → interleaved address):
0 → 8, 1 → 1, 2 → 4, 3 → 10
4 → 3, 5 → 9, 6 → 13, 7 → 12
8 → 2, 9 → 6, 10 → 15, 11 → 0
12 → 11, 13 → 7, 14 → 5, 15 → 14
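The conflict can be demonstrated directly from the interleaver table. Assumptions read off the figure: 16 values in 4 memory banks with bank = address mod 4, and parallelism P = 4, i.e. positions 4k..4k+3 are processed in cycle k.

```python
# Why this interleaver cannot be parallelized naively: count the clock cycles
# in which two of the four required addresses fall into the same memory bank.
interleaver = [8, 1, 4, 10, 3, 9, 13, 12, 2, 6, 15, 0, 11, 7, 5, 14]
P = 4  # parallelism, and also the number of banks (bank = address mod 4)

conflict_cycles = 0
for k in range(len(interleaver) // P):
    addrs = interleaver[P * k : P * (k + 1)]   # addresses needed in cycle k
    banks = [a % P for a in addrs]             # bank holding each address
    if len(set(banks)) < P:                    # same bank needed twice
        conflict_cycles += 1
        print(f"cycle {k}: addresses {addrs} -> banks {banks}  CONFLICT")

print(f"{conflict_cycles} of {len(interleaver) // P} cycles have a bank conflict")
```

For this table every single cycle has a conflict (e.g. cycle 0 needs addresses 8, 1, 4, 10, and both 8 and 4 live in bank 0), so a single-ported SRAM per bank cannot deliver the data in one cycle.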
11. Viterbi Algorithm (functional units)

The Viterbi algorithm solves the ML criterion. At each time step, for each state, we:
§ Add the previous state metrics and the corresponding branch metrics
§ Compare the two accumulated metrics
§ Select the survivor and store the decision
[Figure: trellis butterfly between the four states 00, 01, 10, 11 of one time step and the next.]
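The three bullets above are the add-compare-select (ACS) operation. A minimal sketch for one state with two incoming branches, using a minimum-metric convention; the two-input signature is illustrative, not from the slides.

```python
def acs(prev_metrics, branch_metrics):
    """One add-compare-select (ACS) step of the Viterbi algorithm.

    prev_metrics   : path metrics of the two predecessor states
    branch_metrics : metrics of the two branches entering this state
    Returns (survivor path metric, decision bit telling which predecessor won).
    Minimum-metric convention (e.g. accumulated Euclidean distance).
    """
    cand0 = prev_metrics[0] + branch_metrics[0]   # add
    cand1 = prev_metrics[1] + branch_metrics[1]
    if cand0 <= cand1:                            # compare
        return cand0, 0                           # select survivor
    return cand1, 1

metric, decision = acs((5.0, 7.0), (2.0, 1.0))
print(metric, decision)   # -> 7.0 0
```

In hardware one ACS unit per state runs in parallel; the decision bits are stored in the survivor memory for the later traceback.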
13. Low-Density Parity-Check Code

An LDPC code is a linear block code:
§ Defined by a very sparse parity check matrix H
§ x is a codeword if H · x^T = 0 (mod 2)
LDPC codes can be described by a Tanner graph:
§ A variable node is associated to a column of H and represents a single bit of x
§ A check node is associated to a row of H and thus represents a single parity check code
§ Regular LDPC codes have variable and check nodes of constant degree
§ Irregular LDPC codes have nodes of varying degree
[Figure: Tanner graph with N variable nodes (VN) connected to M check nodes (CN); check node degrees up to dc_max, variable node degrees up to dv_max.]
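The codeword condition H · x^T = 0 (mod 2) can be sketched directly. The (7,4) Hamming parity check matrix below is a standard textbook example, not from the slides; real LDPC matrices are far larger and far sparser.

```python
# ASSUMPTION: a (7,4) Hamming parity check matrix, used only to illustrate
# the codeword condition. Each row is one parity check (one check node);
# each column is one bit position (one variable node).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def is_codeword(x):
    # x is a codeword iff every parity check (row of H) sums to 0 mod 2
    return all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)

print(is_codeword([1, 1, 1, 0, 0, 0, 0]))  # True: all three checks are even
print(is_codeword([1, 1, 1, 0, 0, 0, 1]))  # False: the flipped last bit
                                           # breaks all three checks
```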
14. Summary: LDPC codes

LDPC codes are decoded in an iterative manner:
§ Probabilistic messages are exchanged between variable nodes and check nodes
§ The decoding algorithm is an instance of a message passing algorithm
§ For practical receivers a maximum of about 40 iterations is performed
An LDPC decoder can be realized in a fully parallel or in a partially parallel manner:
§ Fully parallel architecture:
§ Each VN and CN is instantiated in hardware; the connectivity is hard-wired
§ Pro: highest possible throughput (e.g. optical fiber)
§ Con: supports only one code; problems due to routing congestion
§ Partially parallel architecture:
§ Only P functional VNs and CNs are instantiated
§ The connectivity is realized by a switching network; the connectivity pattern has to be stored
§ Con: limited throughput
§ Pro: large flexibility (code rate, block length) → required by wireless LDPC decoders
15. Summary: LDPC codes
Fully flexible LDPC decoder
§ Can process any random LDPC code
§ Storing the connectivity pattern can be more costly (area) than the entire rest of the
decoder (message storage, functional units)
Joint Architecture – Code/Algorithm design
§ Define a hardware architecture
§ Design code/algorithm to fit this architecture
16. Communications point of view

[Figure: parity check matrix with rows 0..11 and columns 0..7, composed of 4 x 4 permuted identity submatrices.]

The parity check matrix is composed of:
§ Permuted identity matrices
§ Already proposed by Gallager (1963) as a construction method
§ Allows a compact description, e.g. P = 13 → identity matrix of size 13 x 13
§ Results in quasi-cyclic codes
§ All LDPC codes utilized in standards are composed of permuted identity matrices.
17. Hardware design point of view

[Figure: partially parallel LDPC decoder with P = 4 VN units, four VN RAMs and four CN RAMs connected through a shifting network. The addresses 0..11 stream through the RAM banks; rotated index groups such as (4 5 6 7), (7 4 5 6), (6 7 4 5) illustrate different shift values.]

LDPC decoder features:
§ Permuted identity matrices result in simple shifting networks
§ The size of the identity matrix directly gives a possible hardware parallelization P
§ The entire connectivity pattern is defined by just two vectors:
§ shift vector
§ address vector
§ one entry exists for each clock cycle
§ Very regular control flow: P data are always handled identically
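The two-vector description can be sketched as a decoding schedule: in each clock cycle, P messages are read in parallel (one per VN RAM bank, all at the same address) and rotated by that cycle's shift value before entering the check nodes. P, the two vectors and the RAM contents below are illustrative assumptions.

```python
# Sketch: shift vector + address vector define the entire connectivity of a
# quasi-cyclic LDPC decoder. All values below are made-up illustrations.
P = 4
shift_vector   = [0, 1, 3, 2]    # cyclic shift applied in each clock cycle
address_vector = [0, 1, 2, 0]    # RAM address read in each clock cycle

# vn_ram[b][a] = message at address a of VN RAM bank b
vn_ram = [[f"b{b}a{a}" for a in range(3)] for b in range(P)]

schedule = []
for s, a in zip(shift_vector, address_vector):
    words = [vn_ram[b][a] for b in range(P)]   # P parallel reads, same address
    schedule.append(words[s:] + words[:s])     # barrel shifter: rotate by s

for cycle, rotated in enumerate(schedule):
    print(f"cycle {cycle}: {rotated}")
```

Because every bank is read at the same address and all P messages pass through one rotation, the control flow is completely regular: no per-message addressing, no conflicts.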
18. Turbo Codes

Turbo codes (1993):
§ Clever parallel concatenation of two convolutional codes, achieving performance within 0.5 dB of capacity
§ Defined from the encoder point of view
Parallel turbo codes are composed of:
§ Component encoders (recursive systematic convolutional (RSC) codes)
§ An interleaver
§ A puncturing unit (not shown here)
High level complexity comparison, TC vs. CC:
§ CC: Lc = 9 ⇒ 256 states
§ Turbo code: 2 CCs with Lc = 4 ⇒ 2 x 8 states
§ trellis state reduction by a factor of 16
§ repeated turbo decoding with 8 iterations:
⇒ overall state reduction by a factor of 2 and 3 dB coding gain
19. Summary: Interleaver

For the interleaver hardware realization we need:
§ an interleaver table (e.g. SRAM based),
§ or an interleaver generator, which delivers the corresponding indices.
LTE interleaver realization:
§ Dedicated interleaver generator to calculate the quadratic permutation polynomial Π(i) = (f1·i + f2·i²) mod K
§ The interleaver pattern is conflict free for a parallel realization
UMTS interleaver realization:
§ Difficult control flow to realize a dedicated interleaver generator
§ Typically SRAMs are instantiated to store the interleaver indices
§ However, the SRAM has to be filled with the corresponding indices depending on the current block length
§ The indices are calculated by e.g. an ARM processor
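The LTE generator can be sketched in a few lines. The quadratic permutation polynomial (QPP) parameters f1 = 3, f2 = 10 are the standardized values for block length K = 40 (3GPP TS 36.212); other block lengths use other table entries from the standard.

```python
# LTE QPP interleaver generator: pi(i) = (f1*i + f2*i^2) mod K.
# Parameters for K = 40 taken from the 3GPP TS 36.212 interleaver table.
K, f1, f2 = 40, 3, 10

def qpp(i):
    return (f1 * i + f2 * i * i) % K

pi = [qpp(i) for i in range(K)]
assert sorted(pi) == list(range(K))   # a valid interleaver is a permutation
print(pi[:8])   # -> [0, 13, 6, 19, 12, 25, 18, 31]
```

Because the index can be computed on the fly, no SRAM table is needed, and the algebraic structure is what makes the pattern conflict free for parallel MAP decoders.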
20. Iterative Decoding Procedure

[Figure: turbo encoder (Encoder 1 and Encoder 2 producing Systematic, Parity 1 and Parity 2 streams from the Input) and the corresponding iterative decoder: symbol-by-symbol maximum a posteriori Decoder 1 (Systematic + Parity 1) and Decoder 2 (Interleaved Systematic + Parity 2), coupled through interleaver/deinterleaver blocks with subtraction of each decoder's own a priori input.]
21. Iterative Decoding Procedure

Concatenated codes have been known since 1966 (Forney).
The new innovation of 1993: subtraction (ignoring) of one's own old information
→ the EXTRINSIC INFORMATION PRINCIPLE
22. Iterative Decoding Procedure

[Figure: the iterative decoder annotated. Each symbol-by-symbol maximum a posteriori decoder produces a MAP result from the systematic LLR input value plus its parity stream; subtracting the input value leaves the additional gain, the extrinsic information. The extrinsic information gain from decoder 1 is interleaved and used as a priori information for decoder 2.]
23. Max-Log-MAP algorithm

1. Branch metric calculation: γ_k(s′, s) for every trellis transition s′ → s
2. Forward state metrics α:
α_k(s) = max over s′ of ( α_{k−1}(s′) + γ_k(s′, s) ),
computed recursively over k ∈ {1..blocksize−1} for all states
3. Backward state metrics β:
β_{k−1}(s′) = max over s of ( β_k(s) + γ_k(s′, s) ),
computed recursively over k ∈ {blocksize−1..1} for all states
4. Soft-output calculation:
Λ(u_k) = max over transitions (s′, s) with u_k = 1 of ( α_{k−1}(s′) + γ_k(s′, s) + β_k(s) )
       − max over transitions (s′, s) with u_k = 0 of ( α_{k−1}(s′) + γ_k(s′, s) + β_k(s) )
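The forward recursion (step 2) can be sketched on an illustrative 2-state trellis; the transition structure and branch metric values below are made-up numbers, not from the slides.

```python
import math

NEG = -math.inf

def forward(trellis, n_states, alpha0):
    """Max-Log-MAP forward recursion:
    alpha_k(s) = max over s' of ( alpha_{k-1}(s') + gamma_k(s', s) ).

    trellis[k][s] = list of (s_prev, gamma) pairs for branches entering
    state s at step k. Returns all alpha vectors, one per time index.
    """
    alphas = [alpha0]
    for step in trellis:
        prev = alphas[-1]
        nxt = [max((prev[sp] + g for sp, g in step[s]), default=NEG)
               for s in range(n_states)]
        alphas.append(nxt)
    return alphas

# Two time steps of an illustrative 2-state trellis:
trellis = [
    [[(0, 1.0), (1, -0.5)], [(0, -1.0), (1, 0.5)]],   # k = 1
    [[(0, 0.2), (1, 0.3)],  [(0, 0.7), (1, -0.2)]],   # k = 2
]
alphas = forward(trellis, 2, [0.0, NEG])   # trellis starts in state 0
print(alphas[-1])   # -> [1.2, 1.7]
```

The backward recursion (step 3) is the mirror image over the reversed time axis, which is why a serial MAP unit needs a state metric memory to hold one of the two recursions.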
24. MAP decoding: one state per clock cycle step

[Figure: MAP decoder datapath with a trellis between the states 00, 01, 10, 11. A branch metric unit reads the info LLR and parity LLR from memories storing the channel LLR values; adders combine the old state metrics with the branch metrics, and the results are written to the storage for the result states. The MAP algorithm additionally needs a state metric memory, e.g. 12 bit per state and time step.]
25. Data path, serial MAP

[Figure: serial MAP data path with Memory 1 (input values) and Memory 2 (intermediate results); different functions operate on vectors.]
27. Where is the problem of this parallel processing?

[Figure and interleaver table repeated from slide 10: four memory banks are processed in parallel, but after interleaving, the addresses needed within one clock cycle (e.g. 8, 1, 4, 10 for positions 0..3) map to the same bank more than once, so the parallel random access causes memory conflicts.]