© 2016 Synopsys, Inc. All rights reserved.
May 9, 2016
Case study:
Performance-efficient Implementation of
Robust Header Compression (ROHC)
using an Application-Specific Processor
Gert Goossens, Patrick Verbist,
Erik Brockmeyer, Luc De Coster
Synopsys
Agenda
1. Robust Header Compression (ROHC) in network processing
2. Application-Specific Processor (ASIP) methodology
3. Accelerating control processing in ROHC
4. Accelerating data processing in ROHC
5. Conclusions
ROHC in Network Processing
ROHC compressor
• 1.2 Mpackets/s
• 600 MHz clock → 500 cycles/packet (600 MHz / 1.2 Mpackets/s)
− Header Parser: ~100 cycles/packet
− Encoder+Context+CRC: ~400 cycles/packet
• Optimize for worst-case control path
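The cycle budget above follows from dividing the clock rate by the packet rate. A minimal sketch of that arithmetic, with hypothetical helper names that are not part of the original slides:

```c
#include <stdint.h>

/* Cycles available per packet for a given clock and packet rate.
 * Hypothetical helper, used only to illustrate the budget on the slide. */
static inline uint32_t cycles_per_packet(uint64_t clock_hz, uint64_t packets_per_s)
{
    return (uint32_t)(clock_hz / packets_per_s);
}

/* Cycles left for the encoder, context and CRC stages once the
 * header parser has taken its share of the per-packet budget. */
static inline uint32_t remaining_budget(uint32_t total_cycles, uint32_t parser_cycles)
{
    return total_cycles - parser_cycles;
}
```

With a 600 MHz clock and 1.2 Mpackets/s, `cycles_per_packet` gives 500, and subtracting the ~100 parser cycles leaves the ~400 cycles quoted for Encoder+Context+CRC. Because the budget is a hard per-packet limit, it is the worst-case control path that has to fit, not the average.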
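High Performance Streaming Data (IP/UDP/RTP Protocol)
[Figure: packet before and after header compression]
• Uncompressed packet: IP Header (20-40 bytes) | UDP Hdr (8 bytes) | RTP Header (12 bytes) | Payload (Video/Audio…)
• Compressed packet: ROHC Header | Payload (Video/Audio…)
• Link: ROHC Compressor → Radio or Cable Link → ROHC Decompressor
[Figure: ROHC compressor block diagram]
• Blocks: Header Parser, Header Field Encoder, Packet Modification Buffer, Feedback Buffer, Context Processor, CRC, Context Mem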
ROHC Implementation
[Figure: ROHC compressor block diagram, color-coded by processing type: Header Parser, Header Field Encoder, Packet Modification Buffer, Feedback Buffer, Context Processor, CRC, Context Mem]
█ Blocks requiring efficient control-flow
→ Tiny microprocessor with efficient branching and logic operations
█ Blocks requiring efficient control-flow and data processing
→ Tiny microprocessor with hardware-accelerated instructions
ASIP technology enables the design of such processors
ASIPs in SoC Design
ASIP architectural optimization space
• Parallelism
  – Instruction-level parallelism: orthogonal instruction set (VLIW), encoded instruction set
  – Data-level parallelism: vector processing (SIMD)
  – Task-level parallelism: multi-core, multi-threading
• Specialization
  – Application-specific data types: integer, fractional, floating-point, bits, complex, vector…
  – Application-specific instructions
    · App.-spec. data processing: any exotic operator; single or multi-cycle
    · App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
    · App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication; relative or absolute, address range, delay slots
  – Connectivity & storage matching application’s data-flow: distributed regs, sub-ranges; multiple mem’s, sub-ranges
  – Pipeline: pipeline depth; hazards (HW/SW stall, bypass)
• Processor spectrum: Microprocessor → Extensible Processor → Application-Specific uP / DSP → Programmable Datapath → Hardwired Datapath
“ASIP Designer” Tool-Suite
Accelerated Control Processing
Customization of a 16-bit CPU: “Strip Down & Beef Up”
• Architectural exploration with ASIP Designer
• Starting point: “Tmicro” CPU
  – 16-bit gen.-purpose CPU (already leaner than 32-bit)
  – Variable-length instructions (widths in bits): arithmetic (16), move (16, 32), load/store (16, 32), control (16, 32, 48)
• End point: “Tnano” ASIP
  – 16-bit stripped CPU
  – Fixed-length instructions: arithmetic, move, load/store, control (16)
    · No multi-word decoding overhead
    · Improved clock frequency
  – Add compact control instructions to accelerate ROHC code
    · Predicated execution (Selection)
    · Field extraction (Masking)
    · Shortcut logic instructions
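Field extraction (masking) is the kind of operation a header parser performs on almost every word it reads. The sketch below shows the plain C form that a single field-extraction instruction could replace; `extract_field` and its parameters are hypothetical names, and the actual Tnano instruction encoding is not given in the slides.

```c
#include <stdint.h>

/* Extract a bit field of `width` bits starting at bit `lsb` from a
 * 16-bit word. On a general-purpose CPU this costs a shift plus a mask;
 * a dedicated field-extraction instruction could do it in one step.
 * Assumes 1 <= width <= 16 and lsb + width <= 16. */
static inline uint16_t extract_field(uint16_t word, unsigned lsb, unsigned width)
{
    uint16_t mask = (uint16_t)((1UL << width) - 1UL);
    return (uint16_t)((word >> lsb) & mask);
}
```

For example, if the first 16-bit word of an IPv4 header is `0x4500`, then `extract_field(word, 12, 4)` yields the version (4) and `extract_field(word, 8, 4)` yields the header length in 32-bit words (5).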
Accelerated Control Processing
Control Path Balancing
• Example: Control-Flow Graph of Header Parser
[Figure: control-flow graph with the longest and shortest control paths highlighted]
• Improve control path balancing by
  – Refactoring the C source code
  – User control over code hoisting
  – Predicated execution in the tail of long control paths
Accelerated Control Processing
If-Else, No Predication: Tmicro (gen.-purp. CPU)
• nML: Conditional jump instruction, 2-cycle branch penalty
• C: Condition at tail of long control path
• Machine code: Conditional jump with branch penalty; one of two delay slots filled, one ‘nop’ left
Accelerated Control Processing
Predication: Tnano (optimized ASIP)
• nML: Select instruction
• C: Condition at tail of long control path
• Machine code:
  – Conditional code always executes
  – Result is used selectively
  → No branch penalty
• nML: Predication threshold
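The idea behind predication can be sketched in portable C: both candidate results are computed unconditionally, and the condition only selects which one is kept. The function names below are hypothetical, and the mask trick merely emulates what a hardware select instruction would do; how the compiler actually maps C onto the Tnano select instruction is not shown in the slides.

```c
#include <stdint.h>

/* Branching form: on a pipelined CPU the conditional jump may cost
 * extra cycles (a 2-cycle branch penalty on the Tmicro example). */
static uint16_t pick_branch(int cond, uint16_t if_true, uint16_t if_false)
{
    if (cond) {
        return if_true;
    }
    return if_false;
}

/* Predicated form: no jump. Both values are already available and the
 * condition is turned into an all-ones or all-zeros mask that selects
 * one of them. */
static uint16_t pick_select(int cond, uint16_t if_true, uint16_t if_false)
{
    uint16_t mask = (uint16_t)-(uint16_t)(cond != 0);   /* 0xFFFF or 0x0000 */
    return (uint16_t)((if_true & mask) | (if_false & (uint16_t)~mask));
}
```

Predication only pays off when the conditional code is short, since it always executes; that trade-off is presumably what the predication threshold on the slide controls.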
Accelerated Control Processing
If-Else with Multiple Tests: Tmicro (gen.-purp. CPU)
• nML: Stand-alone compare instruction
• C: “If-else” with multiple tests
• Machine code: Multiple compare and c-jump instructions; slow in the worst case
Accelerated Control Processing
If-Else with Multiple Tests: Tnano (optimized ASIP)
• nML: “Compare + shortcut-logic” instruction
  – CND &= Rj==Ri
  – CND |= Rj!=Ri
• C: “If-else” with multiple tests
• Machine code:
  – Multiple “compare + shortcut-logic” instructions
  – Single c-jump
Worst case is always faster!
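The difference between the two machine-code shapes can be illustrated in C. In the first function every test has its own conditional exit, which mirrors a compare plus conditional jump per test; in the second, each comparison is folded into one condition flag (the `CND &= Rj==Ri` pattern) and only a single decision remains at the end. The function names and the three-word comparison are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* One compare and one conditional jump per test: the worst case
 * (all fields match) pays for every jump. */
static bool fields_match_branchy(const uint16_t *hdr, const uint16_t *ctx)
{
    if (hdr[0] != ctx[0]) return false;
    if (hdr[1] != ctx[1]) return false;
    if (hdr[2] != ctx[2]) return false;
    return true;
}

/* Compare + shortcut logic: each comparison is accumulated into a
 * single condition, so only one conditional jump is needed afterwards. */
static bool fields_match_accumulated(const uint16_t *hdr, const uint16_t *ctx)
{
    bool cnd = (hdr[0] == ctx[0]);
    cnd &= (hdr[1] == ctx[1]);   /* CND &= Rj==Ri */
    cnd &= (hdr[2] == ctx[2]);   /* CND &= Rj==Ri */
    return cnd;
}
```

The accumulated form always evaluates all tests, so its execution time is the same on every path; that is acceptable here because the design target is the worst-case path rather than the average.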
Accelerated Control Processing
Results – Header Parser
                                    Tmicro CPU     Tnano ASIP
Rohc_parse program code size        347 x 16-bit   227 x 16-bit (-35%)
Rohc_parse cycle count per packet   191            87 (-55%)
Clock frequency (28nm HPM)          800 MHz        1 GHz (+25%)
Gate count (core only, 28nm HPM)    14K gates      5.4K gates (-61%)
Accelerated Data Processing
• Implementation styles
  – Software on processor: too slow?
  – Hardware co-processors: (manual) design effort, synchronization challenge?
  – Hardware-accelerated instructions in ASIP instruction set: well supported by tools, potential for resource sharing!
[Figure: ROHC compressor block diagram: Header Parser, Header Field Encoder, Packet Modification Buffer, Feedback Buffer, Context Processor, CRC, Context Mem]
• Data-processing functions: CRC, WLSB encoder, Scaled / Timer-Based RTP Timestamp Compression, …
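To give a sense of why such functions are candidates for hardware-accelerated instructions, here is a bit-serial software sketch of the 8-bit ROHC CRC. It assumes the polynomial 1 + x + x^2 + x^8 with an all-ones initial value, processed least-significant bit first (the variant commonly catalogued as CRC-8/ROHC); the function name is hypothetical and the slides do not show the implementation actually used.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-serial CRC-8 as commonly specified for ROHC:
 * polynomial 1 + x + x^2 + x^8 (0xE0 in reflected form), initial value 0xFF.
 * Eight shift/xor steps per input byte in software; in hardware the same
 * update is a small XOR network that fits in a single instruction. */
static uint8_t crc8_rohc(const uint8_t *data, size_t len)
{
    uint8_t crc = 0xFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (uint8_t)((crc >> 1) ^ 0xE0u);
            } else {
                crc = (uint8_t)(crc >> 1);
            }
        }
    }
    return crc;
}
```

The inner loop is where a general-purpose ALU spends its cycles: every header byte costs eight dependent shift-and-xor steps, which is a poor fit for a budget of a few hundred cycles per packet.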
Accelerated Data Processing
WLSB Encoder: SW Implementation, Tmicro (gen.-purp. CPU)
• nML: General-purpose ALU: add, sub, shift, mask…
• C: Software implementation of WLSB encoder: for-loop with called function
• Machine code:
  – 30 instructions for called function
  – 6-packet test program: 2110 cycles
Accelerated Data Processing
WLSB Encoder: HW-Accelerated Instruction, Tnano (optimized ASIP)
• nML (ISA view): WLSB encoder instruction, calling hardware primitive
• nML (behavioral view):
  – WLSB hardware primitive in bit-accurate C code
  – Auto-translated to RTL
• C: Intrinsic function call to WLSB encoder instruction
• Machine code:
  – Called function replaced by single instruction
  – 6-packet test program: 267 cycles (7.9x speedup)
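The slides do not reproduce the WLSB code itself. As a rough illustration of the "for-loop with called function" structure, the sketch below follows the window-based LSB encoding described in RFC 3095 (section 4.5): for each reference value in the window, find the smallest number of bits k such that the new value lies in the interpretation interval [v_ref - p, v_ref - p + 2^k - 1], then take the maximum over the window. The function names, the 16-bit field width and the window layout are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest k (0..16) such that v lies in the interval
 * [v_ref - p, v_ref - p + 2^k - 1], computed modulo 2^16. */
static unsigned lsb_min_k(uint16_t v_ref, uint16_t v, int16_t p)
{
    uint16_t offset = (uint16_t)(v - (uint16_t)(v_ref - p));
    for (unsigned k = 0; k < 16; k++) {
        if (offset <= (uint16_t)((1UL << k) - 1UL)) {
            return k;
        }
    }
    return 16;
}

/* Window-based LSB: the value must be decodable against every
 * reference the decompressor might hold, so take the largest k.
 * Assumes n >= 1; with an empty window this sketch returns 0. */
static unsigned wlsb_min_k(const uint16_t *window, size_t n, uint16_t v, int16_t p)
{
    unsigned k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned ki = lsb_min_k(window[i], v, p);   /* the "called function" */
        if (ki > k) {
            k = ki;
        }
    }
    return k;
}
```

In software, each call to `lsb_min_k` is a short loop of subtract, shift and compare operations. In a hardware primitive the search over k reduces to a priority encoder on `offset`, which is the kind of logic that fits in one instruction.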
Accelerated Data Processing
Results: Adding HW-Accelerated Instructions
                                         Tmicro CPU     Tnano ASIP     Tnano ASIP w/ WLSB instr
WLSB 6-packet test program code size     134 x 16-bit   126 x 16-bit   84 x 16-bit (-33%)
WLSB 6-packet test program cycle count   2122           2110           267 (-87%)
Clock frequency (28nm HPM)               800 MHz        1 GHz          1 GHz (0%)
Gate count (core only, 28nm HPM)         14K gates      5.4K gates     6.3K gates (+16%)
Conclusions
• Application-Specific Processors (ASIPs)
  – Enable acceleration of control and data processing, similar to fixed-function hardware
  – Retain the flexibility of a software-programmable processor
• ASIP Designer enables quick ASIP design
  – Architectural exploration: Compiler-in-the-Loop
  – SDK generation
  – RTL generation
• Benefits illustrated with the Robust Header Compression (ROHC) case study

Gert Goossens, Sen. Director, ASIP Tools, Synopsys
