2017 09-ohkawa-MCSoC2017-presen

DESIGNING EFFICIENT PARALLEL
PROCESSING
IN 3D STANDARD-CHIP STACKING
SYSTEM WITH STANDARD BUS
Takeshi Ohkawa* Kanemitsu Ootsu* Takashi Yokota*
Katsuya Kikuchi** Masahiro Aoyagi**
* Dept. of Information Systems Science,
Graduate School of Engineering, Utsunomiya University
** 3D Integration System Group,
Nano-electronics Research Institute, National Institute of
Advanced Industrial Science and Technology (AIST)
2017/9/18 MCSoC2017@Korea Univ., Seoul

Outline
• 1. Introduction
• 2. 3D-SCSS (Standard Chip Stacked System)
Design Method
• Model of target 3D-SCSS system and TSV technology
• Model-driven parallel system design flow
• 3. Design Example
• Overview of image recognition (ORB feature extraction)
• Parallel processing for Scale Pyramid Generation Process
• Discussion on Processing Performance
• Estimation of Power Consumption
• 4. Conclusion

Research Background
• Requirement: High-performance at low energy
• Smartphones, tablets, information home appliances, IoT (Internet
of Things), M2M (Machine to Machine), automotive
• e.g., image recognition (local features, DNN)
• General tradeoff: flexibility and energy efficiency
• Energy Efficiency
• ASIC : DSP/FPGA : SW = 1000 : 10 : 1
• Flexibility
• State-of-the-art algorithm changes fast!
• Heterogeneous integration of general/special LSIs
→ High energy-efficiency, High-performance

Heterogeneous Integration of LSIs
• Conventional
• Electronic Circuit Board integration
• TSV(Through Silicon Via) technology
• extends the limit of horizontal integration
• opens holes (Via) through silicon chips to connect
electric signals and power supply vertically to the back
of the chips
• Related technologies
• Interposer (a chip dedicated for inter-chip wiring,
electrical [7] or optical [8])
• wireless (vertical) communications between chips [9]
[7] Kurita, Yoichiro, et al. "A novel 'SMAFTI' package for inter-chip wide-band data transfer." Proceedings of the 56th Electronic Components and Technology Conference, IEEE, 2006.
[8] Arakawa, Yasuhiko, et al. "Silicon photonics for next generation system integration platform." IEEE Communications Magazine 51.3, pp. 72-77, 2013.
[9] Miura, Noriyuki, et al. "Analysis and design of inductive coupling and transceiver circuit for inductive inter-chip wireless superconnect." IEEE Journal of Solid-State Circuits 40.4, pp. 829-837, 2005.

Research Objectives
• Proposal of 3D-SCSS with Standard BUS
(Three-Dimensional Standard-Chip Stacked System)
• To improve design productivity to satisfy performance and
energy requirement by exploiting parallelism of algorithm
• Study of an example case design
• Mapping parallel processing of image recognition
• Evaluate the performance and energy efficiency

Our Proposal: 3D-SCSS with Standard BUS
(Three-Dimensional Standard-Chip Stacked System)
• Heterogeneous 3D Integration of standard and special-
purpose LSI chips for performance/energy
• Define “Standard BUS” for Stock and Reuse
• By defining a Standard Socket (like PCI on a PC
motherboard), a desired system can be built up just by
stacking the stocked chips.
• Physical size, layout
• Electrical voltage and impedance
• Layered communication protocols:
datalink, network, transport, application…
• Our GOAL
• To improve design productivity to satisfy performance and
energy requirement at application level

3D Interconnect for Multi Core Internal Bus
■ 2D Interconnect (horizontal): Multi Core System LSI
[Figure: 2D system with RF/Analog, Memory, Logic and I/O blocks on a 64 or 128-bit On-Chip Bus]
- Long wiring for bus communication
- Limitation of signal line number
- Large size bus-driver circuits
- Many repeater buffer circuits
- Difficulty of integration of heterogeneous chips
■ 3D Interconnect (vertical): Heterogeneous Multi LSI Chip Stack System
[Figure: 3D system with a TSVs/micro-bumps array forming a 1600-bit Wide Interchip Bus; 2017 update: 20,000-bit]
- Short wiring for bus communication
- Architecture of wide Interchip bus
- Smaller size of bus interface circuits (low capacitance TSVs)
TSV: Through Si Via
3D Interchip Communication with High Data Transfer Rate between Heterogeneous Multi Chips
Presented in 3D Test Workshop, Anaheim CA, USA, September 13, 2013

Wide Bus Chip-to-Chip Interconnection to Realize
Scalable Stacking for Multi LSI Chip Stack System
[Figure: scalable stacking for a heterogeneous multi LSI chip stack system on a Si Interposer and Package Substrate]
COOL Interconnect: Wide Bus Chip-to-Chip Interconnection
• Standard Interface Circuits
• TSV (10μmΦ, 50μmD)
• Micro-bump (50μm)
- Chip level stack process for chip stacking : Low internal stress bonding
- High density of 3D interconnect: fine-TSVs/micro-bumps, fine-pitch

Wide Bus Communication
Test LSI Device
Ultra Wide Bus Interface Circuits Block
Occupied Area : 2.16mm-square
Power, GND： 400 Al pads
Outer Connection
Wiring Bonding Pads
Al Pad Array Area: 2mm-square
40x40(=1600), 20μm-sq., 50μm-pitch
Chip Size 8.3 mm x 6.0 mm
Clock Freq. 50 MHz
Power Voltage 2.5 V
Bus Signal Number 1600
Bus Data Rate 6.4 GB/s (@ 1024 bit, 50 MHz)
Bus Occupied Area 4.67 mm^2
TSV/Bump Area 4 mm^2 (86 %)
Driver Circuits Area 0.67 mm^2 (14 %)
0.25μm-CMOS Technology

Summary of 3D-SCSS standard BUS [10]
Parameter: Value
Physical size: 2 mm x 2 mm
BUS location: Center of the chip
Number of TSVs and bumps [TSVs for data signal]: 1600 (40x40) [1024]
Signal Frequency: 50 MHz
Communication Capacity: 51.2 Gbps
Power consumption (Flip-chip result): 97 mW* @ 50% toggle rate
* Only the I/O power is measured separately.
[10] Aoyagi, M.; Imura, F.; Nemoto, S.; Watanabe, N.; Kato, F.; Kikuchi, K.; Nakagawa, H.; Hagimoto, M.; Uchida, H.; Matsumoto, Y., "Wide bus chip-to-chip interconnection technology using fine pitch bump joint array for 3D LSI chip stacking," 2nd IEEE CPMT Symposium Japan, Kyoto, pp. 1-4, 2012.
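The communication capacity in the table can be cross-checked from the number of data TSVs and the signal frequency. The sketch below assumes one bit per data TSV per clock cycle, which the slide does not state explicitly; the function name is illustrative only.

```python
def bus_capacity_bps(data_lines: int, freq_hz: float) -> float:
    """Raw bus capacity, assuming 1 bit per data line per clock cycle."""
    return data_lines * freq_hz

# 1024 data TSVs at 50 MHz, as listed in the summary table
capacity = bus_capacity_bps(1024, 50e6)
gbps = capacity / 1e9          # gigabits per second
gbytes_per_s = capacity / 8e9  # gigabytes per second
print(f"{gbps:.1f} Gbps = {gbytes_per_s:.1f} GB/s")
```

Under that assumption the result agrees with both the 51.2 Gbps figure here and the 6.4 GB/s figure given for the test LSI device.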

Design method of 3D-SCSS using
standard BUS
• Conventional: There is no standard access method
between vertically stacked chips.
• Intra-chip: CPU/Memory BUS (e.g., ARM AXI4)
• Inter-chip: High-speed serial, Ethernet, …
• Consider the future scalability/flexibility!
• Each chip will itself be a sufficiently complex system.
• A loosely-coupled architecture is needed.

Mapping KPN[16] on 3D-SCSS
• KPN[16]: Kahn Process Network (Process and FIFO model)
• Mapping
• A process onto a processor element on a chip
• A buffer onto a memory element on a chip
• Application layer
• Process to process data exchange through a buffer
• Control process to reduce the KPN[16] complexity
[Figure: processes exchanging data through FIFO buffers, coordinated by a control process]
[16] G. Kahn, “The semantics of a simple language for parallel programming,” Proc. of the IFIP
Congress 74. North-Holland Publishing Co., pp. 471-475, 1974
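The process-and-FIFO model can be sketched in a few lines of software. This is a minimal illustration, not the authors' implementation: each thread stands in for a processor element on a chip, each `queue.Queue` stands in for a buffer in a memory element, and the `DONE` marker and `run_pipeline` helper are assumptions introduced only to make the example terminate.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (illustrative, not part of the KPN model)

def process(fn, inbox, outbox):
    """A KPN process: blocking read from one FIFO, compute, write to the next."""
    while True:
        token = inbox.get()        # blocks until a token is available
        if token is DONE:
            outbox.put(DONE)       # pass the marker on and stop
            return
        outbox.put(fn(token))

def run_pipeline(stages, tokens):
    """Connect the stages in a chain of unbounded FIFOs and run them."""
    fifos = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=process, args=(fn, fifos[i], fifos[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for tok in tokens:
        fifos[0].put(tok)
    fifos[0].put(DONE)
    results = []
    while True:
        out = fifos[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

results = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
print(results)  # [4, 6, 8]
```

Because each process only blocks on reads and every FIFO preserves order, the output does not depend on how the threads are scheduled, which is the property that makes the model attractive for loosely-coupled chips.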

Example Process Network for Image
Recognition (ORB Feature Extraction)
• Image recognition using ORB Feature Descriptor
• Process1: Preprocessing of the input image
• Process2: Scaled Image Generation
• Process3: Key-point extraction
• Process4: Feature description
• Process5: Matching/ Machine Learning
• Process6: Output of the Recognition result
[Figure: process network P1 → P2 → P3 → P4 → P5 → P6, supervised by a Controller. Data on the edges: Input image → P1 → Preprocessed image → P2 → Resized images → P3 → Image + Key-point information → P4 → Feature Descriptor → P5 → Matching / Learning Result → P6]

Sub Process Network for Scaled
Images Generation Process
(a) Iterative Resize
[Figure: sub-processes (a), (b), ..., (g) in a chain; each scales the output of the previous one by 1/1.2, giving images at 1, 1/1.2, 1/1.44, ..., 1/2.99, 1/3.58]
(b) Independent Resize
[Figure: sub-processes (A), (B), ..., (G) in parallel; each scales the original image (1) directly, by 1/1.2, 1/1.44, ..., 1/3.58]
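The two sub-process networks produce the same set of scale factors by different routes. The following sketch assumes seven pyramid levels with a ratio of 1.2 per level, as read from the labels (a) to (g) and 1/1.2 to 1/3.58 in the figure; the function names are illustrative.

```python
import math

SCALE = 1.2   # per-level ratio, from the figure labels
LEVELS = 7    # sub-processes (a)..(g) / (A)..(G)

def iterative_factors(levels=LEVELS, scale=SCALE):
    """Each stage shrinks the previous stage's output by 1/scale."""
    factors, current = [], 1.0
    for _ in range(levels):
        current /= scale
        factors.append(current)
    return factors

def independent_factors(levels=LEVELS, scale=SCALE):
    """Each stage shrinks the original image directly by (1/scale)^n."""
    return [1.0 / scale ** n for n in range(1, levels + 1)]

for n, (it, ind) in enumerate(zip(iterative_factors(), independent_factors()), 1):
    print(f"level {n}: iterative 1/{1 / it:.2f}, independent 1/{1 / ind:.2f}")
```

The iterative form has a data dependency between consecutive stages, while the independent form only depends on the original image; that difference is what the two mapping cases exploit.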

Processing time in PC environment
• Image size: 4096×2380
• Software: OpenCV2.4.6.1 (ORB descriptor, resize)
• Resize algorithm: linear interpolation
• Hardware:
• CPU: AMD Phenom II 905e (2.5GHz)
• Observation
• Independent resize takes more time
• Reason: large input image size
[Figure: processing time (ms), 0 to 35, versus the level of the scale pyramid (1 to 7), for Iterative Image Resize (1/1.2 x 8 times) and Independent Image Resize (1/1.2)^n]

3D-SCSS Mapping Example
(case 1: Iterative Resize)
[Figure: a Memory Chip stacked with Processor Chip1, Processor Chip2, ..., Processor Chip7; each processor chip contains a processor and on-chip memory]
• Input: 4K (4096x2304), 1byte/pixel
• Sub-processes: (a) resize 1/1.2, (b) resize 1/1.2, ..., (g) resize 1/1.2
• FIFO buffers are assigned to the memory chip
• Buffer sizes: 9.4MB → 6.5MB → 4.5MB → ... → 1.0MB → 0.7MB
• Data transfer rate @ 10fps [MB/s]
• Processor Chip1: R 94MB/s, W 65MB/s
• Processor Chip2: R 65MB/s, W 45MB/s
• Processor Chip7: R 10MB/s, W 7MB/s
• Total: R 281MB/s, W 194MB/s
POINTS
・7 chips
・Each chip works independently
・no need of sync
・All the results are written to memory
[Figure: data rate (MB/s), 0 to 100, read and write, for sub-processes a to g]
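The per-chip rates for the iterative case follow from the image size, the 1/1.2 scaling per dimension, and the frame rate. The sketch below is a reconstruction under two assumptions that the slide does not state: 1 MB = 10^6 bytes, and each rate truncated to whole MB/s before summing, which is what reproduces the totals of 281 and 194 MB/s.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 4096, 2304, 1
FPS = 10
SCALE = 1.2
STAGES = 7  # sub-processes (a)..(g)

def level_bytes(n):
    """Image size in bytes after n resizes of 1/SCALE per dimension."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL / SCALE ** (2 * n)

def rate_mb_s(n):
    """Transfer rate of pyramid level n at FPS, truncated to whole MB/s."""
    return int(level_bytes(n) * FPS / 1e6)

# Stage k (1-based) reads level k-1 from the memory chip and writes level k back.
reads = [rate_mb_s(k - 1) for k in range(1, STAGES + 1)]
writes = [rate_mb_s(k) for k in range(1, STAGES + 1)]

print("read :", reads, "total", sum(reads))
print("write:", writes, "total", sum(writes))
```

Without the truncation step the totals come out a few MB/s higher, so the rounding convention is a guess that happens to match the slide rather than a documented choice.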

3D-SCSS Mapping Example
(case 2: Independent Resize)
[Figure: a Memory Chip stacked with Processor Chip1, Processor Chip2, ..., Processor Chip7; each processor chip contains a processor and on-chip memory]
• Input: 4K (4096x2304), 1byte/pixel
• Sub-processes: (A) resize 1/1.2, (B) resize 1/1.44, ..., (G) resize 1/3.58
• FIFO buffers are assigned to the memory chip
• Buffer sizes: 9.4MB input; outputs 6.5MB, 4.5MB, ..., 0.7MB
• Data transfer rate @ 10fps [MB/s] (values in parentheses: with broadcast)
• Processor Chip1: R 94MB/s, W 65MB/s
• Processor Chip2: (R 94MB/s), W 76MB/s
• Processor Chip7: (R 94MB/s), W 7MB/s
• Total: R 658 (94)MB/s, W 194MB/s
POINTS
・Broadcast may reduce data transfer
・need of sync when broadcasting
・All the results are written to memory
[Figure: data rate (MB/s), 0 to 100, read and write, for sub-processes A to G, shown without and with broadcast]
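The read totals for the independent case, with and without broadcast, can be reproduced the same way. As before, this is a sketch under assumptions not stated on the slide: 1 MB = 10^6 bytes and per-stage rates truncated to whole MB/s.

```python
WIDTH, HEIGHT, FPS, SCALE, STAGES = 4096, 2304, 10, 1.2, 7

def rate_mb_s(n):
    """Rate of the image scaled by (1/SCALE)^n per dimension, in whole MB/s."""
    return int(WIDTH * HEIGHT / SCALE ** (2 * n) * FPS / 1e6)

source = rate_mb_s(0)  # every stage reads the original image

read_unicast = source * STAGES   # each chip fetches its own copy
read_broadcast = source          # one transfer shared by all chips
write_total = sum(rate_mb_s(k) for k in range(1, STAGES + 1))

print(f"read  (no broadcast): {read_unicast} MB/s")
print(f"read  (broadcast)   : {read_broadcast} MB/s")
print(f"write               : {write_total} MB/s")
```

The write total is the same as in the iterative case because the same seven images are produced; only the read side changes, and broadcasting removes the sevenfold duplication of the source image.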

Discussion
• Case 2: Independent Resize is better
in terms of data transfer size
• Data transfer reduces with broadcasting!
• Different tradeoff from the
conventional system.
[Figure: data transfer time (ms), 0 to 3.0, read and write, per sub-process name. Case 1: iterative resize (a to g); Case 2: independent resize (A to G)]
[Figure (PC): processing time (ms), 0 to 35, versus the level of the scale pyramid (1 to 7), for Iterative Image Resize (1/1.2 x 8 times) and Independent Image Resize (1/1.2)^n]

Power estimation for data transfer
• TSV capacitance: 0.3pF
• Energy for 1-bit transfer: 0.3pJ (@1.0V)
• Q[C] = CV, E[J] = QV = CV^2
• E = 0.3[pF] x (1.0)^2 [V^2] = 0.3[pJ]
• 1 byte(=8bit) transfer: 2.4[pJ]
• Estimated results (10fps)
• Minimum 691.2μW
(69.12μJ per frame)
[Figure: power consumption @ 10fps (µW), 0 to 2,500, split into Write [µW] and Read [µW], for network mappings (a) Iterative, (b) Independent, (b') Independent, broadcast]
Conclusion
• Proposed the design method of 3D-SCSS(Three-
Dimensional Standard-Chip Stacked System) with
Standard BUS
• Stacking by: TSV(Through Silicon Via)+Bump
• Design method: KPN mapping
• A design case of image scaling is studied
• Improvement in performance can be expected by
introducing another type of parallel processing, which has a
different tradeoff from that in a normal PC environment.
• Communication and synchronization mechanism
• By realizing broadcast communication in the 3D-SCSS,
energy consumption is expected to be reduced further.

THANK YOU
