This presentation proposes a design method for efficient parallel processing in a three-dimensional standard-chip stacked system (3D-SCSS) that uses a standard bus. It presents a model for mapping parallel algorithms onto a 3D-SCSS and describes a design flow. As an example, it maps the scale pyramid generation step of an image recognition algorithm across multiple processor chips in the 3D-SCSS. The analysis indicates that independent resize, when combined with broadcast, reduces data transfer compared with iterative resize, although broadcasting requires synchronization. The estimated minimum power consumption for data transfer at 10 frames per second is 691.2 μW.
1. DESIGNING EFFICIENT PARALLEL PROCESSING IN 3D STANDARD-CHIP STACKING SYSTEM WITH STANDARD BUS
Takeshi Ohkawa* Kanemitsu Ootsu* Takashi Yokota*
Katsuya Kikuchi** Masahiro Aoyagi**
* Dept. of Information Systems Science,
Graduate School of Engineering, Utsunomiya University
** 3D Integration System Group,
Nano-electronics Research Institute, National Institute of
Advanced Industrial Science and Technology (AIST)
2017/9/18 MCSoC2017@Korea Univ., Seoul 1
2. Outline
• 1. Introduction
• 2. 3D-SCSS (Standard Chip Stacked System)
Design Method
• Model of target 3D-SCSS system and TSV technology
• Model-driven parallel system design flow
• 3. Design Example
• Overview of image recognition (ORB feature extraction)
• Parallel processing for Scale Pyramid Generation Process
• Discussion on Processing Performance
• Estimation of Power Consumption
• 4. Conclusion
3. Research Background
• Requirement: High-performance at low energy
• Smartphones, tablets, information home appliances, IoT (Internet of Things), M2M (Machine to Machine), automotive
• e.g., image recognition (local feature, DNN)
• General tradeoff: flexibility and energy efficiency
• Energy Efficiency
• ASIC : DSP/FPGA : SW = 1000 : 10 : 1
• Flexibility
• State-of-the-art algorithm changes fast!
• Heterogeneous integration of general/special LSIs
→ High energy-efficiency, High-performance
4. Heterogeneous Integration of LSIs
• Conventional
• Electronic Circuit Board integration
• TSV(Through Silicon Via) technology
• extends the limit of horizontal integration
• opens holes (Via) through silicon chips to connect
electric signals and power supply vertically to the back
of the chips
• Related technologies
• Interposer (a chip dedicated for inter-chip wiring,
electrical [7] or optical [8])
• wireless (vertical) communications between chips [9]
[7] Kurita, Yoichiro, et al. "A novel 'SMAFTI' package for inter-chip wide-band data transfer." Electronic Components and Technology Conference, 2006. Proceedings. 56th. IEEE, 2006.
[8] Arakawa, Yasuhiko, et al. "Silicon photonics for next generation system integration platform." IEEE Communications Magazine 51.3, pp. 72-77, 2013.
[9] Miura, Noriyuki, et al. "Analysis and design of inductive coupling and transceiver circuit for inductive inter-chip wireless superconnect." IEEE Journal of Solid-State Circuits 40.4, pp. 829-837, 2005.
5. Research Objectives
• Proposal of 3D-SCSS (Three-Dimensional Standard-Chip Stacked System) with Standard BUS
• To improve design productivity to satisfy performance and energy requirements by exploiting the parallelism of the algorithm
• Study of an example design case
• Mapping parallel processing of image recognition
• Evaluating the performance and energy efficiency
7. Our Proposal: 3D-SCSS with Standard BUS (Three-Dimensional Standard-Chip Stacked System)
• Heterogeneous 3D integration of standard and special-purpose LSI chips for performance/energy
• Define “Standard BUS” for Stock and Reuse
• By defining a Standard Socket (like PCI on a PC motherboard), the desired system is built up just by stacking stocked chips.
• Physical size, layout
• Electrical voltage and impedance
• Layered communication protocols: datalink, network, transport, application…
• Our GOAL
• To improve design productivity to satisfy performance and energy requirements at the application level
8. 3D Interconnect for Multi Core Internal Bus
■ 2D Interconnect (horizontal, 2D system)
Figure labels: Multi Core System LSI; RF/Analog, Memory, Logic, I/O; 64 or 128-bit On-Chip Bus
- Long wiring for bus communication
- Limited number of signal lines
- Large bus-driver circuits
- Many repeater buffer circuits
- Difficulty of integrating heterogeneous chips
■ 3D Interconnect (vertical, 3D system)
Figure labels: Heterogeneous Multi LSI Chip Stack System; TSVs/micro-bumps array; 1600-bit Wide Interchip Bus
- Short wiring for bus communication
- Architecture of wide interchip bus
- Smaller bus interface circuits (low-capacitance TSVs)
TSV: Through Si Via
3D interchip communication with high data transfer rate between heterogeneous multi chips
Presented in 3D Test Workshop, Anaheim CA, USA, September 13, 2013
2017 update: 20,000-bit
9. Wide Bus Chip-to-Chip Interconnection to Realize Scalable Stacking for Multi LSI Chip Stack System
[Figure: scalable stacking for a heterogeneous multi LSI chip stack system on a Si interposer and package substrate]
COOL Interconnect: Wide Bus Chip-to-Chip Interconnection
Figure labels: Standard Interface Circuits; TSV (10 μmΦ, 50 μmD); Micro-bump; 50 μm
- Chip-level stack process for chip stacking: low internal stress bonding
- High density of 3D interconnect: fine TSVs/micro-bumps, fine pitch
Presented in 3D Test Workshop, Anaheim CA, USA, September 13, 2013
10. Wide Bus Communication Test LSI Device
Ultra Wide Bus Interface Circuits Block
• Occupied area: 2.16 mm square
• Power, GND: 400 Al pads
• Outer connection: wire bonding pads
• Al pad array area: 2 mm square, 40 x 40 (= 1600) pads, 20 μm square, 50 μm pitch
Parameter: Value
Chip size: 8.3 mm x 6.0 mm
Clock frequency: 50 MHz
Power voltage: 2.5 V
Bus signal number: 1600
Bus data rate: 6.4 GB/s @ 1024 bit, 50 MHz
Bus occupied area: 4.67 mm²
TSV/bump area: 4 mm² (86 %)
Driver circuits area: 0.67 mm² (14 %)
Technology: 0.25 μm CMOS
Presented in 3D Test Workshop, Anaheim CA, USA, September 13, 2013
11. Summary of 3D-SCSS standard BUS [10]
Parameter: Value
Physical size: 2 mm x 2 mm
BUS location: Center of the chip
Number of TSVs and bumps [TSVs for data signal]: 1600 (40 x 40) [1024]
Signal frequency: 50 MHz
Communication capacity: 51.2 Gbps
Power consumption (flip-chip result): 97 mW * @ 50% toggle rate
* Only the I/O power is measured separately.
[10] Aoyagi, M.; Imura, F.; Nemoto, S.; Watanabe, N.; Kato, F.; Kikuchi, K.; Nakagawa, H.; Hagimoto, M.; Uchida, H.; Matsumoto, Y., "Wide bus chip-to-chip interconnection technology using fine pitch bump joint array for 3D LSI chip stacking," 2nd IEEE CPMT Symposium Japan, Kyoto, 2012, pp. 1-4.
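The communication capacity listed for the standard BUS follows from the number of data TSVs and the signal frequency. The following is a minimal sketch of that arithmetic, assuming one bit is transferred per data TSV per clock cycle; the function name is illustrative and does not come from the slides.

```python
def bus_capacity_bps(data_tsvs: int, signal_hz: float) -> float:
    """Raw bus capacity in bits per second, assuming 1 bit per TSV per cycle."""
    return data_tsvs * signal_hz


# 1024 data TSVs driven at 50 MHz, as listed for the standard BUS.
capacity = bus_capacity_bps(1024, 50e6)

print(f"{capacity / 1e9:.1f} Gbps")      # gigabits per second
print(f"{capacity / 8 / 1e9:.1f} GB/s")  # gigabytes per second
```

Under this assumption the result agrees with both the 51.2 Gbps capacity and the 6.4 GB/s bus data rate quoted for the test device.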
12. Design method of 3D-SCSS using standard BUS
• Conventional: There is no standard access method between vertically stacked chips.
• Intra-chip: CPU/Memory BUS (e.g., ARM AXI4)
• Inter-chip: High-speed serial, Ethernet, …
• Consider the future scalability/flexibility!
• Each chip will be a sufficiently complex system in itself.
• A loosely coupled architecture is needed.
13. Mapping KPN[16] on 3D-SCSS
• KPN[16]: Kahn Process Network (Process and FIFO model)
• Mapping
• A process onto a processor element on a chip
• A buffer onto a memory element on a chip
• Application layer
• Process to process data exchange through a buffer
• Control process to reduce the KPN[16] complexity
[Figure: a network of processes supervised by a control process]
[16] G. Kahn, “The semantics of a simple language for parallel programming,” Proc. of the IFIP
Congress 74. North-Holland Publishing Co., pp. 471-475, 1974
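The process-and-FIFO model can be illustrated in software. The following is a minimal sketch, not the authors' implementation: threads stand in for processor elements, `queue.Queue` objects stand in for the FIFO buffers that would be mapped onto a memory element, and the `DONE` marker and function names are assumptions made for the example.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def process(fn, inbox, outbox):
    """A KPN process: blocking read from its input FIFO, write to its output FIFO."""
    while True:
        token = inbox.get()  # blocks until a token is available
        if token is DONE:
            outbox.put(DONE)  # propagate shutdown downstream
            return
        outbox.put(fn(token))


def run_pipeline(stages, tokens):
    """Control process: creates the FIFOs, starts the processes, feeds and drains the network."""
    fifos = [queue.Queue() for _ in range(len(stages) + 1)]  # unbounded FIFOs
    workers = [
        threading.Thread(target=process, args=(fn, fifos[i], fifos[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for worker in workers:
        worker.start()
    for token in tokens:
        fifos[0].put(token)
    fifos[0].put(DONE)
    results = []
    while True:
        item = fifos[-1].get()
        if item is DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


result = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
print(result)
```

Because each process only communicates through blocking FIFO reads, the output order is deterministic regardless of thread scheduling, which is the property that makes the model attractive for loosely coupled chips.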
15. Example Process Network for Image Recognition (ORB Feature Extraction)
• Image recognition using ORB Feature Descriptor
• Process1: Preprocessing of the input image
• Process2: Scaled Image Generation
• Process3: Key-point extraction
• Process4: Feature description
• Process5: Matching/ Machine Learning
• Process6: Output of the Recognition result
[Figure: process network supervised by a Controller. Input image → P1 (Preprocessing of the input image) → Preprocessed image → P2 (Scaled Image Generation) → Resized images → P3 (Key-point extraction) → Image + key-point information → P4 (Feature description) → Feature descriptor → P5 (Matching / Machine Learning) → Matching / learning result → P6 (Output of the recognition result)]
17. Processing time in PC environment
• Image size: 4096×2380
• Software: OpenCV2.4.6.1 (ORB descriptor, resize)
• Resize algorithm: linear interpolation
• Hardware:
• CPU: AMD Phenom II 905e (2.5GHz)
• Observation
• Independent resize takes more time
• Reason: large input image size
[Figure: processing time (ms) versus the level of the scale pyramid (1 to 7) for Iterative Image Resize (1/1.2 x 8 times) and Independent Image Resize (1/1.2)^n]
18. 3D-SCSS Mapping Example (case 1: Iterative Resize)
Input: 4K (4096x2304), 1 byte/pixel
[Figure: a memory chip stacked with Processor Chip1, Processor Chip2, ..., Processor Chip7; each processor chip contains a processor and on-chip memory; FIFO buffers are assigned to the memory chip]
Sub-processes: (a) resize 1/1.2 → (b) resize 1/1.2 → ... → (g) resize 1/1.2
Image sizes along the chain: 9.4MB → 6.5MB → 4.5MB → ... → 1.0MB → 0.7MB
Data transfer rate @ 10fps [MB/s]
• Processor Chip1: R 94MB/s, W 65MB/s
• Processor Chip2: R 65MB/s, W 45MB/s
• Processor Chip7: R 10MB/s, W 7MB/s
• Memory Chip: R 281MB/s, W 194MB/s
[Figure: data rate (MB/s), Read and Write, for sub-processes a to g]
POINTS
・7 chips
・Each chip works independently
・no need of sync
・All the results are written to memory
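The per-chip data rates of the iterative mapping can be approximated from the frame size alone. The following is a rough sketch, assuming each 1/1.2 resize shrinks the image area by exactly 1.2² = 1.44 and ignoring any rounding of image dimensions; the function name is illustrative and not from the slides.

```python
FRAME_BYTES = 4096 * 2304  # 4K input at 1 byte/pixel (from the slide)
FPS = 10
AREA_RATIO = 1.2 ** 2      # scaling each side by 1/1.2 divides the area by 1.44


def iterative_traffic(levels: int = 7):
    """Per-chip (read, write) bytes per frame when chip k reads the output of chip k-1."""
    traffic = []
    size = float(FRAME_BYTES)
    for _ in range(levels):
        out = size / AREA_RATIO
        traffic.append((size, out))  # this chip reads `size` and writes `out`
        size = out                   # the next chip reads what this one wrote
    return traffic


chain = iterative_traffic()
for name, (read, write) in zip("abcdefg", chain):
    print(f"({name}) R {read * FPS / 1e6:5.1f} MB/s  W {write * FPS / 1e6:5.1f} MB/s")

total_read = sum(read for read, _ in chain) * FPS / 1e6
total_write = sum(write for _, write in chain) * FPS / 1e6
print(f"total R {total_read:.0f} MB/s  W {total_write:.0f} MB/s")
```

The idealized totals come out a few MB/s above the 281MB/s and 194MB/s shown on the slide, so the slide's figures may include rounding that this sketch does not model.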
19. 3D-SCSS Mapping Example (case 2: Independent Resize)
Input: 4K (4096x2304), 1 byte/pixel
[Figure: a memory chip stacked with Processor Chip1, Processor Chip2, ..., Processor Chip7; each processor chip contains a processor and on-chip memory; FIFO buffers are assigned to the memory chip]
Sub-processes: (A) resize 1/1.2, (B) resize 1/1.44, ..., (G) resize 1/3.58
Image sizes: 9.4MB, 6.5MB, 4.5MB, ..., 0.7MB
Data transfer rate @ 10fps [MB/s]
• Processor Chip1: R 94MB/s, W 65MB/s
• Processor Chip2: (R 94MB/s), W 76MB/s
• Processor Chip7: (R 94MB/s), W 7MB/s
• Memory Chip: R 658 (94)MB/s, W 194MB/s
[Figures: data rate (MB/s), Read and Write, for sub-processes A to G, shown without broadcast and with broadcast (*)]
POINTS
・Broadcast may reduce data transfer
・need of sync when broadcasting
・All the results are written to memory
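The effect of broadcast on the independent mapping can be estimated in the same way. The following is a rough sketch, assuming every chip resizes the original 4K frame directly, that level n shrinks the area by 1.44^n, and that a broadcast lets a single read from the memory chip serve all seven chips; the function name is illustrative and not from the slides.

```python
FRAME_BYTES = 4096 * 2304  # 4K input at 1 byte/pixel (from the slide)
FPS = 10
LEVELS = 7


def independent_traffic(broadcast: bool):
    """Total (read, write) MB/s when every chip resizes the original frame directly."""
    # Chip n writes an image scaled by (1/1.2)^n per side, i.e. 1/1.44^n in area.
    write = sum(FRAME_BYTES / 1.44 ** n for n in range(1, LEVELS + 1))
    # Without broadcast each chip fetches its own copy of the input frame;
    # with broadcast one read from the memory chip is shared by all chips.
    read = FRAME_BYTES * (1 if broadcast else LEVELS)
    return read * FPS / 1e6, write * FPS / 1e6


for mode in (False, True):
    read_rate, write_rate = independent_traffic(mode)
    print(f"broadcast={mode}: R {read_rate:.0f} MB/s, W {write_rate:.0f} MB/s")
```

The read traffic drops by a factor of seven with broadcast while the write traffic is unchanged, which mirrors the R 658 (94)MB/s entry on the slide up to rounding.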
20. Discussion
• Case 2: Independent Resize is better in terms of data transfer size
• Data transfer reduces with broadcasting!
• Different tradeoff from the conventional system.
[Figure: data transfer time (ms), Read and Write, per sub-process, for Case 1: iterative resize (a to g) and Case 2: independent resize (A to G)]
[Figure (PC): processing time (ms) versus the level of the scale pyramid (1 to 7) for Iterative Image Resize (1/1.2 x 8 times) and Independent Image Resize (1/1.2)^n]
21. Power estimation for data transfer
• TSV electric capacitance: 0.3pF
• Energy for 1-bit transfer: 0.3pJ (@1.0V)
• Q[C] = CV, E[J] = QV = CV^2
• E = 0.3[pF] x (1.0)^2 [V^2] = 0.3[pJ]
• 1 byte (= 8 bit) transfer: 2.4[pJ]
• Estimated results (10fps)
• Minimum 691.2μW (69.12μJ per frame)
[Figure: power consumption @ 10fps (µW), Read and Write, by network mapping: (a) Iterative, (b) Independent, (b') Independent, broadcast]
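The power estimate can be reproduced from the per-byte energy and the total data rates. The following is a minimal sketch, assuming power is simply the total read plus write traffic multiplied by 2.4 pJ per byte, with the memory-chip totals from the two mapping examples as inputs; the function and dictionary names are illustrative.

```python
TSV_CAPACITANCE_F = 0.3e-12  # 0.3 pF per TSV (from the slide)
VOLTAGE_V = 1.0
FPS = 10

energy_per_bit = TSV_CAPACITANCE_F * VOLTAGE_V ** 2  # E = C V^2, i.e. 0.3 pJ
energy_per_byte = 8 * energy_per_bit                 # 2.4 pJ


def transfer_power_w(read_mb_s: float, write_mb_s: float) -> float:
    """Average power (W) for moving the given read + write traffic over the TSV bus."""
    bytes_per_second = (read_mb_s + write_mb_s) * 1e6
    return bytes_per_second * energy_per_byte


# Total (read, write) rates in MB/s at 10 fps, taken from the mapping examples.
cases = {
    "(a) iterative": (281, 194),
    "(b) independent": (658, 194),
    "(b') independent, broadcast": (94, 194),
}
for name, (read, write) in cases.items():
    power = transfer_power_w(read, write)
    print(f"{name}: {power * 1e6:.1f} uW, {power / FPS * 1e6:.2f} uJ per frame")
```

Case (b') reproduces the stated minimum of 691.2μW (69.12μJ per frame). The values for (a) and (b) are derived here under the same assumption and are not quoted from the slides.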
22. Conclusion
• Proposed the design method of 3D-SCSS (Three-Dimensional Standard-Chip Stacked System) with Standard BUS
• Stacking by: TSV (Through Silicon Via) + Bump
• Design method: KPN mapping
• A design case of image scaling is studied
• Improvement in performance can be expected by introducing another type of parallel processing, which has a different tradeoff from that under a normal PC environment.
• Communication and synchronization mechanism
• By realizing broadcast communication in the 3D-SCSS, energy consumption is expected to be reduced further.