This presentation was given at the Green500 BoF at SC21, where PFN's VP of Computing Infrastructure Yusuke Doi discussed power measurement for PFN's MN-3 supercomputer with MN-Core™ accelerators, and how the company improved MN-3's power efficiency from 29.70 GF/W to 39.38 GF/W in five months.
More about MN-Core: https://projects.preferred.jp/mn-core/en/
More about MN-3: https://projects.preferred.jp/supercomputers/en/
3. Preferred Networks, Inc.
Industry domains: transportation, manufacturing, life sciences, materials, robots, entertainment
Founded: March 2014
Directors: CEO Toru Nishikawa, COO Daisuke Okanohara, CTO Ryosuke Okuta
Locations: Tokyo, Japan (HQ); Burlingame, CA, US (Preferred Networks America, Inc.)
Make the real world computable
4. How much information can you extract from a single image?
Our pixel-accurate object detection model extracts a large amount of rich features from a single image using:
● State-of-the-art algorithms
● Hyperparameter tuning and optimization using Optuna™
● Proprietary CG-based, annotation-free data generation and augmentation, combined with domain transfer to real images
● Distributed / large-batch training
5. Computational Chemistry with Deep Learning
Behavior model with a neural network: searching for new materials on computers.
[Diagram: atoms → energy and force → molecular dynamics → physical properties]
6. Our Capabilities
Deep learning: world-class researchers focusing on deep learning
Expertise: a wide range of deep expertise, from robotics to genomics to computational chemistry
Private supercomputer: world-class computational resources designed for deep learning applications
Software: in-house development of OSS and a hyperparameter tuning library to accelerate software development
7. MN-3 and MN-Core: Deep Learning Supercomputer
MN-3 node specs:
● MN-Core (deep learning processor): MN-Core board x 4
● CPU: Intel Xeon 8260M, 2-way (48 physical cores)
● Memory: 384 GB DDR4
● Storage-class memory: 3 TB Intel Optane DC Persistent Memory
● Network: MN-Core DirectConnect (112 Gbps) x 2, Mellanox ConnectX-6 (100 GbE) x 2, on-board (10 GbE) x 2
For more information please visit: https://projects.preferred.jp/supercomputers/en/
8. MN-3 and HPL
MN-3 is the world's most energy-efficient supercomputer for deep learning.
We use HPL to understand how to run our computer efficiently.
Green500 / TOP500 history:
● 2021/11, 39.38 GFlops/W (Green500 #1 / TOP500 #301)
● 2021/06, 29.70 GFlops/W (Green500 #1 / TOP500 #335)
● 2020/11, 26.04 GFlops/W (Green500 #2 / TOP500 #330)
● 2020/06, 21.11 GFlops/W (Green500 #1 / TOP500 #393)
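The Green500 figure of merit is simply HPL Rmax divided by the average system power drawn while the benchmark runs. A minimal sketch of the calculation, with illustrative numbers rather than MN-3's actual Rmax and power:

```python
def gflops_per_watt(rmax_gflops: float, avg_power_watts: float) -> float:
    """Green500 energy efficiency: sustained HPL GFlops per watt of average power."""
    return rmax_gflops / avg_power_watts

# Illustrative only: a system sustaining 1,969,000 GFlops at 50 kW average power
# would report 39.38 GFlops/W.
print(gflops_per_watt(1_969_000, 50_000))
```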
9. MN-Core: Giant SIMD Processor
● Single instruction stream
● 500 W/package @ 32.8 TF (DP)
○ 65 GF/W on chip (ideal case)
● Hierarchical structure with a unique on-chip network (broadcast, aggregation, etc.)
● Deterministic and transparent from software
○ No cache; software must manage data copies between layers
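The ideal-case on-chip figure follows directly from the quoted package specs; a quick back-of-envelope check:

```python
# Ideal-case on-chip efficiency from the quoted package numbers:
peak_dp_tflops = 32.8   # double-precision peak per package
package_watts = 500.0   # power per package
gflops_per_watt = peak_dp_tflops * 1000.0 / package_watts
print(round(gflops_per_watt, 1))  # 65.6, i.e. the ~65 GF/W ideal case quoted above
```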
10. Idea Behind MN-Core: Transparent Hardware for High Performance
Philosophy behind MN-Core hardware:
"By providing only the functions necessary for computation and controlling them completely with software, we can achieve high execution efficiency and power efficiency with minimal hardware. This is difficult to achieve with cache-based parallel processors whose behavior is hidden from software."
— Prof. Makino (Kobe Univ.)
12. Update to Level 3 Power Measurement
● Level 2 in our first Green500 (June 2020)
● Upgraded to meet Level 3 requirements
○ The #1 system should be one of the best examples of power measurement
● We have measured to Level 3 criteria since our second Green500 (Nov. 2020)
[Photo: upgrading the power facility and measurement devices to meet Level 3 requirements]
13. Level 3 Measurements: Power System of MN-3
[Diagram: two 200 V / 600 A (3P3W) feeds and a 200 V / 150 A (3P3W) feed supply MN-3A Zone 0 and Zone 1 (32 nodes, as two groups of 16 behind Smart PDUs x 10 and x 8) and the MN-3 interconnect; power is measured with revenue-grade meters (ME110SSR x 4) and a power analyzer (WT1800E, 6 elements). Measurements flow into a TSDB; the HPL program triggers measurement, and feedback is delivered via a Slack bot and the web (Grafana).]
15. Measurement System Supporting Continuous Improvement
● The more iterations, the more improvements
● Key to rapid iteration: quickly sharing the results of experiments
● Automated reporting system
○ Issues a unique ID to each HPL run
○ Records timestamps of the core/full phases with the ID
○ Generates a summary and graph of power measurements
○ Shares the results in Slack
● It helps us quickly understand the effects of development
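The steps above can be sketched as follows. This is a hypothetical illustration, not PFN's actual tooling: the class and field names (`HplRun`, `record_phase`, `summarize`) are invented, and the numbers are synthetic. Each run gets a unique ID, phase timestamps are recorded against it, and a summary is built from the power samples that fall inside the core phase, ready to be posted to Slack.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class HplRun:
    # Unique ID issued to each HPL run, used to tag all records.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    phases: dict = field(default_factory=dict)        # phase name -> (start, end)
    power_samples: list = field(default_factory=list) # (timestamp, watts)

    def record_phase(self, name: str, start: float, end: float) -> None:
        self.phases[name] = (start, end)

    def summarize(self, gflops: float) -> dict:
        """Average power over the 'core' phase and the resulting GFlops/W."""
        start, end = self.phases["core"]
        watts = [w for t, w in self.power_samples if start <= t <= end]
        avg_power = sum(watts) / len(watts)
        return {"run_id": self.run_id,
                "avg_power_w": avg_power,
                "gflops_per_w": gflops / avg_power}

# Synthetic example: a steady 50 kW draw during a core phase from t=100 to t=200.
run = HplRun()
run.record_phase("core", 100.0, 200.0)
run.power_samples = [(t, 50_000.0) for t in range(90, 210, 10)]
report = run.summarize(gflops=1_970_000.0)  # illustrative Rmax, not MN-3's figure
message = f"HPL run {report['run_id']}: {report['gflops_per_w']:.2f} GFlops/W"
# In the real system this summary would be posted to Slack (e.g. via a webhook).
```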
18. MN-Core Challenge on HPL
● 2020/06, 21.11 GFlops/W, efficiency 41% (#1)
○ Initial challenge; made it work
● 2020/11, 26.04 GFlops/W, efficiency 53% (#2)
○ Optimization of scheduling and GEMM
● 2021/06, 29.70 GFlops/W, efficiency 58% (#1)
○ Even more optimization
● 2021/11, 39.38 GFlops/W, efficiency 64% (#1)
○ Interconnect improvement, aggressive software-level clock gating, even even more optimization
19. Execution Efficiency Improvement Breakdown: 58% → 64% (6 pt gain)
● +2 pt (58 → 60): Optimized the DGEMM kernel and reorganized the overlap of DGEMM and communication (swap and panel broadcast)
● +3 pt (60 → 63): Increased the bandwidth of the interconnect (MN-Core DirectConnect) and overlapped more computation and communication
● +1 pt (63 → 64): Optimized other parts, including panel factorization and dynamic code generation
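Execution efficiency here is sustained HPL performance (Rmax) as a fraction of theoretical peak (Rpeak), so with peak unchanged, the step from 58% to 64% translates directly into more sustained GFlops:

```python
# With Rpeak fixed, Rmax scales linearly with execution efficiency.
speedup = 0.64 / 0.58
print(f"{speedup:.3f}")  # 1.103 -> about 10% more sustained GFlops
```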
20. Power Efficiency Improvement Breakdown: 29.70 GF/W → 39.38 GF/W (9.68 GF/W gain)
● +3.4 GF/W: Corresponding to the improvement in execution efficiency
● +4.4 GF/W: Generating "energy-efficient instructions" in software: stopping unused arithmetic units, using scratchpad flip-flops instead of SRAMs, reducing the energy consumption of data copies, etc.
● +1.9 GF/W: Other tuning, including core voltage and frequency
[Figures: stopping an ALU in a PE; reusing a matrix as much as possible to reduce data copies]
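As a sanity check, the three per-item gains (each rounded to one decimal) sum back to the headline improvement, to rounding:

```python
baseline = 29.70         # GFlops/W, June 2021 result
gains = [3.4, 4.4, 1.9]  # execution efficiency, energy-efficient instructions, voltage/freq tuning
total = baseline + sum(gains)
print(round(total, 2))   # 39.4; matches the reported 39.38 GF/W to rounding
```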
21. Summary
● The unique computation framework of MN-Core
○ Deterministic and transparent hardware, fully controlled by software
○ Application-specific optimization
● HPL is a very useful benchmark for understanding the efficiency of a new style of computer in a real environment
○ Precise and integrated measurement is essential for continuous improvement
Please visit us at booth #1521!