Here are three recent "getting rid of X" papers in computer vision:
・Getting rid of ResNets: RepVGG: Making VGG-style ConvNets Great Again
・Getting rid of BatchNorm: High-Performance Large-Scale Image Recognition Without Normalization
・Getting rid of attention: LambdaNetworks: Modeling Long-Range Interactions Without Attention
M19: Using Cloud Engineering Solutions in Design and Analysis Work, and Their Benefits [Microsoft Japan Digital Days]
Microsoft Japan Co., Ltd.
Global Black Belt Group, HPC Technical Specialist
倉石 英明
In recent manufacturing, efforts to apply digital technology across a wide range of business scenarios have been accelerating. In the design and analysis field in particular, which builds on CAD, CAE, and PLM, ever-shorter product life cycles make it necessary to run many kinds of complex analyses in a timely manner, so there is demand for streamlining and optimizing design-analysis workflows using external resources. There is also growing interest in work platforms that let related departments and partner companies collaborate on this work regardless of geographic location. This session introduces the cloud engineering solutions that Microsoft Azure provides and explains how to put them to use.
【About Microsoft Japan Digital Days】
Microsoft Japan Digital Days was one of the largest digital events delivered by Microsoft Japan, aimed at helping customers strengthen their competitiveness, respond quickly to market changes, and achieve more. Over four days, the event delivered the information needed in the digital era: the latest examples of modern work that raise individual productivity and creativity and shape organizations for the cloud era, the latest technology capabilities supporting the business-resilience management that companies need to ride out waves of change and keep developing sustainably, and the vision for the cloud strategy essential to competitive advantage. (Held October 11-14, 2021)
Molecular dynamics (MD) is a very useful tool for understanding various phenomena in atomistic detail. In MD, the size- and time-scale problems can be overcome by efficient parallelization. In this lecture, I'll explain various parallelization methods for MD, with examples from the optimization of the GENESIS MD software on Fugaku.
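The workhorse of MD parallelization is spatial domain decomposition: each MPI rank owns a region of the simulation box and exchanges boundary ("halo") particles with its neighbors every step. Below is a minimal sketch of that idea, assuming a 1D slab decomposition with periodic boundaries; it is illustrative only, not GENESIS code, and all names in it are invented for the example.

```c
/* Minimal sketch of 1D spatial domain decomposition for MD.
 * Hypothetical illustration only -- not GENESIS code.
 * Compile with: mpicc md_halo.c -o md_halo */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_LOCAL 1000  /* particles owned by this rank (assumed fixed here) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns a slab of the box along x; neighbors are rank-1 and
     * rank+1 with periodic wrap-around. */
    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;

    double *x          = malloc(N_LOCAL * sizeof(double)); /* owned coords    */
    double *halo_left  = malloc(N_LOCAL * sizeof(double)); /* ghosts, left    */
    double *halo_right = malloc(N_LOCAL * sizeof(double)); /* ghosts, right   */
    for (int i = 0; i < N_LOCAL; i++)
        x[i] = rank + (double)i / N_LOCAL;                 /* fake positions  */

    /* Halo exchange: send my particles to both neighbors and receive theirs.
     * (A real MD code would send only particles within the cutoff distance
     * of the boundary.)  Sendrecv pairs the transfers to avoid deadlock. */
    MPI_Sendrecv(x, N_LOCAL, MPI_DOUBLE, right, 0,
                 halo_left,  N_LOCAL, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(x, N_LOCAL, MPI_DOUBLE, left,  1,
                 halo_right, N_LOCAL, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... compute short-range forces using x, halo_left, halo_right ... */

    if (rank == 0)
        printf("halo exchange done on %d ranks\n", nprocs);
    free(x); free(halo_left); free(halo_right);
    MPI_Finalize();
    return 0;
}
```

Because each rank only ever talks to its spatial neighbors, the communication volume per step stays roughly constant as more ranks are added, which is what makes the size-scale problem tractable.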
Slide 44 (計算科学技術特論B, May 7, 2020)
The computation time as a result of keeping the block data in the L1 cache manually decreased by 12% compared with the computation time under the usual data-replacement operations of the L1 cache. This DGEMM tuned for the K computer was also used for the LINPACK benchmark program.
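The tuning described above is an instance of classic cache blocking: the matrices are processed in sub-blocks small enough to stay resident in L1 while they are reused. The sketch below shows only that generic idea; the block size, loop order, and names are assumptions for illustration, and the actual tuned DGEMM for the K computer is far more elaborate than this.

```c
/* Generic cache-blocked matrix multiply, C += A * B (row-major, n x n).
 * A sketch of the L1-blocking idea only -- NOT the tuned K computer DGEMM. */
#include <stddef.h>

#define BS 48  /* assumption: BS chosen so three BS x BS double blocks fit in L1 */

static int imin(int a, int b) { return a < b ? a : b; }

void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)          /* block row of C          */
        for (int kk = 0; kk < n; kk += BS)      /* block of A to be reused */
            for (int jj = 0; jj < n; jj += BS)  /* block column of C       */
                /* The three BS x BS blocks touched below are small enough
                 * to stay in L1 across the inner loops, so each element
                 * loaded from memory is reused about BS times. */
                for (int i = ii; i < imin(ii + BS, n); i++)
                    for (int k = kk; k < imin(kk + BS, n); k++) {
                        double a_ik = A[(size_t)i * n + k];
                        for (int j = jj; j < imin(jj + BS, n); j++)
                            C[(size_t)i * n + j] += a_ik * B[(size_t)k * n + j];
                    }
}
```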
5.2 Scalability

We measured the computation time for the SCF iterations for different numbers of cores. The global communication time for the parallel tasks in orbitals was expected to increase as the number of parallel tasks in orbitals increased. The number of MPI processes requiring communication for the parallel tasks in orbitals, however, was actually restricted to a relatively small number of compute nodes, and therefore the wall-clock time for global communication of the parallel tasks in orbitals was small. This means we succeeded in decreasing the time for global communication by combining the two parallelization axes (grids and orbitals).
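In MPI terms, the mechanism the excerpt describes is usually implemented by splitting the world communicator into per-axis sub-communicators, so that a reduction over orbitals involves only the few ranks that share a grid partition (three orbital tasks in the run of Table 2 below). A hedged sketch follows; the 2D rank layout and all names are assumptions for illustration, not the actual RSDFT code.

```c
/* Sketch: two-axis (grids x orbitals) process layout via MPI_Comm_split,
 * so global sums over orbitals touch only n_orb ranks instead of all ranks.
 * Illustrative assumption -- not the actual RSDFT implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_orb = 3;         /* e.g. 3 orbital tasks, as in Table 2 */
    int grid_id = rank / n_orb;  /* which grid partition this rank owns */
    int orb_id  = rank % n_orb;  /* which orbital partition it owns     */

    /* Ranks sharing a grid partition form one small "orbital" communicator;
     * ranks sharing an orbital partition form a "grid" communicator. */
    MPI_Comm orb_comm, grid_comm;
    MPI_Comm_split(MPI_COMM_WORLD, grid_id, orb_id, &orb_comm);
    MPI_Comm_split(MPI_COMM_WORLD, orb_id, grid_id, &grid_comm);

    /* A global sum over orbitals now spans only n_orb processes: */
    double partial = 1.0, total;
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, orb_comm);
    if (rank == 0) printf("sum over %d orbital tasks: %f\n", n_orb, total);

    MPI_Comm_free(&orb_comm);
    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}
```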
Figure 6. Computation and communication time of (a) GS, (b) CG, (c) MatE/SD and (d) RotV/SD for different numbers of cores.
[Each panel plots time per operation in seconds against the number of cores (0 to 80,000), with curves for theoretical computation, computation, adjacent/space, global/space, global/orbital, and, in the GS panel, wait/orbital.]
Massive parallelization of RSDFT: scalability
Resolving the mismatch in degrees of parallelism
Eliminating the growth of global communication
Slide 45 (計算科学技術特論B, May 7, 2020)
In such thin nanowires, quantum confinement becomes prominent. The quantum effects, which depend on the crystallographic directions of the nanowire axes and on the cross-sectional shapes of the nanowires, result in substantial modifications to the energy-band structures and the transport characteristics of SiNW FETs. However, knowledge of the effect of the structural morphology on the energy bands of SiNWs is lacking. In addition, actual nanowires have side-wall roughness, and the effects of such imperfections on the energy bands are also unknown.
Table 2. Distribution of computational costs for an iteration of the SCF calculation of the modified code.
(Adjacent/grids, Global/grids, Global/orbitals and Wait/orbitals are communication times in seconds.)

Procedure block  Execution  Computation  Adjacent/  Global/  Global/   Wait/     Performance
                 time (s)   time (s)     grids      grids    orbitals  orbitals  (PFLOPS / %)
SCF              2903.10    1993.89      61.73      823.02   12.57     11.89     5.48 / 51.67
SD               1796.97    1281.44      13.90      497.36    4.27     –         5.32 / 50.17
MatE/SD           525.33     363.18      13.90      143.98    4.27     –         6.15 / 57.93
EigenSolve/SD     492.56     240.66      –          251.90    –        –         0.01 /  1.03
RotV/SD           779.08     677.60      –          101.48    –        –         8.14 / 76.70
CG                159.97      43.28      47.83       68.85    0.01     –         0.06 /  0.60
GS                946.16     669.17      –          256.81    8.29     11.89     6.70 / 63.10

The test model was a SiNW with 107,292 atoms. The numbers of grids and orbitals were 576 × 576 × 180 and 230,400, respectively. The numbers of parallel tasks in grids and orbitals were 27,648 and three, respectively, using 82,944 compute nodes. Each parallel task had 2160 grids and 76,800 orbitals.
Yukihiro Hasegawa, Jun-Ichi Iwata, Miwako Tsuji, Daisuke Takahashi, Atsushi Oshiyama, Kazuo Minami, Taisuke Boku, Hikaru Inoue, Yoshito Kitazawa, Ikuo Miyoshi and Mitsuo Yokokawa, "Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer," The International Journal of High Performance Computing Applications, pp. 1-21, published online 17 October 2013. DOI: 10.1177/1094342013508163. http://hpc.sagepub.com/content/early/2013/10/16/1094342013508163
Massive parallelization of RSDFT: overall performance
Slide 63 (計算科学技術特論B, May 7, 2020)
Thread parallelization
Effective use of the cache: conversion to matrix-matrix products
[Bar chart on the K computer (「京」): time in seconds for Gram-Schmidt orthonormalization, comparing four variants (original, original+DGEMM, 2-axis + 1 thread, 2-axis + 8 threads); each bar is broken down into M×M (matrix-matrix product), M×V (matrix-vector product), and other; y-axis 0 to 300 s.]
Results of the massive parallelization and performance optimization of PHASE
(matrix-matrix product vs. matrix-vector product)
Chapter 4: Research on performance-optimization techniques for high parallelism
Matrix-vector product computations remain in the triangular part of the processing, and the two subroutines set in plain (non-bold) type correspond to this part. Compared with the matrix-vector products, the matrix-matrix products perform roughly 8 to 15 times better. Even over section 5 as a whole, an effective efficiency of 28% was achieved.
Table 4.11: Results of the conversion of PHASE to matrix-matrix products (on the K computer, 「京」)

Subroutine                    Time (sec)  Share (%)  Compute efficiency (%)
Section 5 (total)                  512.7      100.0       28.23
m_ES_F_transpose_r                 105.3       20.5        0
m_ES_W_transpose_r                  15.7        3.1        0
WSW_t                               15.2        3.0        0.036
normalize_bp_and_psi_t               0.9        0.2        3.25
W1SW2_t_r                           49.2        9.6        5.46
modify_bp_and_psi_t_r               50.8        9.9        4.45
W1SW2_t_r_block                    162.0       31.6       41.89
modify_bp_and_psi_t_r_block         96.2       18.8       74.65
m_ES_W_transpose_back_r             14.1        2.8        0
m_ES_F_transpose_back_r              1.3        0.3        0
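The jump in efficiency in the "_block" rows comes from batching what used to be repeated matrix-vector products into a single matrix-matrix product, which raises arithmetic intensity: each matrix element fetched from memory is reused across many vectors. A minimal sketch of that transformation, assuming column-major data and the standard CBLAS interface (PHASE itself is Fortran; the function and variable names here are invented for illustration):

```c
/* Sketch: batching m matrix-vector products into one matrix-matrix product.
 * Illustrative only -- PHASE itself is Fortran; names here are invented. */
#include <cblas.h>
#include <stddef.h>

/* Before: y_j = A x_j, one DGEMV per vector.  DGEMV streams all of A from
 * memory for every vector, so it is memory-bound (near 0-5% efficiency,
 * as in the non-"_block" rows of Table 4.11). */
void apply_one_by_one(int n, int m, const double *A, const double *X, double *Y)
{
    for (int j = 0; j < m; j++)
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n,
                    X + (size_t)j * n, 1, 0.0, Y + (size_t)j * n, 1);
}

/* After: Y = A * X with the m vectors packed as columns of an n x m matrix.
 * One DGEMM reuses each element of A across all m vectors, so the kernel
 * becomes compute-bound -- consistent with the roughly 8-15x speedup and
 * the 41.89% / 74.65% efficiencies of the "_block" rows above. */
void apply_blocked(int n, int m, const double *A, const double *X, double *Y)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, m, n,
                1.0, A, n, X, n, 0.0, Y, n);
}
```

The same call also parallelizes well across threads, since any multithreaded BLAS can split the DGEMM internally, which is the point of the "2-axis + 8 threads" variant in the bar chart above.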