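Fortran Language Features That Make CUDA Fortran More Convenient

Tomohiro Degawa (出川智啓), Specially Appointed Associate Professor
Department of Electrical, Electronics and Information Engineering (Electrical Engineering Course)
Institute of GIGAKU, Nagaoka University of Technology
2015/9/17 Prometech Simulation Conference 2015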

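Contents
• A brief introduction to Modern Fortran (90/95/2003)
• Example: a CUDA Fortran implementation of the 1D advection equation
• Limiting the scope of porting by introducing object-oriented programming
• Summary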

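Introduction
• GPGPU has spread and its user base has broadened
  • Demand from Fortran users, who hold a large body of existing code
  • The arrival of CUDA Fortran
• Performance has risen with each GPU generation
  • Higher-performance hardware
  • Accumulated know-how on program optimization and acceleration
• GPGPU is moving from its early days into maturity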

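Introduction
• Progress in programming methodology
  • Better maintainability, extensibility, and reusability of programs
  • A shift from procedural to object-oriented programming
• Where Fortran stands
  • Object-oriented programming is possible from Fortran 2003 onward
  • Information on CUDA Fortran mostly amounts to "how to write what CUDA C does"
  • Little information relates it to Fortran's own features or to programming methodology
• This talk introduces Modern Fortran features that can be used from CUDA Fortran and shows examples of their use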

CUDA Fortran
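• An extension of Fortran for NVIDIA GPUs
  • Available in the PGI Fortran compiler
  • As of September 16, 2015, the latest version is 15.7 (released July 13)
  • It builds on CUDA C, but new CUDA features cannot be used until the Fortran compiler supports them
• Good balance between the effort invested and the gain (performance improvement)
  • Knowledge of parallel computing alone is enough to reach reasonable performance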

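PGI Fortran compiler
• A Fortran 77/90/95/2003 compiler
  • Partial support for Fortran 2008
  • The level of support is poor
• Which features can be used from CUDA Fortran?
  • Information on CUDA Fortran mostly amounts to "how to write what CUDA C does"
  • Little information exists on using Modern Fortran features from CUDA Fortran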

Fortran 90/95
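• A major advance over FORTRAN 77
• Main features
  • implicit none disables implicit typing
  • Array operations
    • Operations on all elements of an array are written in a single statement
    • Arrays can be returned as function results
  • Encapsulation with module
    • public and private control the visibility of variables and procedures†
†"Procedure" is the generic term for functions and subroutines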

Fortran 90/95
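• Main features
  • Flexible memory management
    • Dynamic management of arrays with allocate/deallocate and pointer
    • Automatic arrays
      • Arrays whose size is taken from the arguments and that are allocated and released automatically
  • Recursive procedures
  • Derived types
    • Correspond to structs in C
  • Overloading of procedures and operators
    • Procedures share a common name at the call site
    • The procedure actually called is chosen from the types of the actual arguments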
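As a minimal sketch of procedure overloading, the module below places two specific functions behind one generic name. The module name stats and the functions mean_real and mean_int are hypothetical names chosen for this illustration and do not appear in the original slides.

```fortran
module stats
  implicit none
  private
  public :: mean

  ! Generic name "mean": the compiler selects the specific procedure
  ! from the type of the actual argument.
  interface mean
    module procedure mean_real
    module procedure mean_int
  end interface mean

contains

  function mean_real(x) result(m)
    real(8), intent(in) :: x(:)
    real(8) :: m
    m = sum(x)/size(x)
  end function mean_real

  function mean_int(x) result(m)
    integer, intent(in) :: x(:)
    real(8) :: m
    m = dble(sum(x))/size(x)
  end function mean_int

end module stats

program demo_overload
  use stats
  implicit none
  print *, mean((/ 1d0, 2d0, 3d0 /))   ! resolves to mean_real
  print *, mean((/ 1, 2, 3, 6 /))      ! resolves to mean_int
end program demo_overload
```

Because the specific procedures are private, callers see only the generic name, which also illustrates the public/private control mentioned for modules.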

Fortran 2003
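• A major revision of Fortran 90/95
  • Support for object-oriented programming
  • Stronger interoperability with C [1]
• Main features
  • Extended derived types
    • A type bundles procedures as well as variables
    • Inheritance, polymorphism, abstract types, and more
  • Enhanced memory management
    • Cloning a variable with the source specifier
[1] Degawa, "PGI CUDA FortranとGPU最適化ライブラリの一連携法" (A method for linking PGI CUDA Fortran with GPU-optimized libraries), Prometech Simulation Conference 2014.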
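The source specifier can be sketched on the CPU as follows. The program and variable names are made up for this illustration. One caveat, stated as background rather than taken from the slides: taking the bounds of the new array from the source expression, without writing them explicitly, was standardized in Fortran 2008, although many compilers that target Fortran 2003 accept it.

```fortran
program demo_source
  implicit none
  real(8), allocatable :: a(:), b(:)
  integer :: i

  allocate(a(5))
  a = (/ (dble(i), i = 1, 5) /)

  ! b receives both its shape and its values from a in one statement
  allocate(b, source = a)

  b(1) = -1d0                      ! b is an independent copy, a is unchanged
  print *, size(b), a(1), b(1)

  deallocate(a)
  deallocate(b)
end program demo_source
```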

Writing code in Fortran 90/95 style
• Quicksort in ascending order
recursive function qsort(data) result(sorted)
  implicit none
  integer,intent(in) :: data(:)
  integer :: sorted(1:size(data))
  if(size(data) > 1)then
    !changing the pack masks to > and <= gives descending order
    sorted = (/ qsort(pack(data(2:),data(2:)< data(1))), &
                data(1),                                 &
                qsort(pack(data(2:),data(2:)>=data(1))) /)
  else
    sorted = data
  end if
end function qsort
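A usage sketch for the function above: because qsort takes an assumed-shape argument and returns an array, the caller needs an explicit interface, which is obtained here by placing the function in a module. The module name sorting and the test data are assumptions made for this example.

```fortran
module sorting
  implicit none
contains
  recursive function qsort(data) result(sorted)
    integer, intent(in) :: data(:)
    integer :: sorted(1:size(data))
    if (size(data) > 1) then
      ! elements smaller than the pivot, the pivot, then the rest
      sorted = (/ qsort(pack(data(2:), data(2:) <  data(1))), &
                  data(1),                                    &
                  qsort(pack(data(2:), data(2:) >= data(1))) /)
    else
      sorted = data
    end if
  end function qsort
end module sorting

program demo_qsort
  use sorting
  implicit none
  integer :: v(8) = (/ 3, 1, 4, 1, 5, 9, 2, 6 /)
  print *, qsort(v)     ! the values are printed in ascending order
end program demo_qsort
```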

CUDA Fortran implementation of the 1D advection equation (example)

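Governing equation
• 1D advection equation
$$ \frac{\partial f}{\partial t} + c\,\frac{\partial f}{\partial x} = 0 $$
  t : time
  c : advection velocity
  x : spatial coordinate
• Spatial derivative
  • 2nd-order central difference
• Time integration
  • 1st-order Euler method
[Figure: the profiles f^n and f^{n+1} along x, displaced by the advection over one time step]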
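Combining the two discretizations gives the update used at interior grid points. This is a sketch in index notation that is assumed here, with $f_i^n$ denoting the value at grid point $i$ and time step $n$, and $\Delta x$, $\Delta t$ the grid spacing and time step:

```latex
\frac{f_i^{n+1} - f_i^{n}}{\Delta t} + c\,\frac{f_{i+1}^{n} - f_{i-1}^{n}}{2\Delta x} = 0
\quad\Longrightarrow\quad
f_i^{n+1} = f_i^{n} - c\,\Delta t\,\frac{f_{i+1}^{n} - f_{i-1}^{n}}{2\Delta x},
\qquad 1 < i < N_x
```

The two end points have no neighbor on one side and therefore need one-sided difference formulas.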

Development and execution environment
• Development environment
  • Microsoft Visual Studio Community 2013
  • PGI Accelerator Compiler 15.7 + CUDA 6.5
• Compiler options
  • -fast -Mcuda (-Mcuda only when compiling for the GPU)
• Execution environment
  • OS: Windows 8.1
  • CPU: Core i7 920 (2.66 GHz)
  • Memory: 6 GB
  • GPU: NVIDIA GTX Titan

Main routine
program main
  use parameters
  use kernel
  implicit none
  real(8),allocatable ::   f   (:)
  real(8),allocatable :: d_f_dx(:)
  integer :: n
  allocate(  f   (Nx))
  allocate(d_f_dx(Nx))
  call initialize(f)
  call output(f,"f_start.txt")
  do n=1,Nt
    call computeDifference(d_f_dx,f)
    call integrate(f,d_f_dx)
  end do
  call output(f,"f_end.txt")
  deallocate(  f   )
  deallocate(d_f_dx)
end program main

Module (computational parameters)
module parameters
  implicit none
  private
  public :: PI2, Lx, Nx, dx, dx2, conv, dt, Nt
  real(8),parameter :: PI  = 3.1415926535897932384626433832795d0
  real(8),parameter :: PI2 = 6.283185307179586476925286766559d0
  real(8),parameter :: Lx  = 1d0                 !domain length
  integer,parameter :: Nx  = 2**20               !number of grid points
  real(8),parameter :: dx  = Lx/dble(Nx-1)       !grid spacing
  real(8),parameter :: dx2 = Lx/dble(Nx-1)*2d0   !2*dx, denominator of the difference formulas
  real(8),parameter :: conv = 1d0                !advection velocity c
  real(8),parameter :: dt = 1d-5                 !time step
  real(8),parameter :: endT = 0.5d0              !end time
  integer,parameter :: Nt = int(endT/dt)         !number of time steps
end module parameters
Computational conditions
  Domain length        Lx = 1 m
  Number of divisions  Nx = 2^20 (maximum)
  Advection velocity   c  = 1 m/s
  Time step            Δt = 10^-5 s
  End time             t  = 0.5 s

Module (subroutines)
module kernel
  use parameters
  implicit none
contains
  !--------------------------------------!
  subroutine initialize(f)               !initialize the function values
    :
  end subroutine initialize
  !--------------------------------------!
  subroutine computeDifference(d_f_dx,f) !spatial derivative
    :
  end subroutine computeDifference
  !--------------------------------------!
  subroutine integrate(f,d_f_dx)         !time integration
    :
  end subroutine integrate
  !--------------------------------------!
  subroutine output(value,filename)      !file output
    :
  end subroutine output
  !--------------------------------------!
end module kernel

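Initialization of the function values
subroutine initialize(f)
  use parameters,only:Nx,dx,Lx,PI2
  implicit none
  real(8),intent(inout) :: f(Nx)
  integer :: i
  do i = 1,Nx
    f(i) = ((1d0-cos(PI2*dble(i-1)*dx/Lx))/2d0)**10
  end do
  !equivalent form using an array constructor with an implied do loop
  !f = (/ ( ((1d0-cos(PI2*dble(i-1)*dx/Lx))/2d0)**10, i=1,Nx ) /)
end subroutine initialize
• Initial profile
$$ f(x) = \left[\frac{1}{2}\left(1-\cos\frac{2\pi x}{L_x}\right)\right]^{10} $$
[Figure: initial profile f(x) for 0 ≤ x ≤ 1, with f between 0 and 1]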

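Spatial derivative
subroutine computeDifference(d_f_dx,f)
  use parameters,only:Nx,dx2
  implicit none
  real(8),intent(out) :: d_f_dx(Nx)
  real(8),intent(in)  :: f     (Nx)
  integer :: i
  !left boundary: 2nd-order one-sided difference
  i=1
  d_f_dx(i) = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  !interior points: 2nd-order central difference
  do i=2,Nx-1
    d_f_dx(i) = (f(i+1)-f(i-1))/dx2
  end do
  !right boundary: 2nd-order one-sided difference
  i=Nx
  d_f_dx(i) = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
end subroutine computeDifference
Computing the spatial derivative
$$ \left.\frac{df}{dx}\right|_i =
\begin{cases}
\dfrac{-3f_1 + 4f_2 - f_3}{2\Delta x} & (i = 1) \\
\dfrac{f_{i+1} - f_{i-1}}{2\Delta x} & (1 < i < N_x) \\
\dfrac{3f_{N_x} - 4f_{N_x-1} + f_{N_x-2}}{2\Delta x} & (i = N_x)
\end{cases} $$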

Time integration
subroutine integrate(f,d_f_dx)
  use parameters,only:Nx,conv,dt
  implicit none
  real(8),intent(inout) :: f     (Nx)
  real(8),intent(in )   :: d_f_dx(Nx)
  integer :: i
  !integration with the 1st-order Euler method
  do i = 1,Nx
    f(i) = f(i) - conv*dt*d_f_dx(i)
  end do
  !equivalent form using array operations
  !f = f - conv*dt*d_f_dx
end subroutine integrate
$$ f^{n+1} = f^{n} + \Delta t\,\frac{df}{dt} = f^{n} - c\,\Delta t\,\frac{df}{dx} $$

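File output
• Automatic reallocation on assignment
  • The shape of an allocatable array is adjusted automatically to the size of the array assigned to it
• Deferred-length character strings
  • The string length is declared with a colon (:)
  • character(:),allocatable :: variable_name
  • Use an asterisk (*) when receiving the string as a dummy argument
subroutine output(value,filename)
  use parameters
  implicit none
  real(8),intent(in) :: value(Nx)
  !filename is received as an assumed-length string
  character(*),intent(in) :: filename
  integer :: i
  open(unit=100,file=filename)
  do i=1,Nx
    write(100,*) (i-1)*dx, value(i)
  end do
  close(100)
end subroutine output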
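The two features listed on this slide can be tried in isolation with a small program. It is a sketch with hypothetical names (demo_realloc, show) and assumes a compiler with Fortran 2003 automatic reallocation enabled, which is the default in current compilers.

```fortran
program demo_realloc
  implicit none
  real(8), allocatable :: a(:)
  character(:), allocatable :: name
  integer :: i

  a = (/ 1d0, 2d0, 3d0 /)             ! a is allocated with 3 elements
  print *, size(a)
  a = (/ (dble(i), i = 1, 5) /)       ! a is reallocated to 5 elements
  print *, size(a)

  name = "f_start.txt"                ! the length becomes 11
  print *, len(name)
  name = "f_end.txt"                  ! the length becomes 9
  print *, len(name)
  call show(name)

contains

  subroutine show(filename)
    ! assumed length: taken from the actual argument
    character(*), intent(in) :: filename
    print *, "file: ", filename, " length: ", len(filename)
  end subroutine show

end program demo_realloc
```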

Results
• f is advected in the +x direction at the constant velocity c
[Figure: f versus x (0 ≤ x ≤ 1) at t = 0, 0.1, 0.2, 0.3, 0.4, 0.5]

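Porting to the GPU
• Somewhat simpler than CUDA C
  • If error handling is left aside, the number of changes can be kept small
  • GPU control is hidden, so one can concentrate on the numerics
• Memory handling in C and in Fortran
  • C is pointer-based
    • Host variables† and device variables‡ are distinguished by switching the memory allocation function
  • Fortran is variable-based
    • Host and device variables are distinguished by adding an attribute to the variable
    • The explicit switch between functions is hidden
†Ordinary variables allocated in CPU memory
‡Variables allocated in GPU memory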

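Porting to the GPU
• Change the file extension to .cuf
• Reflect the requirements of the GPU
  • Add attributes(global) to the subroutine
  • Add <<<:,:>>> between the subroutine name and its arguments
    • This specifies the degree of parallelism at run time
  • The subroutine describes the work of a single thread
• Add the device attribute to memory used by the GPU
  • The assignment operator (=) can be used to transfer data to and from the GPU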
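The porting steps can be seen together in a minimal CUDA Fortran sketch. The kernel scaleArray, the array size, and the launch configuration are assumptions made for this illustration and are not part of the advection code. The file would need the .cuf extension and the -Mcuda option named earlier.

```fortran
module scale_kernel
  use cudafor
  implicit none
contains
  ! attributes(global) marks a kernel; the body is the work of one thread
  attributes(global) subroutine scaleArray(a, s, n)
    integer, value :: n
    real(8), value :: s
    real(8), device :: a(n)
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= n) a(i) = s*a(i)      ! guard threads beyond the end of the array
  end subroutine scaleArray
end module scale_kernel

program demo_port
  use cudafor
  use scale_kernel
  implicit none
  integer, parameter :: n = 1000
  real(8), allocatable         :: h(:)   ! host variable
  real(8), allocatable, device :: d(:)   ! device variable

  allocate(h(n))
  allocate(d(n))
  h = 1d0
  d = h                                           ! host -> device by assignment
  call scaleArray<<<(n+255)/256, 256>>>(d, 2d0, n) ! blocks, threads per block
  h = d                                           ! device -> host by assignment
  print *, maxval(abs(h - 2d0))                   ! expected to be zero
  deallocate(h)
  deallocate(d)
end program demo_port
```

Rounding the block count up and guarding with i <= n is one common way to handle sizes that are not a multiple of the block size; the advection code in these slides instead relies on Nx being a power of two.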

Main routine (GPU version)
program main
  use cudafor
  use parameters
  use kernel                              !the module itself is rewritten in place
  implicit none
  real(8),allocatable,device ::   f   (:) !the device attribute makes these device variables
  real(8),allocatable,device :: d_f_dx(:) !
  integer :: n
  allocate(  f   (Nx)) !allocation is unchanged
  allocate(d_f_dx(Nx)) !
  call initialize<<<Blocks,Threads>>>(f) !specify the degree of parallelism at launch
  do n=1,Nt
    call computeDifference<<<Blocks,Threads>>>(d_f_dx,f)
    call integrate<<<Blocks,Threads>>>(f,d_f_dx)
  end do
  deallocate(  f   ) !deallocation is unchanged
  deallocate(d_f_dx) !
end program main

Module (GPU subroutines)
module kernel
  use cudafor
  use parameters,only:Nx
  implicit none
  type(dim3),parameter :: Threads = dim3(min(Nx,256) ,1,1)
  type(dim3),parameter :: Blocks  = dim3(Nx/Threads%x,1,1)
contains
  !--------------------------------------!
    :
  !--------------------------------------!
end module kernel
Parameters of the derived type dim3 are declared for specifying the degree of parallelism when a kernel† is launched.
By listing the values of all its components, the derived type name (here dim3) can be used as a constructor.
†Generic term for subroutines executed on the GPU

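Initialization of the function values (GPU version)
attributes(global)&      !attributes(global) marks the subroutine as
subroutine initialize(f) !a kernel to be executed on the GPU
  use parameters,only:Nx,dx,Lx,PI2
  implicit none
  real(8),device,intent(inout) :: f(Nx)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x !map GPU threads to array indices
  !the do loop of the CPU version is replaced by one thread per element
  f(i) = ((1d0-cos(PI2*dble(i-1)*dx/Lx))/2d0)**10
end subroutine initialize
$$ f(x) = \left[\frac{1}{2}\left(1-\cos\frac{2\pi x}{L_x}\right)\right]^{10} $$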

Spatial derivative (GPU version)
attributes(global) subroutine computeDifference(d_f_dx,f)
  use parameters,only:Nx,dx2
  implicit none
  real(8),device,intent(out) :: d_f_dx(Nx)
  real(8),device,intent(in)  :: f     (Nx)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if(i==1)then
    d_f_dx(i) = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  else if(1<i .and. i<Nx)then
    d_f_dx(i) = (f(i+1)-f(i-1))/dx2
  else if(i==Nx)then
    d_f_dx(i) = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
  end if
end subroutine computeDifference

Time integration (GPU version)
attributes(global) subroutine integrate(f,d_f_dx)
  use parameters,only:Nx,conv,dt
  implicit none
  real(8),device,intent(inout) :: f     (Nx)
  real(8),device,intent(in)    :: d_f_dx(Nx)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x !one thread updates one element
  f(i) = f(i) - dt*conv*d_f_dx(i)
end subroutine integrate

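File output (GPU version)
subroutine output(value,filename)
  use parameters
  implicit none
  real(8),device,intent(in) :: value(Nx)
  character(*),intent(in) :: filename
  real(8),allocatable :: host_value(:)
  integer :: i
  allocate(host_value, source = value)
  open(unit=100,file=filename)
  do i=1,Nx
    write(100,*) (i-1)*dx, host_value(i)
  end do
  close(100)
  deallocate(host_value)
end subroutine output
Creating a clone of a variable with the source specifier
This single allocate statement
• checks the size of the array value
• allocates host_value
• copies the data (GPU→CPU)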

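File output (GPU version)
subroutine output(value,filename)
  use parameters
  implicit none
  real(8),device,intent(in) :: value(Nx)
  character(*),intent(in) :: filename
  real(8),allocatable,save :: host_value(:)
  integer :: i
  if(.not.allocated(host_value)) allocate(host_value(Nx))
  host_value = value
  open(unit=100,file=filename)
  do i=1,Nx
    write(100,*) (i-1)*dx, host_value(i)
  end do
  close(100)
end subroutine output
The save attribute keeps the allocation status (allocated or not) until the program ends.
The function allocated() checks that status, and allocate is executed only when the array is not yet allocated.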

Results (execution time per time step)
Array size Nx   Execution time [ms]
                CPU       GPU
2^10            0.0190    0.120
2^12            0.0720    0.110
2^14            0.280     0.150
2^16            1.20      0.160
2^18            7.00      0.560
2^20            33.0      1.90
[Figure: execution time [ms] (log scale, 10^-2 to 10^2) versus array size Nx (2^10 to 2^20) for CPU and GPU]

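Coexistence of CPU code and GPU code
• The advection equation is a small problem with simple processing
  • The CPU code could be rewritten in place without being kept
• For larger codes
  • Port from the CPU code to the GPU gradually (subroutine by subroutine)
  • CPU and GPU code must coexist and be switchable
    • Append to the same source file as the CPU code
    • Create a new source file separate from the CPU code and write the GPU code there
  • Changes to parts not directly related to GPU use should be kept to a minimum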

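Appending to the same source file as the CPU code
• Change the extension of the CPU source file to .cuf
• Add the kernels
  • Naturally, a kernel name must differ from the existing subroutine name
• Procedure overloading
  • The procedure executed on the CPU and the kernel executed on the GPU are called by a common name
  • The procedure that is called depends on the argument (host variable or device variable)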

Module with the kernel appended
module kernel
  use cudafor
  use parameters
  implicit none
    : !define the information needed to launch kernels
  interface initialize
    module procedure initialize
    module procedure cufInitialize
  end interface
contains
  !--------------------------------------!
    :
  !--------------------------------------!
    :
  !--------------------------------------!
  attributes(global) subroutine cufInitialize(f)
    :
  end subroutine cufInitialize
  !--------------------------------------!
end module kernel
An interface is defined so that both procedures are overloaded under the name initialize.
The type or attributes of at least one argument must differ between the overloaded procedures.

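Switching the call by overloading
• Passing a host variable
real(8),allocatable :: f(:)
  :
call initialize(f) !the subroutine initialize is called
  :
• Passing a device variable
real(8),allocatable,device :: f(:)
  :
call initialize(f) !compile error
  :                !the kernel cufInitialize is selected, so the chevrons (<<<,>>>) are required
It would be very welcome if a default degree of parallelism could be defined, to be used when <<<,>>> is absent and overridden when it is present.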

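Writing in a source file separate from the CPU code
• Create a new file and write the kernels there
  • Procedures with the same name can be defined as long as they are in different modules
  • Using two modules that define procedures with the same name causes a name clash
    • The goal is to limit changes at the call site without distinguishing function names between the CPU and GPU versions
• Handle this by changing the local name
  • use module_name, local_name=>procedure_name_in_module
  • With the compiler used here, when several local names clash, the one read in later takes effect (the Fortran standard itself does not allow a reference to a name that denotes two different entities, so this should not be relied on across compilers)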
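Renaming on use can be tried with two stand-in modules that both export a procedure named initialize. The module names cpuKernel and altKernel and the local names initializeCPU and initializeAlt are hypothetical; both procedures run on the CPU so that the sketch compiles with any Fortran compiler. Giving each procedure its own local name is the standard-conforming way to keep both accessible.

```fortran
module cpuKernel
  implicit none
contains
  subroutine initialize(f)
    real(8), intent(out) :: f(:)
    f = 0d0
  end subroutine initialize
end module cpuKernel

module altKernel
  implicit none
contains
  subroutine initialize(f)
    real(8), intent(out) :: f(:)
    f = 1d0
  end subroutine initialize
end module altKernel

program demo_rename
  ! local_name => name_in_module; the two procedures no longer clash
  use cpuKernel, only: initializeCPU => initialize
  use altKernel, only: initializeAlt => initialize
  implicit none
  real(8) :: f(4)
  call initializeCPU(f)
  print *, f            ! all elements set by cpuKernel's initialize
  call initializeAlt(f)
  print *, f            ! all elements set by altKernel's initialize
end program demo_rename
```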

CPU and GPU versions of the module
CPU version
module kernel
  use parameters
  implicit none
    : !define the parameters needed for execution
contains
  !---------------------------------!
    :
    :
  !---------------------------------!
    :
end module kernel
GPU version
module cufKernel
  use cudafor
  use parameters
  implicit none
    : !define the parameters needed for execution
contains
  !---------------------------------!
  attributes(global)&
    :
  !---------------------------------!
    :
end module cufKernel

Calling procedures through renamed local names
program main
  use parameters
  use kernel   ,initializeKernel=>initialize
  use cufKernel,initializeKernel=>initialize
  implicit none
  real(8),allocatable,device ::   f   (:)
  real(8),allocatable,device :: d_f_dx(:)
  integer :: n
  allocate(  f   (Nx))
  allocate(d_f_dx(Nx))
  call initializeKernel<<<Blocks,Threads>>>(f)
  do n=1,Nt
    call computeDifference<<<Blocks,Threads>>>(d_f_dx,f)
    call integrate<<<Blocks,Threads>>>(f,d_f_dx)
  end do
  deallocate(  f   )
  deallocate(d_f_dx)
end program main
Setting a local name changes the procedure name used at the call site.
With the compiler used here, when local names are duplicated, the one defined later takes effect.

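Avoiding array copies by using pointers
program main
  use parameters
  use kernel
  implicit none
  real(8),allocatable ::   f   (:)
  real(8),allocatable :: d_f_dx(:)
  integer :: n
  allocate(  f   (Nx))
  allocate(d_f_dx(Nx))
  call initialize(f)
  do n=1,Nt
    call computeDifference(d_f_dx,f)
    call integrate(f,d_f_dx)
  end do
  deallocate(  f   )
  deallocate(d_f_dx)
end program main
This is inefficient: the derivative is written to the variable d_f_dx and then read back during the integration.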

Fusing differentiation and integration
subroutine computeDifferenceAndIntegrate(fnew,f)
  use parameters
  implicit none
  real(8),intent(out) :: fnew(Nx)
  real(8),intent(in ) :: f   (Nx)
  real(8) :: d_f_dx
  integer :: i
  i=1
  d_f_dx  = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  fnew(i) = f(i) - conv*dt*d_f_dx
  do i=2,Nx-1
    d_f_dx  = (f(i+1)-f(i-1))/dx2
    fnew(i) = f(i) - conv*dt*d_f_dx
  end do
  i=Nx
  d_f_dx  = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
  fnew(i) = f(i) - conv*dt*d_f_dx
end subroutine computeDifferenceAndIntegrate
The derivative is computed and used immediately for the integration.
Writing the result into f would corrupt the derivative at neighboring points, so a variable fnew is added to hold the updated values.

Fusing differentiation and integration
program main
  use parameters
  use kernel
  implicit none
  real(8),allocatable :: f   (:) !values at time step n
  real(8),allocatable :: fnew(:) !values at time step n+1
  integer :: n
  allocate(f   (Nx))
  allocate(fnew(Nx))
  call initialize(f)
  do n=1,Nt
    call computeDifferenceAndIntegrate(fnew, f)
    f = fnew
  end do
  deallocate(f   )
  deallocate(fnew)
end program main
The values of fnew are copied to f in preparation for the next time step.
If the addresses of the two arrays could be swapped, copying all array elements could be avoided.

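Fortran pointers
• Declared with the pointer attribute
• Use => to associate a pointer variable with a variable
  • The variable to be associated needs the target attribute
  • The type and attributes of the pointer must match those of the target strictly
  • Once associated, the pointer can be used like an ordinary variable
• To point to a device variable, declare the pointer with the device,pointer attributes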

program main
  use parameters
  use kernel
  implicit none
  real(8),allocatable,target :: f   (:)
  real(8),allocatable,target :: fnew(:)
  integer :: n
  real(8),dimension(:),pointer :: ptr_fcrnt
  real(8),dimension(:),pointer :: ptr_fnew
  real(8),dimension(:),pointer :: ptr_swap
  allocate(f   (Nx))
  allocate(fnew(Nx))
  ptr_fcrnt => f
  ptr_fnew  => fnew
  ptr_swap  => null()
Declaring real(8),pointer :: ptr_fcrnt(:) likewise gives a "pointer to a 1D real array", not an "array of pointers".

  call initialize(ptr_fcrnt)
  call output(ptr_fcrnt,"f_start.txt")
  do n=1,Nt
    call computeDifferenceAndIntegrate(ptr_fnew, ptr_fcrnt)
    !swap the pointers instead of copying the arrays
    ptr_swap  => ptr_fcrnt
    ptr_fcrnt => ptr_fnew
    ptr_fnew  => ptr_swap
  end do
  call output(ptr_fcrnt,"f_end.txt")
  ptr_fcrnt => null()
  ptr_fnew  => null()
  ptr_swap  => null()
  deallocate(f   )
  deallocate(fnew)
end program main
The pointers are passed as procedure arguments.
A Fortran pointer variable can be used both as a pointer and as an array.

(GPU version)
program main
  use cudafor
  use parameters
  use kernel
  implicit none
  real(8),allocatable,device,target :: f   (:)
  real(8),allocatable,device,target :: fnew(:)
  integer :: n
  real(8),dimension(:),device,pointer :: ptr_fcrnt
  real(8),dimension(:),device,pointer :: ptr_fnew
  real(8),dimension(:),device,pointer :: ptr_swap
  allocate(f   (Nx))
  allocate(fnew(Nx))
  ptr_fcrnt => f
  ptr_fnew  => fnew
  ptr_swap  => null()
A pointer declared to point to device variables can be associated with a device variable and can also be used as an array.

(GPU version)
  call initialize<<<Blocks,Threads>>>(ptr_fcrnt)
  do n=1,Nt
    call computeDifferenceAndIntegrate<<<Blocks,Threads>>> &
         (ptr_fnew, ptr_fcrnt)
    ptr_swap  => ptr_fcrnt
    ptr_fcrnt => ptr_fnew
    ptr_fnew  => ptr_swap
  end do
  ptr_fcrnt => null()
  ptr_fnew  => null()
  ptr_swap  => null()
  deallocate(f   )
  deallocate(fnew)
end program main

Fusing differentiation and integration (GPU version)
attributes(global) subroutine computeDifferenceAndIntegrate(fnew,f)
  use parameters
  implicit none
  real(8),device,intent(out) :: fnew(Nx)
  real(8),device,intent(in ) :: f   (Nx)
  real(8) :: d_f_dx
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  !compute the spatial derivative
  if(i==1)then
    d_f_dx = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  else if(1<i .and. i<Nx)then
    d_f_dx = (f(i+1)-f(i-1))/dx2
  else if(i==Nx)then
    d_f_dx = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
  end if
  fnew(i) = f(i) - conv*dt*d_f_dx !time integration
end subroutine computeDifferenceAndIntegrate

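Execution time per time step (pointer version)
Array size Nx   Execution time [ms]
                CPU       GPU
2^10            0.0140    0.0740
2^12            0.0550    0.0600
2^14            0.220     0.0600
2^16            0.800     0.0840
2^18            4.40      0.290
2^20            14.0      0.970
[Figure: execution time [ms] (log scale, 10^-2 to 10^2) versus array size Nx (2^10 to 2^20); CPU and GPU curves for the naive implementation and for the pointer version]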

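A pitfall when handling arrays inside procedures
• In Modern Fortran the number of elements of a dummy array need not be specified
  • Extent written as (:): the array size is taken from the actual argument
  • Extent written as (*): the number of elements is unknown
attributes(global) subroutine integrate(f,d_f_dx)
  implicit none
  real(8),device,intent(inout) :: f     (:) !the element count Nx is not used inside the kernel,
  real(8),device,intent(in)    :: d_f_dx(:) !so it can be left unspecified
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  f(i) = f(i) - dt*conv*d_f_dx(i)
end subroutine integrate
Nothing looks wrong here, but...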

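Result of declaring the dummy array extents as (:)
[Figure: execution time [ms] (log scale, 10^-2 to 10^2) versus array size Nx (2^10 to 2^20); CPU and GPU curves for the naive implementation and for the version with (:) extents]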

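Result of declaring the dummy array extents as (:)
• CPU (Fortran 90/95/2003)
  • Execution speed is essentially unaffected
  • It tends to be slightly slower, although it can be faster for small arrays
• GPU (CUDA Fortran)
  • Execution speed drops markedly
  • Execution time is nearly constant regardless of problem size
  • What is the cause?
• Even when a Modern Fortran feature is available on the GPU, one has to check that it does not affect execution speed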

Limiting the scope of porting by introducing object-oriented programming

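Object-oriented programming
• Expressing the behavior of things that exist in the world
  • Dogs and cats are both mammals, and so on
• Objects interact with one another by exchanging messages, and so on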

Object-oriented programming in Fortran
• One programming methodology
  • Related data and processing are handled together
  • Handling them together brings various benefits
• Subroutines are added to a derived type (type)

Terminology
Fortran                            Java      C++
derived type                       class     class
component                          field     data member
type-bound procedure               method    virtual member function
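The three Fortran terms in the table can be seen in one small type. This is a sketch with hypothetical names (class_vec2, vec2, norm, add) that are not part of the advection code.

```fortran
module class_vec2
  implicit none
  private
  public :: vec2

  type :: vec2                          ! derived type (a "class")
    real(8) :: x = 0d0, y = 0d0         ! components (fields)
  contains
    procedure, public, pass :: norm     ! type-bound procedures (methods)
    procedure, public, pass :: add
    generic :: operator(+) => add       ! a + b resolves to add
  end type vec2

contains

  function norm(this) result(r)
    class(vec2), intent(in) :: this
    real(8) :: r
    r = sqrt(this%x**2 + this%y**2)
  end function norm

  function add(lhs, rhs) result(res)
    class(vec2), intent(in) :: lhs, rhs
    type(vec2) :: res
    res%x = lhs%x + rhs%x
    res%y = lhs%y + rhs%y
  end function add

end module class_vec2

program demo_oop
  use class_vec2
  implicit none
  type(vec2) :: a, b, c
  a = vec2(3d0, 0d0)       ! structure constructor
  b = vec2(0d0, 4d0)
  c = a + b                ! calls add through the generic operator
  print *, c%norm()        ! length of (3,4)
end program demo_oop
```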

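Numerical computation with object-oriented programming
• Object-oriented programming is costly
  • More operations, more memory, and longer run time than procedural programming
• Maintainability, extensibility, and reusability are valued even at the cost of some redundancy
  • Eliminate human error when writing programs
  • Incorporate FORTRAN 77 style programs
  • Extend existing programs seamlessly
  • The aim is a framework that integrates programs that have been shelved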

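Making the advection calculation object-oriented
• Procedural
real(8) :: f(Nx), d_f_dx(Nx)              !the quantity and its derivative are declared separately
do n=1, n_end
  call computeDifference(f, d_f_dx, Nx)   !the derivative values and the derivative computation are separate
  f(:) = f(:) - dt*c*d_f_dx(:)
end do
• Object-oriented programming (the way we would like to write it)
type(Field) :: f                          !a derived type bundling the quantity, its derivative, and the derivative computation
do n=1, n_end
  f = f - dt*c*f%x()                      !resembles the formula as written in textbooks
end do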

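One way of looking at the advection equation
• A field is made up of several physical quantities
• The value of a physical quantity and its derivatives are inseparable
• Differentiation is an operation on each physical quantity
• Integration is an operation on the field (all physical quantities)
$$ \frac{\partial f}{\partial t} + c\,\frac{\partial f}{\partial x} = 0 $$
[Figure: a field contains physical quantities (f and others); each physical quantity holds a value and derivative values]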

Definition of the array type that holds values
type :: array
  real(8),allocatable,private :: array(:)
contains
  procedure,public,pass :: construct         !allocate the component array
  procedure,public,pass :: destruct          !deallocate the component array
  procedure,public,pass :: all               !return a pointer to the component array
  procedure,public,pass :: getPointer        !return a pointer to the object itself
  procedure,public,pass :: assign
  procedure,public,pass :: add
  procedure,public,pass :: multiplyScalar
  procedure,public,pass :: divideScalar
  generic :: assignment(=) =>   assign
  generic ::   operator(+) =>      add
  generic ::   operator(*) => multiplyScalar
  generic ::   operator(/) =>   divideScalar
end type array
The arithmetic operations are defined by operator overloading.
Operators required
  = assignment of arrays
  + addition of two arrays
  * multiplication of an array by a scalar
  / division of an array by a scalar
The component cannot be given the target attribute. The pointer attribute can be given, but its behavior is unreliable.

!procedure that performs assignment:
!called when the assignment operator (=) is applied to variables of type array
subroutine assign(lhs,rhs)
  implicit none
  class(array),intent(inout) :: lhs
  class(array),intent(in   ) :: rhs
  lhs%array(:) = rhs%array(:)
end subroutine assign
!procedure that performs addition:
!called when the addition operator (+) is written between two objects of type array
function add(term1,term2) result(sum)
  use parameters
  implicit none
  class(array),intent(in)  :: term1
  class(array),intent(in)  :: term2
  class(array),allocatable :: sum
  allocate(sum)
  call sum%construct(Nx)
  sum%array(:) = term1%array(:)+term2%array(:)
end function add

!return a pointer to the component array (a real array) of the derived type array.
!a component of a derived type cannot have the target attribute, so a C pointer
!is created first and then converted to a Fortran pointer
!  c_loc        obtains the address of a variable
!  c_ptr        constructor of the C pointer type type(c_ptr)
!  c_f_pointer  converts a C pointer to a Fortran pointer
function all(this) result(realPtr)
  use iso_c_binding
  use parameters
  implicit none
  class(array),intent(in) :: this
  real(8),dimension(:),pointer :: realPtr
  call c_f_pointer( c_ptr(c_loc(this%array)),&
                    realPtr, (/Nx/) )
end function all
!return a pointer to the array object on which getPointer was invoked
function getPointer(this) result(ptr)
  implicit none
  class(array),intent(in),target :: this
  type(array),pointer :: ptr
  ptr=>this
end function getPointer
Note that this makes it possible to overwrite a variable that was hidden with private!

Definition of the ScalarVariable type representing a physical quantity
type :: ScalarVariable
  type(array),public :: value
  type(array),public :: d_v_dx
  type(array),public :: d_v_dt
  logical,public :: d_v_dxCalculated = .false.
  logical,public :: d_v_dtCalculated = .false.
  logical,public :: updated          = .false.
contains
  procedure,public,pass :: construct         !call the constructors of the value and the derivatives
  procedure,public,pass ::  destruct         !call the destructors of the value and the derivatives
  procedure,public,pass :: initialize
  procedure,public,pass :: x
  procedure,public,pass :: update
  procedure,public,pass :: addArray
  generic ::   operator(+) => addArray
end type ScalarVariable
The value of the physical quantity, its spatial derivative, and its time derivative are defined as components.
Initialization and spatial differentiation are defined as procedures on the physical quantity.

Definition of the ScalarVariable type representing a physical quantity
!initialization procedure:
!calls a procedure defined in another module so that the implementation can be switched easily
subroutine initialize(this)
  use kernel, initializeKernel=>initialize
  implicit none
  class(ScalarVariable) :: this
  call initializeKernel(this%value%all())
end subroutine initialize
!spatial-derivative procedure: calls a procedure of another module
function x(this) result(d_v_dx)
  use parameters,only:Nx
  use kernel,computeDifferenceKernel=>computeDifference
  implicit none
  class(ScalarVariable) :: this
  type(array),pointer :: d_v_dx
  call computeDifferenceKernel(this%d_v_dx%all(),this%value%all())
  d_v_dx => this%d_v_dx%getPointer()
  this%d_v_dxCalculated = .true.
  this%updated = .false.
end function x

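Definition of the Field type representing a field
type :: Field
  type(ScalarVariable),private :: f
contains
  procedure,public ,pass :: construct         !call the constructor of each physical quantity
  procedure,public ,pass ::  destruct         !call the destructor of each physical quantity
  procedure,public ,pass :: initialize
  procedure,private,pass :: x
  procedure,public ,pass :: t
  procedure,private,pass :: update
  procedure,public ,pass :: addArray
  generic ::   operator(+) => addArray
end type Field
The physical quantities are held as components (here only f).
Procedures that initialize the field and compute its spatial and time derivatives are defined.
In practice they call the initialization and spatial-derivative procedures of each physical quantity type.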

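Definition of the Field type representing a field
!time derivative of the field: this procedure expresses the advection equation
function t(this) result(d_f_dt)
  use parameters
  use class_array
  implicit none
  class(Field) :: this
  type(array),pointer :: d_f_dt
  this%f%d_v_dt = this%x()*(-conv)
  d_f_dt => this%f%d_v_dt%getPointer()
  this%f%d_v_dtCalculated = .true.
  this%f%updated = .false.
end function t
!spatial derivative of the field:
!calls the spatial-derivative procedure of each physical quantity
function x(this) result(d_v_dx)
  use class_array
  implicit none
  class(Field) :: this
  type(array),pointer :: d_v_dx
  d_v_dx=>this%f%x()
end function x
$$ \frac{\partial f}{\partial t} = -c\,\frac{\partial f}{\partial x} $$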

Main routine
program main
  use parameters
  use class_Field
  implicit none
  type(Field) :: f
  integer :: n
  call f%initialize()
  do n=1,Nt
    print *,n
    f = f + f%t()*dt
  end do
end program main
The update is written in the same form as the definition of the Euler method found in textbooks.

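Derived types and procedure calls
Field
  component:  physical quantity (ScalarVariable)
  procedures: field initialization, time-derivative computation, assignment operator, addition operator
ScalarVariable
  components: value, time derivative, spatial derivative (array)
  procedures: value initialization, assignment operator, addition operator
array
  component:  value
  procedures: value initialization, assignment operator, addition operator, multiplication operator, division operator
[Figure: the main routine (call f%initialize(); f = f + f%t()*dt) uses Field, which uses ScalarVariable, which uses array; arrows distinguish use of a type from procedure calls]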


Main routine (changed to the modified Euler method)
program main
  use parameters
  use class_Field
  implicit none
  type(Field) :: f
  type(Field) :: f05
  integer :: n
  call f%initialize()
  do n=1,Nt
    print *,n
    f05 = f + f%t()*dt
    f = f + (f%t()+f05%t())/2d0*dt
  end do
end program main
The time integration is changed to the modified Euler method.
The change is written in the same form as the textbook formula, without adding a single procedure.
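For reference, the two assignments in the loop correspond to the following two-stage formula. The symbols are assumptions introduced here: $F(f) = \partial f/\partial t = -c\,\partial f/\partial x$ stands for what f%t() returns, and $\tilde{f}$ stands for the intermediate field f05.

```latex
\tilde{f} = f^{n} + \Delta t\, F(f^{n}), \qquad
f^{n+1} = f^{n} + \frac{\Delta t}{2}\left[ F(f^{n}) + F(\tilde{f}) \right]
```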

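Porting to the GPU
• array type, which handles the numerical values
  • Create kernels that perform the arithmetic operations on the GPU
• ScalarVariable type, which represents a variable
  • Create kernels that perform initialization and differentiation on the GPU
  • Existing kernels can be reused
    • Reuse requires changing only two lines
  • In the first place, a kernel cannot be defined as a type-bound procedure
    • Whether this will become possible in the future is unknown
    • An external module always has to be called
• Field type, which represents the field
  • No changes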

array type (GPU version)
type :: array
  real(8),allocatable,private,device :: array(:)
contains
  procedure,public,pass :: construct         !allocate the component array
  procedure,public,pass ::  destruct         !deallocate the component array
  procedure,public,pass :: all               !return a pointer to the component array
  procedure,public,pass :: getPointer        !return a pointer to the object itself
  procedure,public,pass :: assign
  procedure,public,pass :: add
  procedure,public,pass :: multiplyScalar
  procedure,public,pass :: divideScalar
  generic :: assignment(=) =>   assign
  generic ::   operator(+) =>      add
  generic ::   operator(*) => multiplyScalar
  generic ::   operator(/) =>   divideScalar
end type array

Operators of the array type (GPU version)
!assignment is changed to cudaMemcpy
subroutine assign(lhs,rhs)
  implicit none
  class(array),intent(inout) :: lhs
  class(array),intent(in   ) :: rhs
  integer :: stat
  stat = cudaMemcpy(lhs%array,rhs%array,Nx,cudaMemcpyDeviceToDevice)
end subroutine assign
!a kernel that performs the addition is created and launched from
!the procedure that overloads the addition operator
function add(term1,term2) result(sum)
  use cufParameters,only:Blocks,Threads
  use arrayOperator,only:addKernel=>addArrayKernel
  implicit none
  class(array),intent(in)  :: term1
  class(array),intent(in)  :: term2
  class(array),allocatable :: sum
  allocate(sum)
  call sum%construct(Nx)
  call addKernel<<<Blocks, Threads>>>(sum%array,term1%array,term2%array)
end function add

Kernel that performs the addition
attributes(global) subroutine addArrayKernel(result,term1,term2)
  implicit none
  real(8),intent(out),device :: result(Nx)
  real(8),intent(in ),device :: term1(Nx)
  real(8),intent(in ),device :: term2(Nx)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  result(i) = term1(i) + term2(i)
end subroutine addArrayKernel

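Changes to the ScalarVariable type (GPU version)
!calls the existing initialization kernel (created on an earlier slide)
subroutine initialize(this)
  use cufKernel, only:initializeKernel=>cufinitialize
  use cufParameters
  implicit none
  class(ScalarVariable) :: this
  call initializeKernel<<<Blocks, Threads>>>(this%value%all())
end subroutine initialize
!calls the existing spatial-derivative kernel (created on an earlier slide)
function x(this) result(d_v_dx)
  use cufKernel,only:computeDifferenceKernel=>cufComputeDifference
  use cufParameters
  implicit none
  class(ScalarVariable) :: this
  type(array),pointer :: d_v_dx
  call computeDifferenceKernel<<<Blocks, Threads>>> &
       (this%d_v_dx%all(),this%value%all())
  d_v_dx => this%d_v_dx%getPointer()
    :
end function x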

Main routine (no changes)
program main
  use parameters
  use class_Field
  implicit none
  type(Field) :: f
  integer :: n
  call f%initialize()
  do n=1,Nt
    print *,n
    f = f + f%t()*dt
  end do
end program main

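[Figure: derived types and procedure calls, annotated with the changes made for the GPU]
Field
  component:  physical quantity
  procedures: field initialization, assignment operator, addition operator
ScalarVariable
  components: value, time derivative, spatial derivative
  procedures: value initialization, assignment operator, addition operator
array
  component:  value
  procedures: value initialization, assignment operator, addition operator, multiplication operator, division operator
Annotations
  • Kernels (addition, multiplication, division) are created to run on the GPU, or existing kernels are reused
  • The procedures are modified slightly so that they call the kernels
  • The device attribute is added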

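Execution time per time step (OOP version)
Array size Nx   Execution time [ms]
                CPU       GPU
2^10            0.190     3.10
2^12            0.800     2.50
2^14            3.90      2.70
2^16            19.0      8.00
2^18            98.0      17.0
2^20            415       36.5
[Figure: execution time [ms] (log scale, 10^-2 to 10^3) versus array size Nx (2^10 to 2^20); CPU and GPU curves for the pointer version and for the OOP version]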

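Conclusion
• Briefly introduced the features of Fortran 90/95/2003 (Modern Fortran)
• Used Modern Fortran features to compute the 1D advection equation and ported the code to the GPU
  • Most features usable in CPU code are also usable on the GPU
  • With CUDA Fortran the execution time can change markedly in some cases
• Wrote the 1D advection program with object-oriented programming and ported it to the GPU
  • The scope of the changes needed for GPU porting can be limited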

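Summary
• Extremely useful
  • Memory management
    • allocate/deallocate, pointer, the source specifier
    • When arrays are passed as arguments, care is needed in how the array extents are declared
• Useful in places
  • Subroutine overloading, local names (renaming on use)
  • Object-oriented programming (execution control, encapsulation)
• Not usable in practice
  • Object-oriented programming (defining operations on types)
    • Creating and destroying temporary objects is expensive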
