Towards Automatic Code Selection with ppOpen-AT: A Case of FDM - Variants of ... (Takahiro Katagiri)
In this study, we demonstrate a new auto-tuning (AT) capability: selecting among code variants based on entirely different implementations of the numerical computations. The selection function is carefully designed for ppOpen-AT, a computer language that adds AT functions to production simulation codes in the ppOpen-HPC project. The AT is evaluated with ppOpen-APPL/FDM (Seism_3D), a seismic-wave simulation code based on the Finite Difference Method (FDM). Performance evaluation on an advanced many-core processor, the Xeon Phi, shows that selection-based AT yields crucial speedups. Moreover, the best code variant varied with the parallel execution configuration, i.e., the number of MPI processes and OpenMP threads in hybrid MPI/OpenMP execution.
Extreme-Scale Parallel Symmetric Eigensolver for Very Small-Size Matrices Usi... (Takahiro Katagiri)
We have developed a parallel eigensolver for very small matrices. Unlike conventional solvers, our design policy focuses on non-blocking computation and reduced communication. A communication-avoiding approach for Householder pivot vectors is used to implement part of the Householder inverse transformation. In addition, we implement several techniques that reduce communication by using non-blocking communication in the tridiagonalization part. Performance of the solver on the full Fujitsu FX10 system (76,800 cores) is also presented.
In this report, we propose Static Code Generation Auto-tuning (SCG-AT), an AT software organization in which no dynamic code generation or compilation is performed at code-optimization time; only code generated statically before execution is used. To evaluate SCG-AT, we implemented hierarchical AT processing. In ppOpen-APPL/FDM, a seismic-wave simulation based on the finite difference method, we implemented code selection between the conventional vector-machine-oriented code and newly developed scalar-machine-oriented code. We evaluated SCG-AT code selection on three entirely different machines: the Xeon Phi, Ivy Bridge, and FX10. The evaluation shows that on the Xeon Phi and Ivy Bridge, selecting the scalar-machine-oriented code achieves speedups that conventional AT approaches cannot.
------
Notice on the use of this material: The copyright of this material belongs to the Information Processing Society of Japan (IPSJ). It is published here with the permission of the IPSJ, the copyright holder. Please comply with the Copyright Law of Japan and the IPSJ Code of Ethics when using it.
Notice for the use of this material: The copyright of this material is retained by the Information Processing Society of Japan (IPSJ). This material is published on this web site with the agreement of the author(s) and the IPSJ. Please comply with the Copyright Law of Japan and the Code of Ethics of the IPSJ if you wish to reproduce, make derivative works of, distribute, or make available to the public any part or the whole of this material.
All Rights Reserved, Copyright (C) Information Processing Society of Japan.
Comments are welcome; please mail them to editj@ipsj.or.jp.
In this research, we show the effect of auto-tuning (AT) with code selection applied to computational kernels for scientific and engineering computations. ppOpen-AT, a computer language for specifying AT functions on arbitrary parts of a program, is used to describe the code selection. The AT is evaluated on advanced CPU architectures, the Intel Xeon Phi and the Intel Ivy Bridge. Results of a preliminary experiment with a code based on the Finite Difference Method (FDM) indicate that the effect of AT is crucial compared to a conventional AT framework without code selection.
Auto-Tuning of Hierarchical Computations with ppOpen-AT (Takahiro Katagiri)
We are now developing ppOpen-AT, a directive-based auto-tuning (AT) language for specifying fundamental AT functions, i.e., varying parameter values, loop transformations, and code selection. Considering the hardware expected in the post-Moore era, we focus on optimizing computations for the deep hierarchy of 3D stacked memory. ppOpen-AT provides code selection to optimize code with respect to the memory layers. A performance evaluation of the AT with an FDM code on the Xeon Phi will be shown.
Impact of Auto-tuning of Kernel Loop Transformation by using ppOpen-AT (Takahiro Katagiri)
SPNS2013, December 5th-6th, 2013, Conference Room, 3F, Bldg. 1, Earthquake Research Institute (ERI), The University of Tokyo. December 6th, 2013, ppOpen-HPC and Automatic Tuning (Chair: Hideyuki Jitsumoto), 13:30-14:00.
September 29 - October 4, 2013, Dagstuhl Seminar 13401, Automatic Application Tuning for HPC Architectures, Session: infrastructures, 10:30-11:00, October 1st (Tue), 2013.
4. Background
• High-Thread Parallelism (HTP)
  – Multi-core and many-core processors are pervasive.
    ◦ Multicore CPUs: 16-32 cores, 32-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT)
    ◦ Many-core CPU: Xeon Phi, 60 cores, 240 threads with HT
  – Utilizing parallelism with the full thread count is important.
• Performance Portability (PP)
  ◦ Keeping high performance across multiple computing environments: not only multiple CPUs, but also multiple compilers. Run-time information, such as loop length and number of threads, is important.
  ◦ Auto-tuning (AT) is one candidate technology for establishing PP across multiple computing environments.
5. ppOpen-HPC Project
• Middleware for HPC and its AT technology
  – Supported by JST, CREST, from FY2011 to FY2015.
  – PI: Professor Kengo Nakajima (U. Tokyo)
• ppOpen-HPC
  – An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
  – Consists of various types of libraries, covering 5 kinds of discretization methods for scientific computations.
• ppOpen-AT
  – An auto-tuning language for ppOpen-HPC codes.
  – Uses knowledge from our previous project, the ABCLibScript Project.
  – An auto-tuning language based on directives for AT.
7. Design Philosophy of ppOpen-AT
• A Directive-based AT Language
  – Multiple regions can be specified.
  – Low-cost description with directives.
  – Does not prevent original code execution.
• AT Framework Without Script Languages or OS Daemons (Static Code Generation)
  – Simple: only a minimal software stack is required.
  – Useful: our targets are supercomputers in operation, or just after development.
  – Low overhead: we DO NOT use a code generator with run-time feedback.
  – Reasons:
    ◦ Supercomputer centers do not accept OS kernel modifications or daemons on login or compute nodes.
    ◦ Need to reduce additional load on the nodes.
    ◦ Environmental issues, such as batch job scripts, etc.
9. Scenario of AT for ppOpen-APPL/FDM
[Flow diagram, reconstructed as steps for the library user:]
1. Specify the problem size and the number of MPI processes and OpenMP threads.
2. Set the AT parameter and execute the library (OAT_AT_EXEC=1).
   ■ Execute the auto-tuner with fixed loop lengths (determined by the specified problem size and the number of MPI processes and OpenMP threads): measure timings of the target kernels and store the best candidate information.
3. Set the AT parameter and execute the library (OAT_AT_EXEC=0).
   ■ The stored fastest-kernel information is used: execution proceeds with the optimized kernels, without the AT process. The fastest kernels remain valid without re-tuning, except when the problem size or the number of MPI processes and OpenMP threads is varied.
20. Automatically Generated Codes for Kernel 1 (ppohFDM_update_stress)
#1 [Baseline]: original 3-nested loop
#2 [Split]: loop splitting on the K-loop (separated, two 3-nested loops)
#3 [Split]: loop splitting on the J-loop
#4 [Split]: loop splitting on the I-loop
#5 [Split&Fusion]: loop fusion of #1 for the K- and J-loops (2-nested loop)
#6 [Split&Fusion]: loop fusion of #2 for the K- and J-loops (2-nested loop)
#7 [Fusion]: loop fusion of #1 (loop collapse)
#8 [Split&Fusion]: loop fusion of #2 (loop collapse, two 1-nested loops)
30. ppohFDM_ps_pack (ppOpen-APPL/FDM)
• m_stress.f90 (ppohFDM_ps_pack)

!$omp parallel do private(k,j,iptr,i)
do k = 1, NZP
  iptr = (k-1)*3*NL2*NXP
  do j = 1, NL2
    do i = 1, NXP
      j1_sbuff(iptr+1) = SXY(i,NYP-NL2+j,k)
      j1_sbuff(iptr+2) = SYY(i,NYP-NL2+j,k)
      j1_sbuff(iptr+3) = SYZ(i,NYP-NL2+j,k)
      j2_sbuff(iptr+1) = SXY(i,j,k)
      j2_sbuff(iptr+2) = SYY(i,j,k)
      j2_sbuff(iptr+3) = SYZ(i,j,k)
      iptr = iptr + 3
    end do
  end do
end do
!$omp end parallel do

!$omp parallel do private(k,j,iptr,i)
do k = 1, NL2
  iptr = (k-1)*3*NYP*NXP
  do j = 1, NYP
    do i = 1, NXP
      k1_sbuff(iptr+1) = SXZ(i,j,NZP-NL2+k)
      k1_sbuff(iptr+2) = SYZ(i,j,NZP-NL2+k)
      k1_sbuff(iptr+3) = SZZ(i,j,NZP-NL2+k)
      k2_sbuff(iptr+1) = SXZ(i,j,k)
      k2_sbuff(iptr+2) = SYZ(i,j,k)
      k2_sbuff(iptr+3) = SZZ(i,j,k)
      iptr = iptr + 3
    end do
  end do
end do
!$omp end parallel do
!OAT$ install LoopFusion region end
31. Machine Environment (8 nodes of the Xeon Phi cluster)
• The Intel Xeon Phi
  – Xeon Phi 5110P (1.053 GHz), 60 cores
  – Memory amount: 8 GB (GDDR5)
  – Theoretical peak performance: 1.01 TFLOPS
  – One board per node of the Xeon Phi cluster
• InfiniBand FDR x 2 ports
  – Mellanox Connect-IB
  – PCI-E Gen3 x16, 56 Gbps x 2
  – Theoretical peak bandwidth: 13.6 GB/s
  – Full bisection
• Intel MPI
  – Based on MPICH2, MVAPICH2
  – Version 4.1 Update 3 (build 048)
• Compiler: Intel Fortran version 14.0.0.080 Build 20130728
• Compiler options:
  -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmic -align array64byte
• KMP_AFFINITY=granularity=fine,balanced (uniform distribution of threads between sockets)
32. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Number of time steps: 2000
• Number of nodes: 8
• Native mode execution
• Target problem size (almost the maximum with 8 GB/node):
  – NX * NY * NZ = 1536 x 768 x 240 / 8 nodes
  – NX * NY * NZ = 768 x 384 x 120 / node (not per MPI process)
• Number of iterations for kernels during auto-tuning: 100
33. Execution Details of Hybrid MPI/OpenMP
• Target MPI processes and OpenMP threads on the Xeon Phi
  – The Xeon Phi with 4-way HT (Hyper-Threading)
  – PX TY: X MPI processes and Y threads per process
  – P8T240: the minimum hybrid MPI/OpenMP execution for ppOpen-APPL/FDM, since it needs at least 8 MPI processes
  – P16T120
  – P32T60
  – P64T30
  – P128T15
  – P240T8
  – P480T4
  – Configurations of P960T2 and fewer threads per process cause an MPI error in this environment.
[Diagram: mapping of MPI processes to cores #0-#15, illustrating the P2T8 and P4T4 configurations; shading indicates the target cores for one MPI process.]
38. Maximum Speedups by AT (Xeon Phi, 8 Nodes)
[Bar chart: speedup [%] by kind of kernel; maximum speedups of 558%, 200%, 171%, 30%, 20%, and 51%.]
Speedup = max(execution time of original code / execution time with AT) over all combinations of hybrid MPI/OpenMP executions (PX TY).
NX * NY * NZ = 1536 x 768 x 240 / 8 nodes.
47. Auto-Tuning Time
[Two bar charts: AT time [seconds] on the Xeon Phi, Ivy Bridge, and FX10, with AT overhead relative to total time (2000 time steps):]
• update_stress: Xeon Phi 553 s (10.9%), Ivy Bridge 303 s (8.3%), FX10 320 s (10.1%)
• update_vel: Xeon Phi 426 s (8.4%), Ivy Bridge 189 s (5.2%), FX10 185 s (5.8%)
(Speaker note: it is unclear why the number of kernels differs; a different version, perhaps.)
57. Future Work
• Improving the search algorithm
  – We use a brute-force search in the current implementation.
    ◦ This is sufficient when knowledge of the application is applied.
  – We have implemented a new search algorithm.
    ◦ A d-Spline-based search algorithm (a performance model), in collaboration with Prof. Tanaka (Kogakuin U.)
• Code selection between different architectures
  – Selection between vector-machine and scalar-machine implementations.
  – Whole-code selection in the main routine is needed.
• Off-loading implementation selection (for the Xeon Phi)
  – If the problem size is too small for off-loading, the target execution is performed on the CPU automatically.