Analysis and Optimization
of CGPOP
Hongtao Cai, Xiaoxiang Hu, Haoruo Peng
Department of CST, Tsinghua University
SIAM Annual Meeting, July 9, 2012
Acknowledgment
• Prof. Xiaoge Wang, Prof. Wei Xue
• Support from the State 863 Project Fund
• Support from the Explore-100, Tianhe-1A, and Shenwei supercomputer systems
• Support from SIAM
Outline
• Background
• Research
  • Analysis of original PCG Method in CGPOP
  • Optimizations:
    • Chebyshev
    • Richardson-PCG
    • Richardson-Chebyshev
• Experiments
• Future Work
Parallel Ocean Program
• The crucial role of oceans in global climate
  • Cover 70% of the Earth's surface
  • Water has 1000 times the heat capacity of air
  • Repository of carbon (93%)
  • Transport heat
• POP: surface pressure of oceans [1]
Conjugate Gradient Parallel Ocean Program (CGPOP)
• Three computation parts: Barotropic, 3D-update, Baroclinic
• Barotropic computation dominates when the core count exceeds 10,000 [2]
• CGPOP contains the core part of the Barotropic computation
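Conjugate Gradient Parallel Ocean Program (CGPOP)
• Linear equation system in every time step:
$$\left[\nabla \cdot H \nabla - \frac{1}{g \alpha \tau \Delta t}\right] \eta^{n+1} = \nabla \cdot H \left[\frac{U}{g \alpha \tau} + \nabla \eta^{n-1}\right] - \frac{\eta^{n}}{g \alpha \tau \Delta t} - \frac{q_W^{n}}{g \alpha \tau}$$
$$Ax = b$$
• (A is a real, sparse, symmetric, positive-definite matrix)
• Our work: exploring new algorithms in CGPOP, with experiments on top supercomputers in the world.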
Chron-Gear Preconditioned Conjugate Gradient Solver
[Figure: algorithm listing; legend: matrix-vector multiplication, dot product, daxpy, communication]
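The solver named on this slide can be sketched as follows. This is a minimal pure-Python illustration of the Chronopoulos-Gear arrangement of PCG as it is commonly described: the two inner products of an iteration are computed back to back, so a parallel code could combine them into a single global reduction. The diagonal (Jacobi) preconditioner, the dense list-of-lists matrix, and every function name here are assumptions made for illustration; none of them is taken from CGPOP.

```python
def matvec(A, x):
    """Dense matrix-vector product (a parallel stencil code would need neighbor communication here)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product (a parallel code would need a global reduction here)."""
    return sum(a * b for a, b in zip(u, v))


def chron_gear_pcg(A, b, tol=1e-10, max_iter=500):
    """Solve A x = b for symmetric positive-definite A. Illustrative sketch only."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # assumed Jacobi preconditioner
    x = [0.0] * n
    r = list(b)                                   # residual for the initial guess x = 0
    s = [0.0] * n                                 # search direction
    p = [0.0] * n                                 # p tracks A s through a recurrence
    rho_old, sigma = 1.0, 0.0
    stop = tol * dot(b, b) ** 0.5
    if dot(r, r) ** 0.5 <= stop:                  # trivial right-hand side
        return x, 0
    for it in range(1, max_iter + 1):
        u = [d * ri for d, ri in zip(inv_diag, r)]    # apply the preconditioner
        z = matvec(A, u)
        rho, delta = dot(r, u), dot(z, u)             # adjacent, so they can share one reduction
        beta = rho / rho_old
        sigma = delta - beta * beta * sigma           # equals s . (A s) for the updated s
        alpha = rho / sigma
        rho_old = rho
        s = [ui + beta * si for ui, si in zip(u, s)]
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        x = [xi + alpha * si for xi, si in zip(x, s)]
        r = [ri - alpha * pi for ri, pi in zip(r, p)]
        if dot(r, r) ** 0.5 <= stop:                  # convergence check on the residual norm
            return x, it
    return x, max_iter
```

Calling `chron_gear_pcg(A, b)` returns the approximate solution together with the number of iterations used.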
PCG Solver (on Shenwei Supercomputer)
[Figure: percentage of time consumed by dot product]
Three Variants
• 1S1D / 2S2D / 2S1D
• 1-sided MPI: put/get
• 2-sided MPI: send/receive
• 2D: direct data access, more memory
• 1D: ocean points stored compactly; less memory, indirect data access
[Figure: 2D and 1D data layouts]
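The 2D versus 1D trade-off above can be illustrated with a small sketch. It assumes a boolean land/ocean mask and a four-neighbor stencil purely for brevity; the function names and the data layout are hypothetical and are not CGPOP's actual structures.

```python
OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed four-neighbor stencil


def neighbor_sum_2d(field, mask, i, j):
    """2D layout: neighbors are reached directly by (i, j) arithmetic; land cells still occupy memory."""
    total = 0.0
    for di, dj in OFFSETS:
        a, b = i + di, j + dj
        if 0 <= a < len(mask) and 0 <= b < len(mask[0]) and mask[a][b]:
            total += field[a][b]
    return total


def compact(field, mask):
    """1D layout: keep only the ocean points, plus a table of neighbor positions."""
    points = [(i, j) for i, row in enumerate(mask) for j, ocean in enumerate(row) if ocean]
    index = {ij: k for k, ij in enumerate(points)}
    values = [field[i][j] for i, j in points]
    neighbors = [[index[(i + di, j + dj)]
                  for di, dj in OFFSETS
                  if (i + di, j + dj) in index]
                 for i, j in points]
    return points, values, neighbors


def neighbor_sum_1d(values, neighbors, k):
    """1D layout: neighbors are reached indirectly through the index table."""
    return sum(values[m] for m in neighbors[k])
```

Both layouts give the same stencil result. The compact form stores fewer values but pays for an extra index lookup on every neighbor access.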
Three Variants
[Figure: total time for 1 time step (on Tianhe-1A)]
Analysis Conclusions
• Dot products consume a significant share of the PCG solver time
• Three variants: 2S1D selected as the benchmark
Chebyshev
Operations: matrix-vector multiplication and daxpy; no dot product
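A minimal sketch of a Chebyshev iteration shows why no dot product is needed: the recurrence coefficients come from bounds on the spectrum of A rather than from inner products. This follows the textbook three-term form, without a preconditioner and with a fixed iteration count; those choices, the dense matrix, and the helper names are assumptions for illustration and may differ from the variant used in the talk.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def chebyshev_solve(A, b, lam_min, lam_max, iters):
    """Chebyshev iteration for SPD A whose eigenvalues lie in [lam_min, lam_max]."""
    theta = 0.5 * (lam_max + lam_min)  # center of the eigenvalue interval
    delta = 0.5 * (lam_max - lam_min)  # half-width of the interval
    sigma1 = theta / delta
    x = [0.0] * len(b)
    r = list(b)                        # residual for the initial guess x = 0
    rho = 1.0 / sigma1
    d = [ri / theta for ri in r]
    for _ in range(iters):             # each pass: 1 mat-vec, 3 axpy-like updates, 0 dot products
        x = [xi + di for xi, di in zip(x, d)]
        Ad = matvec(A, d)
        r = [ri - qi for ri, qi in zip(r, Ad)]
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        c1, c2 = rho_next * rho, 2.0 * rho_next / delta
        d = [c1 * di + c2 * ri for di, ri in zip(d, r)]
        rho = rho_next
    return x


def laplacian_1d(n):
    """Test matrix tridiag(-1, 2, -1), used here only as a stand-in SPD system."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


def laplacian_1d_bounds(n):
    """Exact extreme eigenvalues of tridiag(-1, 2, -1): 2 -/+ 2 cos(pi / (n + 1))."""
    c = 2.0 * math.cos(math.pi / (n + 1))
    return 2.0 - c, 2.0 + c
```

The price of dropping the dot products is that the method needs usable eigenvalue bounds up front; here they are known in closed form only because the test matrix is so simple.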
Chebyshev
PCG: (4 Daxpy + 1 MV + 3 DP) × total_iter1
CBS (Chebyshev): (3 Daxpy + 1 MV) × total_iter2
Legend: DP = dot product, MV = matrix-vector multiplication
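The operation counts above imply a simple break-even condition. As a worked sketch, let $t_a$, $t_m$, and $t_d$ stand for the time of one Daxpy, one MV, and one DP; these symbols are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
T_{\mathrm{PCG}} &= (4\,t_a + t_m + 3\,t_d)\cdot \mathit{total\_iter}_1 \\
T_{\mathrm{CBS}} &= (3\,t_a + t_m)\cdot \mathit{total\_iter}_2 \\
T_{\mathrm{CBS}} < T_{\mathrm{PCG}}
  &\iff \frac{\mathit{total\_iter}_2}{\mathit{total\_iter}_1}
        < \frac{4\,t_a + t_m + 3\,t_d}{3\,t_a + t_m}
        = 1 + \frac{t_a + 3\,t_d}{3\,t_a + t_m}
\end{align*}
```

Under this model, Chebyshev can afford more iterations than PCG and still finish sooner, and the margin widens as the cost of a dot product grows relative to the other operations.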
Richardson-PCG
• Single precision: faster [5]
  • A processor can take 2 doubles or 4 singles at a time
  • Lower memory pressure
• Double precision: more accurate
• Mix them up
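The size and accuracy trade-off can be checked with Python's standard `struct` module. The 16-byte vector width below is an assumption chosen to match the slide's "2 doubles or 4 singles"; the variable and function names are illustrative.

```python
import struct


def to_single(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


bytes_single = struct.calcsize("f")  # storage for one single-precision number
bytes_double = struct.calcsize("d")  # storage for one double-precision number

VECTOR_BYTES = 16                    # assumed vector register width
singles_per_vector = VECTOR_BYTES // bytes_single
doubles_per_vector = VECTOR_BYTES // bytes_double

# Relative error introduced by storing 0.1 in single precision
rel_err_single = abs(to_single(0.1) - 0.1) / 0.1
```

Halving the storage per value doubles how many values fit in a register or a cache line, at the cost of a larger rounding error per stored value.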
Richardson-PCG
• Richardson method (splitting method): $Ax = b$, $A = M - N$, so $x = M^{-1} N x + M^{-1} b$
• Iteration: $x_{k+1} \leftarrow M^{-1} N x_k + M^{-1} b$, convergent when $\rho(M^{-1} N) < 1$
• Our motivation: let $M = A_{\mathrm{float}}$, s.t. $M^{-1} N = I - M^{-1} A \approx 0$
• Our method: $x_{k+1} \leftarrow x_k + M^{-1} (b - A x_k)$
  • Same as solving $A_{\mathrm{float}} \Delta x = b - A x_k$
  • Approximation: tolerance
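The outer/inner structure of this method can be sketched in pure Python. Several things here are assumptions for illustration: the matrix entries are rounded to single precision with `struct` to play the role of $A_{\mathrm{float}}$, the inner solver is a plain (unpreconditioned) conjugate gradient stopped at a loose tolerance, and the arithmetic itself stays in double precision because Python floats are doubles. The function names and tolerances are hypothetical.

```python
import struct


def to_single(value):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg(A, b, rel_tol, max_iter=1000):
    """Plain conjugate gradient, stopped once the residual drops by rel_tol."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    target = rel_tol * rel_tol * rr
    for _ in range(max_iter):
        if rr <= target:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def richardson_refine(A, b, inner_tol=1e-4, outer_tol=1e-12, max_outer=50):
    """Outer Richardson loop: x <- x + M^{-1}(b - A x), with M = A rounded to single."""
    A_float = [[to_single(a) for a in row] for row in A]  # one-time matrix conversion
    x = [0.0] * len(b)
    b_norm = dot(b, b) ** 0.5
    for outer in range(1, max_outer + 1):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]  # residual with the full-precision A
        if dot(r, r) ** 0.5 <= outer_tol * b_norm:
            return x, outer - 1                           # number of inner solves performed
        dx = cg(A_float, r, inner_tol)                    # approximate solve of A_float dx = r
        x = [xi + di for xi, di in zip(x, dx)]
    return x, max_outer
```

Because the inner system is only solved to `inner_tol`, each outer step shrinks the residual by roughly that factor, so a handful of outer iterations is typically enough to reach the double-precision target in this setup.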
Richardson-PCG
$x_{k+1} \leftarrow x_k + M^{-1} (b - A x_k)$
Richardson-PCG
PCG: (4 Daxpy + 1 DMV + 3 DDP) × total_iter1
Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP) × total_iter3 + (2 CV + 2 Daxpy + 1 DMV + 1 DDP) × outer_iter
Legend: DMV = double mat-vec mul, DDP = double dot product, SMV = single mat-vec mul, SDP = single dot product, CV = convert vector, CM = convert matrix
Richardson-Chebyshev
Rich-CBS: 1 CM + (3 Saxpy + 1 SMV) × total_iter4 + (2 CV + 2 Daxpy + 1 DMV + 1 DDP) × outer_iter4
Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP) × total_iter3 + (2 CV + 2 Daxpy + 1 DMV + 1 DDP) × outer_iter3
Experiments
• Our supercomputers:
  • Tianhe-1A
    • CPU: 2.93 GHz Intel Xeon X5670
    • Memory: 32/48 GB per node. Bandwidth: 40 GB/s
    • Network: 160 Gbps, 22 ns. Fat-tree structure
  • Shenwei
    • CPU: 1.1 GHz Shenwei processor
    • Memory: 32 GB per node. Bandwidth: 68 GB/s (list result)
    • Network: crossbar for every 256 CPUs. Fat-tree structure
Experiments on Tianhe-1A
Experiments on Shenwei
Conclusion
• Two techniques
  • Reducing dot products
    • Effective at large core counts (more than 5,000)
  • Mixed precision
    • Effective at small core counts (fewer than 1,000)
Future Work
• Complete the investigation of the current code
• Integrate optimization techniques into our ocean modeling programs
• Apply our methods to other parallel programs
References
[1] R. Smith, P. Gent, "Reference Manual for the Parallel Ocean Program (POP)", May 2002, pp. 1-74.
[2] A. Stone, J. M. Dennis, M. M. Strout, "The CGPOP Miniapp, Version 1.0", July 2011, pp. 4-5.
[3] Y. Saad, A. Sameh, P. Saylor, "Solving elliptic difference equations on a linear array of processors", SIAM J. Sci. Stat. Comput., Vol. 6, No. 4, October 1985, pp. 1049-1063.
[4] E. Stiefel, "Kernel polynomials in linear algebra and their numerical applications", Nat. Bur. Standards, Appl. Math. Series 49, 1958, pp. 1-22.
[5] A. Buttari, E. Lyon, J. Dongarra, "Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy", ACM Transactions on Math. Software, Vol. 34, No. 4, Article 17, pp. 1-8.
