SlideShare a Scribd company logo
1 of 29
Toward Mega-Scale Computing with pMatlab Chansup Byun and Jeremy Kepner MIT Lincoln Laboratory Vipin Sachdeva and Kirk E. Jordan  IBM T.J. Watson Research Center  HPEC 2010 This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Parallel Matlab (pMatlab) ,[object Object],[object Object],[object Object],Library Layer (pMatlab) Vector/Matrix Comp Task Conduit Application Parallel Library Parallel Hardware Input Analysis  Output User Interface Hardware Interface Kernel Layer Math ( MATLAB/Octave ) Messaging ( MatlabMPI )
IBM Blue Gene/P System Core speed: 850 MHz cores LLGrid Core counts: ~1K Blue Gene/P Core counts: ~300K
Blue Gene Application Paths Serial and Pleasantly  Parallel Apps Highly Scalable Message Passing Apps High Throughput  Computing (HTC) High Performance  Computing (MPI) Blue Gene Environment ,[object Object],[object Object],[object Object]
HTC Node Modes on BG/P ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Porting pMatlab to BG/P System ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Outline ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Performance Studies ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Single Process Performance: Intel Xeon vs. IBM PowerPC 450 * conv2() performance issue in Octave has been improved in a subsequent release  Lower is better HPL MandelBrot ZoomImage* Beamformer Blurimage* FFT 26.4 s 11.9 s 18.8 s 6.1 s 1.5 s 5.2 s
Octave Performance With  Optimized BLAS DGEM Performance Comparison Matrix size (N x N) MFLOPS
Single Process Performance: Stream Benchmark Higher is better 944 MB/s 1208 MB/s 996 MB/s Relative Performance to Matlab Triad a = b +  q  c Add c = a + c Scale b =  q  c
Point-to-Point Communication ,[object Object],[object Object],[object Object],Pid = 0 Pid = 1 Pid = Np-1 Pid = 3 Pid = 2
Filesystem Consideration ,[object Object],[object Object],Pid = 0 Pid = 1 Pid = Np-1 Pid = 3 Pid = 2 Pid = 0 Pid = 1 Pid = Np-1 Pid = 3 Pid = 2
pSpeed Performance on LLGrid:  Mode S Higher is better
pSpeed Performance on LLGrid: Mode M Lower is better Higher is better
pSpeed Performance on BG/P BG/P Filesystem: GPFS
pStream Results with Scaled Size ,[object Object],[object Object],563 GB/sec 100 % Efficiency at Np = 1024
pStream Results with Fixed Size ,[object Object],[object Object],VN: 8.832 TB/Sec 101% efficiency  at Np=16384 DUAL: 4.333 TB/Sec 96% efficiency  at Np=8192 SMP: 2.208 TB/Sec 98% efficiency  at Np=4096
Outline ,[object Object],[object Object],[object Object],[object Object],[object Object]
Current Aggregation Architecture ,[object Object],[object Object],[object Object],[object Object],Np = 8 Np: total number of processes 0 1 2 3 4 5 6 7 8
Binary-Tree Based Aggregation ,[object Object],[object Object],[object Object],Maximum number of message  a process may send/receive is  N, where Np = 2^(N) 0 1 2 3 4 5 6 7 0 2 4 6 0 4
BAGG() Performance ,[object Object],[object Object],[object Object],[object Object],IBRIX: 10x faster at Np=128 LUSTRE: 8x faster at Np=128
BAGG() Performance, 2 ,[object Object],[object Object],[object Object],2.5x faster at Np=1024
Generalizing Binary-Tree Based Aggregation ,[object Object],[object Object],[object Object],[object Object],0 1 2 3 4 5 6 7 0 2 4 6 0 4
BAGG() vs. HAGG() ,[object Object],[object Object],[object Object],[object Object],~3x faster at Np=128
BAGG() vs. HAGG(), 2 ,[object Object],[object Object],[object Object]
BAGG() Performance with Crossmounts ,[object Object],[object Object],Lower is better
Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

More Related Content

What's hot

Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex PerrierAlexis Perrier
 
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Intel® Software
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingJonathan Dursi
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Fine grained asynchronism for pseudo-spectral codes - with application to tur...
Fine grained asynchronism for pseudo-spectral codes - with application to tur...Fine grained asynchronism for pseudo-spectral codes - with application to tur...
Fine grained asynchronism for pseudo-spectral codes - with application to tur...Ganesan Narayanasamy
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsHPCC Systems
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduceM Baddar
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 

What's hot (20)

Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
 
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
 
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big ComputingEuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Fine grained asynchronism for pseudo-spectral codes - with application to tur...
Fine grained asynchronism for pseudo-spectral codes - with application to tur...Fine grained asynchronism for pseudo-spectral codes - with application to tur...
Fine grained asynchronism for pseudo-spectral codes - with application to tur...
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Interpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with SawzallInterpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with Sawzall
 
MapReduce
MapReduceMapReduce
MapReduce
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 

Similar to pMatlab on BlueGene

ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...Hideyuki Tanaka
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
Inter Task Communication On Volatile Nodes
Inter Task Communication On Volatile NodesInter Task Communication On Volatile Nodes
Inter Task Communication On Volatile Nodesnagarajan_ka
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoSciCompIIT
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
Comparison of Open Source Virtualization Technology
Comparison of Open Source Virtualization TechnologyComparison of Open Source Virtualization Technology
Comparison of Open Source Virtualization TechnologyBenoit des Ligneris
 
HPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesHPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesInderjeet Singh
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Cheng-Hsuan Li
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Profiling And Optimization Of Software Base Network Analysis Applications
Profiling And Optimization Of Software Base Network Analysis ApplicationsProfiling And Optimization Of Software Base Network Analysis Applications
Profiling And Optimization Of Software Base Network Analysis ApplicationsHargyo T. Nugroho
 
High Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud TechnologiesHigh Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud Technologiesjaliyae
 
Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011balmanme
 

Similar to pMatlab on BlueGene (20)

ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
Inter Task Communication On Volatile Nodes
Inter Task Communication On Volatile NodesInter Task Communication On Volatile Nodes
Inter Task Communication On Volatile Nodes
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Comparison of Open Source Virtualization Technology
Comparison of Open Source Virtualization TechnologyComparison of Open Source Virtualization Technology
Comparison of Open Source Virtualization Technology
 
HPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud TechnologiesHPC with Clouds and Cloud Technologies
HPC with Clouds and Cloud Technologies
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...
 
Super Computer
Super ComputerSuper Computer
Super Computer
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Profiling And Optimization Of Software Base Network Analysis Applications
Profiling And Optimization Of Software Base Network Analysis ApplicationsProfiling And Optimization Of Software Base Network Analysis Applications
Profiling And Optimization Of Software Base Network Analysis Applications
 
High Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud TechnologiesHigh Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud Technologies
 
Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011
 

Recently uploaded

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

pMatlab on BlueGene

  • 1. Toward Mega-Scale Computing with pMatlab Chansup Byun and Jeremy Kepner MIT Lincoln Laboratory Vipin Sachdeva and Kirk E. Jordan IBM T.J. Watson Research Center HPEC 2010 This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
  • 2.
  • 3.
  • 4. IBM Blue Gene/P System Core speed: 850 MHz cores LLGrid Core counts: ~1K Blue Gene/P Core counts: ~300K
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. Single Process Performance: Intel Xeon vs. IBM PowerPC 450 * conv2() performance issue in Octave has been improved in a subsequent release Lower is better HPL MandelBrot ZoomImage* Beamformer Blurimage* FFT 26.4 s 11.9 s 18.8 s 6.1 s 1.5 s 5.2 s
  • 11. Octave Performance With Optimized BLAS DGEM Performance Comparison Matrix size (N x N) MFLOPS
  • 12. Single Process Performance: Stream Benchmark Higher is better 944 MB/s 1208 MB/s 996 MB/s Relative Performance to Matlab Triad a = b + q c Add c = a + c Scale b = q c
  • 13.
  • 14.
  • 15. pSpeed Performance on LLGrid: Mode S Higher is better
  • 16. pSpeed Performance on LLGrid: Mode M Lower is better Higher is better
  • 17. pSpeed Performance on BG/P BG/P Filesystem: GPFS
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.

Editor's Notes

  1. 1 1 Title of the presentation to be presented at HPEC 2010, September 15-16, 2010.
  2. Outline of the presentation. Introduction will brief about motivation on porting pMatlab to mega-core systems such as IBM Blue Gene/P system and how we did it. Then performance details of the math kernel (octave) for pMatlab on BG/P is given. Also, we will present the technical details and performance improvement done for aggregation in order to work on a large scale computation.
  3. 1 1 pMatlab is the library layer and abstracts away the parallel distribution details so that the user calls pMatlab functions, which provide mechanisms for easily distributing and parallelizing Matlab code. pMatlab can use either MATLAB or Octave as the mathematics kernel, just as the messaging can accomplished with any library which can send and receive MATLAB messages. For our current implementation MatlabMPI is the communication kernel. pMatlab calls MatlabMPI functions to communicate data between processors.
  4. IBM Blue Gene/P system design diagram Each CPU runs at 850MHZ and may have 2GB or 4GB memory four cores. The Surveyor system at Argonne National Lab has 2GB memory, which was used for this study. One full rack has 1024 CPUs or 4096 compute cores.
  5. IBM Blue Gene/P system supportss two major computing environment. High Throughput Computing (HTC) environment is ideal for serial and pleasantly parallel applications. High Performance Computing environment targets for highly scalable message passing applications with MPI library. HTC environment fits well for porting pMatlab environment to BG/P.
  6. In Symmetric Multiprocessing (SMP) mode each physical Compute Node executes a single process per node with a default maximum of four threads. The process can access the full physical memory. With Dual and VN mode for HTC, only ½ and ¼ of the full memory is accessible by each process. For example, the Surveyor system has 2GB memory per compute node. So a process with SMP mode has approximately 2GB minus the space occupied by the OS. DUAL-mode process has about 1GB RAM and VN-mode process has about 512MB RAM.
  7. Porting pMatlab to BG/P system involves how to integrate BG/P job launch mechanism to the pMatlab job launch mechanism. It involves two steps to run pMatlab jobs on BG/P. First, it requests and boots a BG partition in HTC mode using the qsub command on Surveyor. Then, once the partition gets booted, execute the submit command to submit a job to the partition. By combining the two step under the pRUN command, pMatlab jobs can be executed on BG/P.
  8. Outline Slide for performance studies. We will talk about single process performance, point-to-point communication performance and scalability results using parallel stream benchmark.
  9. Examples used for the performance studies, which are available from the pMatlab package.
  10. Performance of example codes running on a single processor core is compared between Intel Xeon and IBM PowerPC chips
  11. Using an optimized BLAS library (ATLAS), Octave 3.2.4 binary was able to run faster than Matlab 2009a in DGEMM and HPL benchmark examples.
  12. Stream benchmark performance is compared between LLGrid and Surveyor (IBM BG/P) systems
  13. Point-to-point communication performance on BG/P is measured by using pSpeed, a pMatlab code example.
  14. Because messages are files in pMatlab, it is important to have efficient file system in order to scale for a large number of processes. In a single nfs-shared file system, this will have serious resource contention as the number of processes increase. However, by using a group of cross-mounted distributed file system, we may mitigate this issue. The next slide shows a demonstration of resource contention with file system.
  15. The pSpeed performance with MATLAB was measured when all messages were written on a single, nfs-shared disk. It started showing resource contention with four processes and with 8 processes, the issue is obvious.
  16. When using crossmounts for pMatlab message communication, the jittery behavior has been resolved. The overall performance of Octave matches well with that of Matlab when using four and eight processes.
  17. The point-to-point communication (pSpeed) on BG/P system performed as expected when using IBM’s proprietary parallel file system, GPFS
  18. Now we are looking at different perspective on scalability. We increase the problem size proportionally as we increase the number of processes so that each process has the amount of work everytime. With SMP mode, the initial global array size is set to 2^25 with Np=1 which is the largest size we can run on a single process. As scale the problem size proportionally with the number of processes, the bandwidth scales linearly. At Np=1024, Triad bandwidth is 563GB/sec with 100% parallel efficiency. At Np=1024, 1024 nodes and 1 processes per node
  19. pStream benchmark was able to scaled up to 16,384 processes on an IBM internal BG/P machine
  20. Outline Slide for aggregation optimization done for large scale computation
  21. The current aggregation architecture is easy to implement but it is inherently sequential. Therefore, the collection process will slow down significantly.
  22. Binary tree based aggregation can significantly reduce the number of messages that a process may send and receive although message itself becomes larger. For example, when Np =8, the leader process receives three messages with BAGG() while it receives 7 messages with AGG().
  23. On MIT Lincoln Laboratory cluster system, the performance of AGG() and BAGG() is compared for a 2-D data and process distribution using two different file systems. At large process numbers, BAGG() performs significantly better than AGG(). With 128 processes, BAGG() achieved 10x improvement on IBRIX and 8x improvement on LUSTRE file systems. The performance number is highly dependent on the file system loads since they were obtained on a production cluster.
  24. On IBM Blue Gene/P system, the performance comparison was done for a wide range of processes, from 8 to 1024 processes by using a 4-D data and process distribution. With GPFS file system, BAGG() runs 2.5x faster than AGG().
  25. BAGG() only works when Np is a power of two number. We generalized BAGG() so that it can work with any number of processes. The generalization is based on an extended binary tree so that we can still use the binary tree based aggregation architecture. There will be some additional costs associated with those fictitious processes. For those fictitious processes, we skip message communication.
  26. The costs associated with additional bookkeeping appears to be minimal. On a production MIT Lincoln Laboratory cluster, the performance of AGG(), BAGG() and HAGG() were compared. At Np = 128, BAGG() and HAGG() runs about 3x faster than AGG()
  27. On an IBM Blue Gene/P system, the cost of additional bookkeeping was studied on a wide range of processes using a 4-D data and process distribution using GPFS file system. Both performs almost identically through out the entire range of processes, from 8 to 512 processes.
  28. On LLGrid system with using a hybrid configuration (central file system for the lead process and crossmounts for compute nodes), the new agg() function can reduce resource contention on file system. Thus, it improved agg() performance significantly.
  29. Summary slide