This document summarizes a study on the impact of hardware and compiler options on the performance of ESP-r and EnergyPlus building simulation tools. Key findings include:
- Disk type significantly impacts performance, with SSDs outperforming rotational disks and SD cards. Virtual disks are around half the speed of native disks.
- Memory constraints can slow down both the build process and simulation runtime if swap space is used excessively.
- Processor type matters less than other factors, but multiple cores can help if tasks are disk-bound rather than CPU-bound.
- Adjusting compiler options and task ordering can substantially reduce build times, data extraction times, and simulation runtimes across different hardware configurations.
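The study's disk comparisons can be approximated with a rough throughput probe such as the sketch below; the file size, block size and use of a temporary file are arbitrary assumptions for illustration, not the paper's methodology:

```python
import os
import time
import tempfile

def rough_write_throughput(size_mb=64, block_kb=256):
    """Time a sequential write and report MB/s (a crude probe, not a real benchmark)."""
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        start = time.perf_counter()
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data to the device, not just the page cache
        elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

print(f"sequential write: {rough_write_throughput():.1f} MB/s")
```

Run against an SSD, a rotational disk and an SD card in turn, such a probe makes the ordering reported above directly visible.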
Comparing Enterprise Server And Storage Networking Options (Angel Villar Garea)
DOWNLOAD ORIGINAL FROM IBM: http://www.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=STGE_QC_QC_USEN&htmlfid=QCL12384USEN&attachment=QCL12384USEN.PDF
NAND Flash is the nonvolatile memory used in virtually all mobile devices (smartphones, tablets, cameras, game controllers). High-performance products (tablets and smartphones) place increasing demands on NAND Flash device capacity, cost and bandwidth. To meet these demands, component and application processor designers must utilize a complex combination of electronic hardware and software. As a result, benchmarking NAND Flash at the component and system level is a key element of successful product design.
Performance Evaluation of the KVM Hypervisor Running on Arm-Based Single-Board Computers (IJCNCJournal)
Single-Board Computers (SBCs) were initially targeted at education and small projects with low power and processing needs. However, their computational power has increased dramatically in the last few years, and they are now used in more advanced developments. This paper studies the feasibility of using ARM-based SBCs as hypervisors. The authors selected the Raspberry Pi 4 Model B and the ODROID-N2+ and assessed them as virtualization servers running up to four VMs simultaneously under the de facto Linux hypervisor (KVM). The tests performed include read and write throughput on different types of storage media, processing power, memory performance, timed compilations of open-source software, and the performance of encryption algorithms. The experiments showed that the amount of memory available in these SBCs is the determining factor in the maximum number of VMs that can run simultaneously. The ODROID-N2+ outperformed the Raspberry Pi 4 Model B; however, the latter enjoys far larger community support than the former, which can be a game changer when selecting a viable platform.
PERFORMANCE AND ENERGY-EFFICIENCY ASPECTS OF CLUSTERS OF SINGLE BOARD COMPUTERS (ijdpsjournal)
When a high-performance cluster is needed and the cost of purchasing and operating servers, workstations or personal computers as nodes is a challenge, single board computers may be an option for building inexpensive cluster systems. This paper describes the construction of such clusters and analyzes their performance and energy efficiency with the High Performance Linpack (HPL) benchmark.
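For orientation, the quantity HPL reports (floating-point operations per second) can be illustrated with a naive pure-Python probe; the toy multiply-add loop below is orders of magnitude slower than a tuned HPL run and is only a sketch of the metric, not of the benchmark:

```python
import time

def rough_flops(n=1_000_000):
    """Estimate FLOP/s from a multiply-add loop (2 floating-point ops per iteration)."""
    acc = 0.0
    start = time.perf_counter()
    for _ in range(n):
        acc = acc * 1.0000001 + 0.5  # one multiply + one add
    elapsed = time.perf_counter() - start
    return 2 * n / elapsed

print(f"~{rough_flops() / 1e6:.1f} MFLOP/s (interpreter overhead dominates)")
```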
Top 10 Supercomputers With Descriptive Information & Analysis (NomanSiddiqui41)
Top 10 Supercomputers Report
What is a Supercomputer?
A supercomputer is a computer with a high level of performance compared to a general-purpose computer. Supercomputer performance is commonly measured in floating-point operations per second (FLOPS) rather than million instructions per second (MIPS). Since 2017, there have been supercomputers that can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, i.e. 100 petaFLOPS or 100 PFLOPS).
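The unit conversion above can be checked directly; the constants are just the SI prefixes, not data from the report:

```python
# SI prefixes used for FLOPS figures
PETA = 10**15
EXA = 10**18

flops = 10**17
print(flops / PETA)  # 100.0 -> 100 PFLOPS
print(flops / EXA)   # 0.1   -> a tenth of an exaFLOPS
```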
Supercomputers play an important role in the field of computational science, and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion). They have been essential in the field of cryptanalysis.
1. The Fugaku Supercomputer
Introduction:
Fugaku is a petascale supercomputer (petascale on the mainstream HPL benchmark) at the RIKEN Center for Computational Science in Kobe, Japan. Development started in 2014 as the successor to the K computer, and the machine entered full operation in 2021. Fugaku made its debut in 2020, becoming the fastest supercomputer in the world in the June 2020 TOP500 list and the first ARM architecture-based computer to achieve this. In June 2020 it achieved 1.42 exaFLOPS on the HPL-AI benchmark, making it the first supercomputer ever to exceed 1 exaFLOPS. As of November 2021, Fugaku is the fastest supercomputer in the world. It is named after an alternative name for Mount Fuji.
Block Diagram:
Functional Units:
Functional Units, Co-Design and System for the Supercomputer “Fugaku”
1. Performance estimation tool: This tool takes execution profile data from the Fujitsu FX100 (the previous Fujitsu supercomputer) as input and projects performance for a given set of architecture parameters. The projection is modeled on the Fujitsu microarchitecture, and the tool can also estimate power consumption from the architecture model.
2. Fujitsu in-house processor simulator: We used an extended FX100 SPARC instruction-set simulator and compiler, developed by Fujitsu, for preliminary studies in the initial phase, and an Armv8+SVE simulator and compiler afterward.
3. Gem5 simulator for the Post-K processor: The Post-K processor simulator, based on the open-source system-level processor simulator Gem5, was developed by RIKEN during the co-design process for architecture verification and performance tuning. A fundamental problem is the scale of the scientific applications expected to run on Post-K: even our target applications are thousands of lines of code and use complex algorithms and data structures.
Dynamic Simulation of Chemical Kinetics in Microcontroller (IJERA Editor)
Arduino boards are interesting computational tools due to their low cost and power consumption, as well as their I/O ports, both analog and digital. Yet small memory, low clock frequency and truncation errors may disrupt numerical processing. This study aimed to design and evaluate the performance of an ODE-based dynamic simulation on the Arduino with three microprocessors: the ATMEGA 328P and 2560, both 8-bit, and the 32-bit Atmel SAM3X8E ARM Cortex. The case study was a batch reactor dynamic simulation. The 4th-order Runge-Kutta algorithm was written in C++ and compiled to EPROM; output went over a 115000 bit/s serial connection. Processing time was almost identical for the 8-bit architectures, while the 32-bit one was 25% faster. Without the serial connection, the 8-bit architectures were 16 times faster and the 32-bit one was 42 times faster. Truncation error was similar, since floating-point arithmetic is done in software. The Arduino platform, despite its modest hardware, allows simulation of simple chemical systems.
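The numerical scheme the abstract describes can be sketched in a few lines. The reactor model below (first-order decay, with a made-up rate constant and initial concentration) is an assumption for illustration, since the paper's actual kinetics are not given here:

```python
import math

def rk4_step(f, t, y, h):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Hypothetical batch reactor: first-order decay dC/dt = -k*C
k, C0, h = 0.5, 1.0, 0.1
f = lambda t, C: -k * C

C, t = C0, 0.0
for _ in range(100):  # integrate to t = 10
    C = rk4_step(f, t, C, h)
    t += h

print(C, C0 * math.exp(-k * t))  # RK4 result vs exact solution, nearly identical
```

On an 8-bit AVR this same loop in C++ runs entirely in software floating point, which is why the abstract reports similar truncation error across architectures.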
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation. We give a general idea of how to accelerate real-time applications using heterogeneous platforms, and propose using the added resources to apply more computationally involved optimization methods. This approach can indirectly accelerate a database by producing better plan quality.
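As a flavour of what such a stereo pipeline computes, here is a toy CPU-only sketch of disparity estimation by minimizing the sum of absolute differences (SAD) along one scanline; the signals and search range are invented for illustration and bear no relation to the paper's GPU implementation:

```python
def disparity_1d(left, right, max_d=4):
    """Find the shift d minimizing sum |left[i] - right[i - d]| over valid i."""
    best_d, best_sad = 0, float("inf")
    for d in range(max_d + 1):
        # Aligning left[d:] with right[0:] compares left[i] against right[i - d]
        sad = sum(abs(l - r) for l, r in zip(left[d:], right))
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d

# Synthetic scanlines: the feature appears 2 pixels earlier in the right image
left  = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0]
right = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
print(disparity_1d(left, right))  # 2
```

On a GPU, the same SAD search runs for every pixel of every scanline in parallel, which is where the speedup the abstract claims comes from.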
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C... (Odinot Stanislas)
After a short introduction to distributed storage and a description of Ceph, Jian Zhang presents some interesting benchmarks in this deck: sequential tests, random tests and, above all, a comparison of results before and after optimization. The configuration parameters touched and the optimizations applied (large page numbers, Omap data on a separate disk, ...) yield at least a 2x performance gain.
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (Yao Yao)
Yao Yao Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
ASSESSING THE PERFORMANCE AND ENERGY USAGE OF MULTI-CPUS, MULTI-CORE AND MANY... (ijdpsjournal)
This paper studies the performance and energy consumption of several multi-core, multi-CPU and many-core hardware platforms and software stacks for parallel programming. As a benchmark it uses the Multimedia Multiscale Parser (MMP), a computationally demanding image encoder application, which was ported to several hardware and software parallel environments. Hardware-wise, the study assesses NVIDIA's Jetson TK1 development board, the Raspberry Pi 2, and a dual Intel Xeon E5-2620/v2 server, as well as NVIDIA's discrete GPUs GTX 680, Titan Black Edition and GTX 750 Ti. The assessed parallel programming paradigms are OpenMP, Pthreads and CUDA, plus a single-thread sequential version, all running in a Linux environment. While the CUDA-based implementation delivered the fastest execution, the Jetson TK1 proved to be the most energy-efficient platform, regardless of the parallel software stack used. Although it has the lowest power demand, the Raspberry Pi 2's energy efficiency is hindered by its lengthy execution times, so it effectively consumes more energy than the Jetson TK1. Surprisingly, OpenMP delivered twice the performance of the Pthreads-based implementation, proving the maturity of the tools and libraries supporting OpenMP.
I understand that physics and hardware emmaded on the use of finete .pdf (anil0878)
I understand that advances in physics and hardware enabled the use of finite element methods to predict fluid flow over airplane wings, and that progress is likely to continue. However, in recent years this progress has been achieved through greatly increased hardware complexity with the rise of multicore and manycore processors, and this is affecting the ability of application developers to achieve the full potential of these systems. Currently, performance is measured on a dense matrix-matrix multiplication test which has questionable relevance to real applications, even given the incredible advances in processor technology and all of the accompanying aspects of computer system design, such as the memory subsystem and networking.
Embedded systems combine hardware and software into a single, coordinated function; developing applications for them likewise means exploiting advanced processor technology to achieve the full potential of the system.
Hardware
(1) Memory
Advances in memory technology have struggled to keep pace with the phenomenal advances in processors. This difficulty in improving main memory bandwidth led to the development of a cache hierarchy, with data held in different cache levels within the processor. The idea is that instead of fetching the required data multiple times from main memory, it is brought into the cache once and re-used multiple times. Intel allocates about half of the chip to cache, with the largest LLC (last-level cache) being 30 MB in size. IBM's new Power8 CPU has an even larger L3 cache of up to 96 MB [4]. By contrast, the largest L2 cache in NVIDIA's GPUs is only 1.5 MB. These different hardware design choices are motivated by careful consideration of the range of applications run by typical users.
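The reuse idea behind a cache hierarchy can be illustrated with a toy direct-mapped cache model that only counts hits and misses; the cache size, line size and address trace below are invented for illustration:

```python
class DirectMappedCache:
    """Toy direct-mapped cache: counts hits/misses, models no data or timing."""
    def __init__(self, lines=8, line_bytes=64):
        self.lines = lines
        self.line_bytes = line_bytes
        self.tags = [None] * lines
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes   # which memory block the address falls in
        index = block % self.lines        # which cache line that block maps to
        if self.tags[index] == block:
            self.hits += 1                # data already in cache: re-use it
        else:
            self.misses += 1              # fetch from main memory, fill the line
            self.tags[index] = block

cache = DirectMappedCache()
for _ in range(10):                 # sweep a small array 10 times
    for addr in range(0, 512, 8):   # 512 bytes = 8 lines: fits entirely in cache
        cache.access(addr)

print(cache.hits, cache.misses)  # misses occur only on the first sweep
```

Because the whole working set fits in the cache, every sweep after the first is pure re-use, which is exactly the behaviour the paragraph above describes for small applications held entirely in LLC.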
One complication which has become more common and more important in the past few years is
non-uniform memory access. Ten years ago, most shared-memory multiprocessors would have
several CPUs sharing a memory bus to access a single main memory. A final comment on the
memory subsystem concerns the energy cost of moving data compared to performing a single
floating point computation.
(2) Processors
Earlier CPUs had a single processing core, and increases in performance came partly from an increase in the number of computational pipelines, but mainly from increases in clock frequency. Unfortunately, power consumption is approximately proportional to the cube of the frequency, and this led to CPUs consuming up to 250 W. CPUs address memory bandwidth limitations by devoting half or more of the chip to LLC, so that small applications can be held entirely within the cache, and they address the roughly 200-cycle main memory latency by using very complex cores capable of out-of-order execution. By contrast, GPUs adopt a very different design philosophy because of the different needs of the graphical applications they target. A GPU usually has a number of functional units.
I understand that physics and hardware emmaded on the use of finete .pdf
survey_of_matrix_for_simulation
Energy Systems Research Unit Email: esru@strath.ac.uk
Dept. of Mechanical and Aerospace Engineering Tel: +44 (0)141 548 2314
75 Montrose Street, Glasgow G1 1XJ http://www.strath.ac.uk/esru
The University of Strathclyde is a charitable body, registered in Scotland, number SC015263
ESRU occasional paper
Survey of a matrix of computing hardware and
compilation influences on the deployment of ESP-r and
EnergyPlus
Dr. Jon W. Hand
Energy Systems Research Unit
13 October 2015
Contents
ABSTRACT
INTRODUCTION
HARDWARE ISSUES
  Disk types and size
  Computer memory
  Virtual (cache) memory
  Processor type
  Computational platforms considered
COMPILER DIRECTIVES
  The test models
  Building models for ESP-r & EnergyPlus
  Data recovery timings
  EnergyPlus models
CONCLUSION
ACKNOWLEDGEMENT
REFERENCES
ABSTRACT
This is an interim report from an ongoing
investigation of the relative contribution of various
hardware and compiler options on the efficacy of
specific ESP-r and EnergyPlus simulation tasks. The
paper draws on a range of techniques and
methodologies developed to port ESP-r to ARM
platforms such as the Raspberry Pi and extends them
across a range of traditional and emerging computing
platforms. Adapting to the hardware limits of ultra
low cost computers exposed a number of issues
related to disk access, disk type, memory, number of
computing cores, compiler options as well as the use
of virtual machines for a range of tool development
and simulation tasks. Among the findings: CFD
convergence times can be reduced from 41s to less
than 10s, annual multi-domain assessments from
950s to 280s, and non-optimal memory and virtual
computing can extend data recovery times by more
than a factor of 10.
INTRODUCTION
In 2012 the Raspberry Pi (www.raspberrypi.org) was
introduced, primarily as a vehicle to address a lack of
programming skills in UK schools. Its combination
of price and computational power proved attractive to
a far wider audience, including the 'maker'
community. This spawned competitors such as the
BeagleBone Black (beagleboard.org) and an
ecosystem of related add-on devices as well as a
community of developers testing the bounds of this
new class of device which were typically distributed
with Linux.
Few perceived such lightweight platforms might
support numerical simulation. However, in a
historical context, the development of simulation
tools and many classic numerical studies were
accomplished with even more constrained
computational resources. At the eSim conference in
2014 the author presented the results of an initial port
of ESP-r to the original Raspberry Pi and
BeagleBone Black ARM-based computers and
observations of their use as software development
platforms as well as for carrying out various
performance assessment goals for different user
types.
Compared with the subset of simulation tasks that the
first generation supported, subsequent ARM-based
computers, for example the Odroid-U3 from Korea
(www.hardkernel.com), have included multiple cores,
more memory and faster disk access. The user
experience gap between a conventional desktop
computer and the $70 Odroid-U3 is surprisingly
modest. For example, simultaneously editing a
3200 surface ESP-r model while running a CFD
assessment and an Octave (an open source equivalent
to MatLab) turbine blade analysis session does not
saturate the Odroid's resources. However, less
constrained hardware is only part of the story.
The author observed that particular adjustments to
the numerical source code and to the compiling tool
chain resulted in significant improvements in the
build process, subsequent user interactions and the
run times for assessments. The magnitude of
improvement was dependent on the specifics of the
hardware, the complexity of the model and the nature
of the assessments carried out.
This paper assesses whether the techniques explored
during the ARM study are applicable in a broader
context of numerical tools, machine configurations
and ordering of simulation work tasks. Both ESP-r
(www.esru.strath.ac.uk/publications.htm ) and
EnergyPlus (apps1.eere.energy.gov/buildings/
energyplus/) are used to test this idea. As many of
the constraints noted in ARM platforms are found in
older laptops and workstations, the study also assesses
the extent of improvement for such computers as
well as for computers which fit the conventional
definition of a numerical workstation. The author
observed that the details of computer hardware, e.g.
type of disk, provision of physical memory and
virtual memory and processor type had an impact on
the time it took to carry out specific simulation tasks
on models of different complexity.
HARDWARE ISSUES
Disk types and size
Most ARM single board computers (SBC) rely on
SDHC cards for disk storage as well as swap space.
SDHC cards were not designed for operating systems
and disk I/O is substantially constrained. Some SBC
and tablets make use of eMMC storage that are mid-
way between SDHC cards and rotational drives in
terms of speed. Conventional computers often
include slower rotational drives rather than SSD
drives. SBC are typically paired with 8GB or 16GB
SDHC cards or eMMC for reasons of cost and this
constrains the space available for the build process
(EnergyPlus requires ~4GB to build) as well as the
space available for simulation files and performance
prediction files.
Another class of drive is the virtual drive
implemented within virtual computers. Tests indicate
that the overheads involved result in disk access at
roughly half the speed of the computer's native drive.
Disk I/O associated with many small files or with
random access files can be an order of magnitude
slower than the sequential reads and writes reported
in benchmarks.
ESP-r models may consist of scores of small files, and
performance predictions for each domain are held in
sequentially written binary files with data recovery
mostly involving random access. Conversely,
EnergyPlus reads few files and sequentially
writes ASCII files and, optionally, a SQL database.
Benchmarking tools such as (a) Blackmagic Disk
Speed Test on OSX and (b) CrystalDiskMark on
Windows use various disk I/O tests for sequential
and random access. Table 1 shows their reporting
across a range of devices. This is only somewhat
indicative of the mix of simulation tool data recovery
tasks which have been assessed.
Table 1 Typical disk I/O speeds (MB/s)

                      Sequential (a)    Random (b)
Drive type            write    read     write    read
SDHC class 10         10-20    20       1.6      5.3
eMMC Odroid           15       55       -        -
USB 2 stick           2-8      18       0.03     5.4
USB 3 stick           15-20    60       0.6      4.1
USB 3 rotational      60       65       1.9      0.5
Old 2.5" rotational   40       45       -        -
USB 3 SSD             110      245      1.2      1.6
Network drive         110      110      1.2      1.6
T61 SSD               110      130      33       22
T61 rotational        60       60       0.3      0.4
Dell 7010 rotational  90       77       1.5      1.2
Dell 755 rotational   84       87       -        -
Dell 755 virtualbox   42       47       -        -
Macbook Air SSD       222      600      99       20
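The sequential figures in Table 1 can be approximated on any Linux machine with dd, and the many-small-files pattern typical of an ESP-r model folder can be mimicked with a loop. A minimal sketch (file sizes and paths are illustrative; these are not the benchmark tools used in the study):

```shell
#!/bin/bash
# Rough disk check in the spirit of Table 1.
dir=$(mktemp -d)

# Sequential write of one 64MB file; conv=fdatasync makes dd report a
# disk rate rather than a page-cache rate.
dd if=/dev/zero of="$dir/big" bs=1M count=64 conv=fdatasync 2>&1 | tail -1

# The many-small-files pattern typical of an ESP-r model folder:
# 256 files of 4kB each (1MB in total) is usually far slower per byte.
start=$(date +%s%N)
for i in $(seq 0 255); do
    dd if=/dev/zero of="$dir/small_$i" bs=4k count=1 conv=fdatasync 2>/dev/null
done
echo "256 x 4kB files: $(( ($(date +%s%N) - start) / 1000000 )) ms"

rm -r "$dir"
```

Comparing the two rates on an SDHC card against an SSD reproduces the broad pattern of the table.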
Many simulation tasks are disk-bound. ESP-r has a
number of user choices which impact the size of the
results files created: the extent of performance data
written (i.e. save level), the period of the assessment
and the building and system time step used. An ESP-r
model at the limits of geometric complexity, running
several months at a one-minute building time step
and a one-second systems time step, might
generate upwards of 40 GB. For example, for an annual
15-minute time step 1890s villa model the constrained
performance file is 354MB and the extensive file is
4.27GB. EnergyPlus also includes optional directives
which may constrain the number of entities which are
reported on or, for example, omit or constrain SQL
outputs.
Computer memory
Constrained-resource computers often run Linux
because of the small memory footprint of the operating
system: with ~512MB RAM there is still ~400MB
available. One of the initial challenges of porting
ESP-r was to create a suite of executables that could
run in their usual combinations within the available
memory. Rather than purge multi-domain
functionality the route taken was to constrain model
complexity via alternative sets of header files for
small and standard deployments. Small deployments
are targeted at low resource computers, but can be
advantageous when running assessments within
virtual computers.
Memory is also an issue during the build process:
some compile options require substantially more
memory and may drive the whole process into virtual
memory. The ESP-r build process essentially
doubled in speed when low-resource linking
commands were used. EnergyPlus is hungry for
memory and disk space during the build process and
builds rarely succeeded with less than ~800MB of
RAM and 4GB of free disk space available.
Given sufficient memory, operating systems use
memory as a buffer for the disk I/O associated with
simulations and subsequent data-mining. There is a
considerable speedup if free memory is greater than
the size of the files being written. It is also the case
that where multiple assessments need to be carried
out it is often much faster to:
simulate_a extract_a simulate_b extract_b
rather than
simulate_a simulate_b extract_a extract_b.
Thus, critical adjustments to the scope of assessment
or the ordering of tasks can reduce the penalty for
data extraction across most platforms.
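The effect of task ordering can be sketched with stand-in jobs: a "simulation" that writes a results file and an "extraction" that reads it back. The dd and cat commands below are placeholders for the real simulate/extract invocations; interleaving keeps each results file in the operating system's page cache when it is read, so the batched ordering is typically slower once the files outgrow free memory:

```shell
#!/bin/bash
# Stand-ins for simulate/extract: dd writes a results file, cat reads it back.
dir=$(mktemp -d)

simulate() { dd if=/dev/zero of="$dir/$1.res" bs=1M count=32 2>/dev/null; }
extract()  { cat "$dir/$1.res" > /dev/null; }

# Interleaved: each results file is read while still in the page cache.
echo "simulate_a extract_a simulate_b extract_b:"
time { simulate a; extract a; simulate b; extract b; }

# Batched: by the time a.res is read, b.res may have evicted it from cache.
echo "simulate_a simulate_b extract_a extract_b:"
time { simulate a; simulate b; extract a; extract b; }

rm -r "$dir"
```

On a machine with ample free memory the two orderings time alike; the gap opens as the results files approach the size of free RAM.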
Virtual (cache) memory
Virtual memory via a swap file on the disk is used
when physical memory is depleted. If slow disks are
combined with limited memory then swap is
increasingly used and performance degrades. This
was evident in the initial porting of ESP-r to the
Raspberry Pi.
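Whether a particular run is dipping into swap can be confirmed on Linux from /proc/meminfo. A minimal sketch (the sleep command is a placeholder for the actual simulation invocation):

```shell
#!/bin/bash
# Snapshot swap use (from /proc/meminfo) before and after a run; a large
# increase suggests the run has been pushed into virtual memory.
swap_used_kb() {
    awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t - f}' /proc/meminfo
}

before=$(swap_used_kb)
sleep 1   # placeholder for the real simulation command
after=$(swap_used_kb)

echo "swap used: ${before} kB -> ${after} kB"
```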
Tools such as ESP-r are composed of many modules,
for example the ESP-r project manager (prj) will
invoke the simulation engine (bps) and later the
results analysis module (res). Although usually not
an issue in conventional deployments, with
constrained memory simulation executables need to
be constrained in size to avoid running in swap. The
approach taken in ESP-r is to have alternative header
files that support different levels of model
complexity. This is also helpful when running
simulation within virtual environments or where
separate processor cores are used for parallel
assessments.
Processor type
ESP-r has had a long history of deployment on
different computer platforms and operating systems.
One of the challenges of the study was to build
EnergyPlus on ARM. Although most users perceive
EnergyPlus as a simulation engine, it runs within the
context of a set of pre-processing and post-processing
utilities. The simulation engine is relatively
straightforward to compile from the Fortran source;
however, the utilities in the standard 8.1
distribution are pre-compiled executables. It proved
difficult to compile the complete set of utilities from
scratch for use on ARM. Eventually the 8.2
EnergyPlus source was used.
Computational platforms considered
The context of the study is a range of computers
spanning ultra low cost computers, tablets,
legacy computers as well as conventional
workstations. The mix of hardware configurations
allowed many of the hardware sensitivities to be
explored. For this study, most comparisons are done
with computers configured to run Linux or emulating
Linux. With minor variants, the command syntax,
form of user interaction, benchmarking tools,
operating system resource requirements and support
for scripting are roughly consistent. The list below
summarises the computer platforms in terms of:
name, computer type, CPU, CPU speed, RAM,
operating system, compiler, swap space and epoch.
1. Dell 7010, desktop, 4x Intel i5-3470 @3.2 GHz,
8GB ram, Ubuntu 14.04.1 LTS. Linux 3.11.0,
GCC 4.8.2, Cache 8061 Mb, 2012
2. Macbook Air, laptop, 2x Intel i5 @ 1.3 GHz,
4GB RAM, OSX 10.8.3, GCC 4.7, Cache 4GB,
SSD, 2013
3. Dell 755 2x, desktop, Intel Core 2 Duo E6550 @
2.33GHz 3.91GB RAM, Mint 16, GCC 4.8.1,
Cache 4049 MB, 2007
4. Dell 755 virtualbox 1.9GB memory 1 processor,
WattOS, GCC 4.7.2
5. IBM thinkpad T61, laptop, Intel Core2 2.0 GHz
2GB RAM Linux Mint, GCC 4.8, SSD, 2007
6. IBM thinkpad T61, laptop, Intel Core2 2.0 GHz
2GB RAM Linux Mint, GCC 4.8
7. EeePC901, netbook, Atom N270, 1.6GHz,1GB
RAM, WattOS, GCC 4.7.2 , Cache 512 Mb,
SSD, 2008
8. Odroid, SBC, ARMv7l, 2GB RAM, Ubuntu
14.04 GCC 4.8.2 , Cache 0 Mb, eMMC, 2014
9. HUDL, tablet, ARMv7l 1GB RAM, Debian 7.7,
GCC 4.6.3, Cache 0 Mb, eMMC, 2013
10. IBM T23, laptop, P-III 1.2Ghz 1GB RAM,
Vector Linux, GCC 4.5.2, Cache 1024Mb,2003
11. Raspberry Pi 2, SBC, ARMv7l, 762MB RAM,
Debian 7.8, GCC 4.6.3, Cache 921 Mb, SDHC,
2015
12. BBB, SBC, ARMv7l, 507MB RAM, Linux
3.8.13, GCC 4.7.2, Cache 820 Mb, SDHC,
2013
13. Raspberry Pi, SBC, ARMv6l, 481MB RAM,
Debian 7.2, GCC 4.6.3, Cache 921 Mb, SDHC,
2012
Standard Linux Hardinfo benchmarks
<sourceforge.net/projects/hardinfo.berlios> are
shown in Table 2 (ordered by the FFT benchmark).
In the table VB denotes a virtual computer; notice
the impact of running one. The newer
ARM processors are of the same order as older
laptops for single-core performance. However, they
tend to perform rather better for simulation tasks than
the benchmarks indicate.
Table 2 Typical benchmarks

Computer        CPU Blowfish  Cryptohash  Fibonacci  FFT
Dell 7010       2.5           712.7       1.4        0.7
Dell 755        7.4           187.0       3.7        3.5
Dell 755 VB     14.6          89.7        3.8        7.1
IBM T61         8.7           161.1       4.1        3.9
IBM T23         37.5          32.6        8.9        23.1
EeePC901        20.6          41.2        7.5        28.5
HUDL            30.3          35.0        7.5        29.5
Odroid          28.4          37.0        6.9        28.8
Raspberry Pi 2  55.5          17.2        14.5       69.9
BBB             47.6          -           14.7       74.3
Raspberry Pi    73.8          -           21.7       119.4
In practice, both EnergyPlus and ESP-r executables
run on a single core. ESP-r is a suite of applications,
so usually more than one application is active,
and thus a second core is useful. This study found
that some sequential and most parallel invocations of
simulation assessments were disk-bound.
COMPILER DIRECTIVES
The impact of options in the compilation tool chain
has rarely been discussed within the simulation
community. Some tools which are normally compiled
from source, such as Radiance, default to directives
for speed of execution. The standard distribution of
EnergyPlus is also optimised for speed. ESP-r is
distributed as source, and its Install script
included no optimisation directives before this study.
The compiler tool chain is based on GNU GCC, so
similar optimisations can be applied to both ESP-r and
EnergyPlus. The optimisation directives
(gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)
for the GCC compiler suite are:
1. -O0 (no optimisation, fastest compile time,
useful for debug sessions),
2. -O1 (attempts to reduce code size and execution
time with least impact on compilation time, more
memory required),
3. -O2 (additional attempt to reduce execution
time, increased compilation time and memory
required)
4. -O3 (additional in-line function and loop
optimisation, even longer compile time)
To quantify the impact of alternative build directives,
both ESP-r and EnergyPlus were re-compiled using
the -O0, -O1, -O2 and -O3 options. For low-resource
computers, the small version header files were used.
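The same experiment can be reproduced in miniature on any machine with GCC by timing a small floating-point kernel at each optimisation level. The C kernel below is an illustrative stand-in, not ESP-r or EnergyPlus source:

```shell
#!/bin/bash
# Build and run a small stand-in kernel at each GCC optimisation level.
cat > kernel.c <<'EOF'
/* Dense triple loop: enough work for -O1/-O2 to show a clear speedup. */
#include <stdio.h>
#define N 256
static double a[N][N], b[N][N], c[N][N];
int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i - j; }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    printf("%f\n", c[N-1][N-1]);
    return 0;
}
EOF

for opt in -O0 -O1 -O2 -O3; do
    gcc $opt kernel.c -o kernel
    echo "== $opt =="
    time ./kernel > /dev/null
done
rm -f kernel kernel.c
```

The step change from -O0 to -O1 and the flattening from -O2 to -O3 reported in the tables below generally show up even in a toy kernel like this.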
Table 3 shows ESP-r build times. For most platforms
there are significant time impacts from -O0 to -O1
but little difference between -O2 and -O3. The full
install included building the databases and configuring
280 training and validation models. Virtual box (VB)
entries are Linux running under the host Linux.
Rotational (rot) and SSD drives are also listed
separately.
Table 3 Build time for ESP-r

Computer              Install type  -O0      -O1      -O2     -O3
Dell 7010             full          6m5s     13m34s   17m48s  18m16s
Mac Air               full          8m38s    12m22s   17m5s   18m39s
Dell 755              full          11m6s    28m2s    37m53s  39m17s
Dell 755 VB           full          18m22s?  24m28s?  30m59s  36m36s
IBM T61 rot           full          12m12s   -        -       -
IBM T61 SSD           full          11m39s   30m03s   39m50s  41m51s
IBM T61 SSD W7 MSYS2  full          27m58s   33m19s   53m35s  -
IBM T23               full          51m14s   91m      113m    140m
Odroid                full          34m48s   61m      92m     98m
EeePC 901             full          39m15s   63m      84m     91m
Raspberry Pi 2        full          60m8s    125m     170m    177m
Raspberry Pi          full          106m     329m     478m    -
Clearly for development work the -O0 option has
benefits, especially given the time involved for low-
resource computers. The subsequent discussion of
simulation task times will clearly demonstrate the
benefit of distributing -O1 or -O2 executables across
all machine types, especially older and low-resource
computers.
The details of EnergyPlus builds are shown below for
two platforms where EnergyPlus and all the utilities
were compiled from scratch. For OSX and Linux on
Intel, the standard distribution utilities were used with
separate EnergyPlus versions compiled with -O0, -O1,
-O2 and -O3.
• Dell 7010 make install -O0 47m27s
• Raspberry Pi 2 make install -O0 231m24s
The test models
Models were selected for ESP-r and EnergyPlus
representing different levels of complexity as well as
exercising different solvers. For CFD assessments
the suite of CFD benchmarks developed by Ian
Beausoleil-Morrison (2000) has been used. The
cases highlighted are:
1. basic.dfd 960 cells, 1 inlet, 1 outlet, no
blockages k-e pressure and velocity solved ~260
iterations
2. porous.dfd 960 cells, 1 inlet, 1 outlet 18
blockages k-e pressure and velocity solved ~400
iterations
3. bi-cg.dfd 24360 cells, 4 inlets, 4 outlets k-e solve
pressure velocity ~2000 iterations
Figure 1 shows a variant with a wall inlet, ceiling
extract and a grid of internal blockages.
Figure 1 CFD domain with internal blockages
CFD is numerically intensive with little disk I/O.
Looking at CFD performance across the matrix of
hardware and software optimisation in Table 4, it is
clearly possible to use resource-constrained
computers in combination with high optimisation.
ARM has a big step in performance between -O0 and
-O1, moderate gains with -O2, while -O3 is marginal.
OSX has a big step in performance from -O1 to -O2
and marginal gains from -O3. Linux on Intel has a big
step from -O0 to -O1 and little or no improvement
with -O2 and -O3. When running on a virtual
computer, -O1 or -O2 will perform roughly in line
with un-optimised software on the host computer. The
Raspberry Pi 2 optimises better than the BBB, perhaps
because of differences in the ARM chip
implementation.
Table 4 CFD performance matrix (seconds)

basic.dfd     -O0    -O1    -O2   -O3
Dell 7010     1.3    <1     <1    <1
Dell 7010 VB  2.2    1.3    <1    <1
Mac Air       2.3    2.2    <1    <1
Dell 755      2.8    1.3    1.2   1.2
Dell 755 VB   3.6    2.3    1.8   -
IBM T61       2.7    1.3    1.2   1.1
IBM T61 W7    3.5    1.7    1.6   -
IBM T23       10.1   7.1    5.5   7.0
Odroid        12.7   4.5    4.0   4.1
HUDL          14.5   4.7    4.2   -
Rasp Pi 2     30.6   12.0   9.7   9.6
BBB           30.3   16.9   17.0  15.8
Rasp Pi       51.6   25.9   -     61.7

porous.dfd    -O0    -O1    -O2   -O3
Dell 7010     4.0    1.5    1.5   1.5
Dell 7010 VB  5.6    2.6    2.1   -
Mac Air       6.5    6.2    2.2   2.1
Dell 755      8.2    3.3    3.2   3.1
Dell 755 VB   8.8    4.9    4.4   -
IBM T61       8.7    3.3    3.2   3.6
IBM T61 W7    11.2   4.2    4.0   -
IBM T23       29.0   15.6   13.6  15.0
Odroid        37.1   11.7   9.6   -
HUDL          41.7   10.0   10.1  10.0
BBB           79.1   33.7   33.7  32.6
Rasp Pi 2     84.2   26.7   23.1  22.9
Rasp Pi       127.0  55.6   -     38.4

bi-cg.dfd     -O0    -O1    -O2   -O3
Dell 7010     229    127    127   122
Dell 7010 VB  357    227    227   -
Mac Air       410.6  391.0  215   205
Dell 755      599.5  357    338   332
Dell 755 VB   670    574    565   -
IBM T61       544    351    332   326
IBM T61 W7    734    486    480   -
IBM T23       2378   2202   1742  2279
Odroid        2284   897    758   -
Rasp Pi 2     -      -      -     2750
Building models for ESP-r & EnergyPlus
Student scale models are characterised by the cellular
office (Figure 2). ESP-r includes a dozen variants of
this model exploring various simulation facilities. For
EnergyPlus, the standard Supermarket.idf was used.
Assessment tasks include different periods and are
reported in Table 5: for example, a spring week for
initial model calibration and reality checks. Every
platform could carry out this task in less than 30
seconds. A January-February assessment explores the
distribution of peak and typical demands, and all
platforms could carry out this task in less than one
minute. The four-month summer assessment
highlights the benefits of optimisation. Lastly, we see
that annual assessments are problematic for low-
resource machines without at least -O1
optimisation. Four cores and single cores produce
similar timings for sequential tasks. Although the Pi
can run 4 simultaneous simulations, disk access is the
bottleneck.
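The bottleneck can be sketched with stand-in jobs: four disk-heavy writes launched in parallel gain little over running them in sequence on a slow card, because they contend for the one disk rather than for cores. The dd jobs below are placeholders for real simulation invocations:

```shell
#!/bin/bash
# Four parallel stand-in "simulations" that are disk-bound rather than
# CPU-bound: on a slow SDHC card these serialise on I/O despite four cores.
dir=$(mktemp -d)

run_one() { dd if=/dev/zero of="$dir/run_$1.res" bs=1M count=32 conv=fdatasync 2>/dev/null; }

echo "four parallel runs:"
time { run_one a & run_one b & run_one c & run_one d & wait; }

rm -r "$dir"
```

Comparing the elapsed time against four sequential invocations shows how much of the work is actually disk-bound.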
Figure 2 Student scale model
A simple model from the point of view of
EnergyPlus is Supermarket.idf. For this study it was
adapted for an annual assessment at four time steps
per hour. The difference in performance between the
two Dell computers is substantial, and for EnergyPlus
the Odroid ARM computer comes closer to the
performance of the older Dell than it does for ESP-r
assessments.
EnergyPlus simple model performance predictions
• Dell 7010 -O3 compile 0m34s
• Dell 755 -O3 compile 4m10s
• Odroid -O3 compile 5m56s
• Raspberry Pi 2 -O3 compile 10m10s
Table 5 Cellular office performance matrix (seconds)
Dell 7010
Period -O0 -O1 -O2 -O3
one week <1 <1 <1 <1
summer 4.1 2.1 2.0 -
annual 11.7 5.7 5.6 -
Air
one week <1 <1 <1 <1
summer 5.8 6.3 2.6 -
annual 16.4 7.3 6.8 -
Dell 755
one week 1.2 <1 <1 <1
summer 7.7 4.2 3.9 3.8
annual 21.9 11.6 10.4 10.2
T61 rotational
one week 1.3 <1 <1 <1
summer 8.2 4.7 4.4 4.4
annual 23.4 12.9 11.8 11.8
T61 SSD
one week 1.2 <1 <1 <1
summer 8.0 4.3 3.9 3.8
annual 22.8 11.7 10.6 10.4
Odroid
one week - 2.9 2.8 2.7
summer - 15.5 14.3 14.6
annual - 42.9 40.0 40.9
Raspberry Pi 2
one week 5.7 5.9 5.5 5.4
summer 82.3 38.6 35.3 34.2
annual 247.5 108.7 102.4 97.4
BBB
one week 14.1 9.0 8.9 8.9
summer 126.9 77.7 76.2 79.7
The second building model tested is an 1890s Stone
Villa (Figure 3) which has typically been used to
assess refurbishment options and its geometric and
compositional detail reflects this. This model
includes 13 thermal zones and 432 surfaces. The
composition includes a mix of lightweight and heavy
entities (outer walls with 600mm of various stone
types). There is considerable diversity of room use
throughout the day and for different day types.
Figure 3 Moderate complexity model
The model was exported from ESP-r as a V7.2
IDF file and then upgraded via the usual utilities to
an 8.2 IDF. Two variants were created: a base case
using conduction transfer functions at the same time
step as ESP-r used, and another using the finite
difference solver at 20 time steps per hour. The finite
difference solver directives would be roughly
analogous to the finite volumes used in ESP-r.
In this case, the size of the results files can become
an issue. ESP-r supports multiple save levels: save 4,
which includes a full energy balance at all zones and
all surfaces, and save 2, which does not include the
energy balance for surfaces. For example, for the stone
villa ESP-r model a one-week assessment is 6.9MB
and 82.4MB respectively, a two-month assessment is
57MB and 693MB, summer 118MB and 1.43GB, and
an annual assessment is 354MB and 4.28GB.
The performance matrix for ESP-r is shown in Table
6. The entries marked s2 are for constrained
performance data and s4 include a full energy
balance. Of interest is the improvement across all
platforms, especially as model complexity increases
and for the summer or annual assessments. All other
factors being consistent, Linux annual run-time
reductions from 950s to 230s and OSX run-time
reductions from 744s to 181s generated a number of
oh-my-goodness reactions from users. It was
possible for an optimised Dell 755 ESP-r to surpass a
newer but un-optimised Dell 7010 ESP-r. Indeed, a
fully optimised Raspberry Pi 2 approached the
un-optimised performance of the Lenovo T61.
Data recovery timings
The extraction of data from ESP-r results files is not
particularly sensitive to the level of build
optimisation. Rather it depends on the nature of the
disk drive, the available memory and the extent of
the results file being scanned. The SDHC cards in
several of the SBC were seen to be especially slow
for scanning large results files. Similarly, constrained
memory prevented data recovery from the memory
buffer and forced disk reads in several of the cases.
Table 7 shows timings, including runs which were
impacted by a lack of free memory (*).
Table 6 Stone villa performance matrix (seconds)
Dell 7010
Period -O0 -O1 -O2 -O3
s2 week 29.8 10.5 9 -
s2 summer 164 57 51 -
s2 annual 448 164 139 -
s4 summer 167 56 54
s4 annual 459 164 148
Dell 755
s2 summer 347 128 102 103
s2 annual 953 352 281 280
s4 summer 355 133 108 107
s4 annual 974 368 296 294
Dell 755 Virtual Box WattOS
s2 summer 371 162 163 162
s2 annual 1010 531 465
s4 summer 340 199 145
s4 annual - - - -
Macbook Air
s2 week 47 14 12 12
s2 summer 258 78 66 66
s2 annual 774 209 181 181
s4 summer 263 81 70 73
s4 annual 723 226 194 203
T61 rotational
s2 week 67 28.4 23.6 23.2
s2 summer 369 157 131 130
s2 annual 1010 430 351 361
s4 summer 375 163 146 137
s4 annual 1031 444 395 368
T61 SSD W7
s2 week 93 41 30
s2 summer 497 198 150
s2 annual 1388 533 430
Raspberry Pi 2
s2 week 704 270 220 209
s2 summer 3925 - 1190 1159
s2 annual 10803 - 3217 3218
s4 week 707 - 219 213
s4 summer 3943 - 1225 1238
Table 7 Data recovery matrix (elapsed seconds)

Computer     one week   Jan-Feb   summer   annual
Cellular office model
Dell 7010 <1 <1 1 2
Mac Air <1 1 2 4
Dell 755 <1 1.5 2.2 4-5
T61 (rot) 1 1.5 2.5 4-6
T61 SSD 1 1.8 2.9 5-7
Odroid 2.4 4 6 12
Rasp Pi 2 5 8 12 47
BBB 5 12 19 49
Stone villa model (constrained performance data)
Dell 7010 <1 3 5 8
Mac Air 1 7 13 37
Dell 755 1.5 6 13 36
Dell 755VB 3 9 17 52
T61 (rot) 2 7 14 41
T61 SSD 2 6 9 26
Odroid 6 25 49 140
Rasp Pi 2 9 28 52 140
BBB 9 29 54 150
Stone villa model (with full energy balance)
Dell 7010 1 5 10 15
Mac Air 2 13 32 390*
Dell 755 2 12 18 68
Dell 755VB 4 47 189*
T61 (rot) 3 14 35 247*
T61 SSD 3 19 43 369*
Odroid 4-9 50 104 -
Raspb Pi 2 9-10 41-48 180* -
BBB 10-11 141* 297* -
EnergyPlus models
The Supermarket.idf model, like many of the
example models distributed with EnergyPlus, can be
used for calibration assessments or non-annual
assessments on the platforms studied.
The impact of compiler optimisations on conduction
transfer and finite difference assessments is shown
below for annual EnergyPlus 8.2 runs of the Stone
villa. The tool-chain optimisation improvements for
EnergyPlus are roughly in line with the pattern seen
with the ESP-r build process.
EnergyPlus annual run timings:
Dell 7010 -O1 2m48s with conduction transfer
Dell 7010 -O3 0m32s with conduction transfer
Dell 7010 -O1 59m20s with finite difference solution
Dell 7010 -O3 10m46s with finite difference solution
Dell 755 -O0 compile 9m55s
Dell 755 -O1 compile 4m41s
Dell 755 -O2 compile 3m32s
Dell 755 -O0 compile, finite difference 192m8s
Dell 755 -O1 compile, finite difference 102m58s
Dell 755 -O2 compile, finite difference 76m30s
Mac Air EnergyPlus 8.1 standard distribution 1m7s
Raspberry Pi 2 -O1 30m4s with conduction transfer
Raspberry Pi 2 -O3 12m34s with conduction transfer
Raspberry Pi 2 -O1 591m55s with finite difference
Raspberry Pi 2 -O3 194m28s with finite difference
It is unclear why the Dell 755 is so much less suited
to EnergyPlus finite difference production work. For
models which make use of the finite difference solver,
low-resource computers have a distinct disadvantage.
The GCC optimisations yield the expected pattern of
run-time changes for both Fortran and C++.
However, the -O3 optimisation with GCC delivers
less performance than the compiler used by the
EnergyPlus development team.
CONCLUSION
A matrix of computer hardware and software options
has been tested against a range of ESP-r and
EnergyPlus simulation models and for a range of
simulation tasks. Timings for numerical tasks and
performance recovery tasks have been reported.
What is more difficult to quantify in terms of timings
are the user interactions associated with creating and
evolving models. Typically, these tasks require a
fraction of the available computing resource, and here
the user experience of low-resource and older
hardware is less marked than standard numerical
benchmarks would suggest. The use of compiler
optimisation directives removes most of the latency
in the drawing of wire-frames and in the navigation
of models. Creating and evolving models of
moderate complexity on all but the most constrained
of platforms would likely be acceptable to many
practitioners. Optimised software and hardware have
thus been seen to expand options for deploying
simulation to non-traditional platforms.
It has been seen that ARM processors benefit from a
higher degree of optimisation, followed by Intel
Linux and then Windows 7. For ESP-r it makes sense
to adopt at least the -O1 level for software to be
distributed. On Intel computers, optimisation beyond
-O1 produces marginal improvements for a massive
increase in build times. Developers building
EnergyPlus may choose to debug with -O0 but
should remember to rebuild with -O3 for
distribution.
Without any hardware changes, a 2007 Dell 755 with
optimised software was seen to perform similarly to
an un-optimised 2012 Dell 7010 for a number of
simulation tasks. Similarly, optimised software can
make up for much of the numerical inefficiency in
the use of virtual machines. A Raspberry Pi 2 has
been used for student projects essentially without
comment if the software is fully optimised and
instructions avoid the generation of large files and
extensive data mining tasks.
There are clear indications that critical adjustments to
the scope of assessments and the simulation workflow
can improve production tasks by avoiding the use of
virtual memory and ensuring that data recovery can
make use of reads from the memory buffer rather
than disk.
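One way to act on this advice is to check, before a long extraction task, whether a results file can plausibly stay resident in the page cache. The sketch below assumes a Linux host and a 2 GB results file; both figures are illustrative, not taken from the study.

```shell
# Hedged sketch (Linux): will an assumed 2 GB results file fit in available
# memory, so that data recovery reads from the page cache rather than disk?
results_kb=$((2 * 1024 * 1024))   # assumed results-file size, in kB
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -gt "$results_kb" ]; then
    echo "cached reads likely"
else
    echo "expect disk-bound extraction"
fi
```

Trimming the scope of an assessment (fewer report variables, shorter periods) is the complementary lever when the file cannot be made to fit.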
The study provides evidence that careful selection of
refurbishment options for legacy hardware can
extend their life considerably. For example, the
combination of a replacement SSD and optimally
compiled software increased user productivity on
2006 laptops and workstations.
Legacy hardware that can no longer run Windows
XP can usually be repurposed as Linux computers
capable of a number of simulation-related work tasks.
For example, a re-configured netbook was seen to be
in line with the better ARM SBCs for browsing
models and checking details while visiting sites for
consulting projects.
Although ESP-r is natively hosted on Windows
platforms, tests show that, on the same hardware,
ESP-r is roughly 30% slower on Windows 7. This
might be because of OS resource requirements, or it
might be due to inefficient use of virtual memory.
This suggests that there may be additional
optimisation techniques to be explored for the
Windows platform.
The full matrix of computers/ compilers/ operating
system variants and observations is being compiled
and extended for a journal paper.
ACKNOWLEDGEMENT
Some of the computers used in this study were
sourced within the University of Strathclyde. Critical
advice on compiling EnergyPlus came from Linda
Laurie.
REFERENCES
Beausoleil-Morrison, I. 2000. The adaptive coupling
of heat and airflow modelling within dynamic
whole-building simulation. University of
Strathclyde, Glasgow.
Hand, J. 2014. Opportunities and constraints in the
use of simulation on low cost ARM-based
computers. eSim Conference, Ottawa, Canada.
Hand, J. 2015. Strategies for deploying virtual
representations of the built environment.
Glasgow, Scotland.