This document outlines Unai Lopez-Novoa's PhD dissertation on the efficient use of general-purpose coprocessors, with kernel density estimation as a case study. It introduces the motivation for, and challenges of, porting applications to accelerators. It then describes its contributions: S-KDE, a novel, efficient kernel density estimation algorithm, and its implementations for multi-core and many-core processors and general-purpose coprocessors. Finally, it proposes a methodology for environmental model evaluation based on S-KDE.
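As background, one-dimensional kernel density estimation can be sketched in a few lines; this is the textbook O(n) per-evaluation formulation that fast methods such as S-KDE are designed to improve upon, not the S-KDE algorithm itself:

```python
import math

def gaussian_kde(samples, x, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate at point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)

# Density estimate from a handful of illustrative samples
samples = [1.0, 1.5, 2.0, 2.5]
print(gaussian_kde(samples, 1.75, 0.5))
```

Evaluating this at every point of a fine grid is what makes naive KDE expensive, which is the cost that motivates both algorithmic improvements and coprocessor offloading.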
RAMSES: Robust Analytic Models for Science at Extreme Scales, by Ian Foster
RAMSES: A new project in data-driven analytical modeling of distributed systems
RAMSES is a new DOE-funded project on the end-to-end analytical performance modeling of science workflows in extreme-scale science environments. It aims to link multiple threads of inquiry that have not, until now, been adequately connected: namely, first-principles performance modeling within individual sub-disciplines (e.g., networks, storage systems, applications), and data-driven methods for evaluating, calibrating, and synthesizing models of complex phenomena. What makes this fusion necessary is the drive to explain, predict, and optimize not just individual system components but complex end-to-end workflows. In this talk, I will introduce the goals of the project and some aspects of our technical approach.
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Nowadays, HPC systems such as those in the Top500 are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am currently on a brief research visit to the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, CEO and Founder of Auviz Systems, presents the "Trade-offs in Implementing Deep Neural Networks on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Video and images are a key part of Internet traffic—think of all the data generated by social networking sites such as Facebook and Instagram—and this trend continues to grow. Extracting usable information from video and images is thus a growing requirement in the data center. For example, object and face recognition are valuable for a wide range of uses, from social applications to security applications. Deep neural networks are currently the most popular form of convolutional neural networks (CNN) used in data centers for such applications. 3D convolutions are a core part of CNNs. Nagesh presents alternative implementations of 3D convolutions on FPGAs, and discusses trade-offs among them.
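For readers unfamiliar with the operation, a direct (naive) 3-D convolution can be sketched as follows; the FPGA implementations discussed in the talk restructure exactly this kind of loop nest, and the array sizes here are purely illustrative:

```python
import numpy as np

def conv3d_direct(volume, kernel):
    """Direct 3-D convolution: slide the kernel over every valid position."""
    kd, kh, kw = kernel.shape
    D, H, W = volume.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Multiply-accumulate over one kernel-sized window
                out[z, y, x] = np.sum(volume[z:z+kd, y:y+kh, x:x+kw] * kernel)
    return out

vol = np.arange(27.0).reshape(3, 3, 3)
k = np.ones((2, 2, 2))
print(conv3d_direct(vol, k).shape)   # (2, 2, 2)
```

The trade-offs Gupta discusses (parallelism, buffering, precision) all concern how this multiply-accumulate loop nest is mapped onto FPGA resources.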
Low Power High-Performance Computing on the BeagleBoard Platform, by a3labdsp
The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to give deeper consideration to the energy efficiency of computing equipment. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the others executing the actual processing. The software platform is based on the Angstrom GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, and with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster were assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption was measured using a clamp meter. Experimental results obtained in the speaker diarization task showed that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, with the bottleneck due to the Ethernet interface removed, the BeagleBoard-xM cluster achieves superior energy efficiency.
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines, by Intel® Software
Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where they historically take as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to efficiently use caches and wide vector units in modern CPUs. We therefore present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction, multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in B-spline orbital evaluation kernels, paving the way towards strong scaling of QMC simulations and yielding performance improvements. Finally, we employ roofline performance analysis to model the impact of our optimizations.
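The AoS-to-SoA transformation described above can be illustrated with a toy sketch; the field names and element count are hypothetical, and NumPy stands in for the C++ containers used in the actual kernels:

```python
import numpy as np

n = 8
# Array of Structures: per-particle fields interleaved in memory,
# so vector loads of one field must stride over the others
aos = np.zeros(n, dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4')])

# Structure of Arrays: each field stored contiguously, so SIMD lanes
# can stream a single field without gather operations
soa = {f: np.ascontiguousarray(aos[f]) for f in ('x', 'y', 'z')}

# Same data, different layout: operations on one field now touch
# contiguous memory
soa['x'][:] = 1.0
print(soa['x'].sum())  # 8.0
```

Blocking, the next optimization mentioned, then partitions each contiguous SoA array into cache-sized tiles so that reuse happens before eviction.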
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c..., by Matteo Ferroni
The Internet of Things (IoT) is experiencing huge hype these days, thanks to the increasing capabilities of embedded devices, which enable their adoption in new fields of application (e.g. Wireless Sensor Networks, Connected Cars, Health Care). On the one hand, this is leading to increasing adoption of multi-tenancy solutions for Cloud and Fog Computing to analyze and store the data produced. On the other hand, power consumption has become a major concern for almost every digital system, from the smallest embedded circuits to the biggest computer clusters, and everything in between. Fine-grained control mechanisms are therefore needed to cap power consumption at each level of the stack while still guaranteeing Service Level Agreements (SLAs) to the hosted applications. In this work, we propose DockerCap, a software-level power capping orchestrator for Docker containers that follows an Observe-Decide-Act (ODA) loop structure: this allows it to react quickly to changes that impact power consumption by managing the resources of each container at run-time, ensuring the desired power cap. We show that we obtain results comparable with the state-of-the-art power capping solution provided by Intel RAPL, while still being able to tune the performance of the containers and even guarantee SLA constraints.
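A minimal sketch of one Observe-Decide-Act iteration, assuming hypothetical read_power and set_cpu_quota hooks in place of the real RAPL counters and container resource controls that a system like DockerCap would use:

```python
def oda_power_cap(read_power, set_cpu_quota, cap_watts, quota=1.0):
    """One Observe-Decide-Act iteration of a software power capper.

    read_power and set_cpu_quota are stand-ins for platform hooks
    (e.g. energy counters and cgroup CPU quotas); both are assumptions.
    """
    power = read_power()                 # Observe
    if power > cap_watts:                # Decide: over cap, throttle
        quota = max(0.1, quota * 0.9)
    elif power < 0.9 * cap_watts:        # Decide: headroom, relax
        quota = min(1.0, quota * 1.05)
    set_cpu_quota(quota)                 # Act
    return quota

# Simulated run: pretend power draw is proportional to the granted quota
quota = 1.0
for _ in range(20):
    quota = oda_power_cap(lambda: 100 * quota, lambda q: None,
                          cap_watts=60, quota=quota)
print(round(quota, 2))  # 0.59
```

In this toy simulation the quota converges until the modeled power (100 x quota) sits just under the 60 W cap; the real orchestrator acts on measured power and multiple containers.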
Full paper: http://ieeexplore.ieee.org/document/7982228/
VaMoS 2022 - Transfer Learning across Distinct Software Systems, by Luc Lesoil
Many research studies predict the performance of configurable software using machine learning techniques, thus requiring large amounts of data. Transfer learning aims to reduce the amount of data needed to train these models and has been successfully applied across different execution environments (hardware) or software versions. In this paper, we investigate for the first time the idea of applying transfer learning between distinct configurable systems. We design a study involving two video encoders (namely x264 and x265) coming from different code bases. Our results are encouraging, since transfer learning outperforms traditional learning for two of three performance properties. We discuss the open challenges to overcome for a more general application.
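The idea of reusing a source-system model on a distinct target system can be sketched with a toy linear "model shift" transfer; the configuration space, measurements, and linear relationship below are all invented for illustration and are far simpler than real encoder performance data:

```python
import itertools
import numpy as np

# All 32 configurations of 5 binary options (a hypothetical configurable system)
X = np.array(list(itertools.product([0, 1], repeat=5)), dtype=float)
w_true = np.array([3.0, -1.0, 2.0, 0.5, 1.5])
y_source = X @ w_true                 # plentiful measurements on the source system
y_target = 1.8 * y_source + 4.0       # target behaves like a linearly shifted source

# Step 1: fit a performance model on the cheap-to-measure source system
w, *_ = np.linalg.lstsq(X, y_source, rcond=None)

# Step 2: transfer using only 4 target measurements, learning a linear correction
idx = [0, 1, 2, 3]
A = np.column_stack([X[idx] @ w, np.ones(len(idx))])
(alpha, beta), *_ = np.linalg.lstsq(A, y_target[idx], rcond=None)

# Predict all target configurations from the corrected source model
pred = alpha * (X @ w) + beta
print(np.max(np.abs(pred - y_target)) < 1e-8)   # True
```

When the source and target systems are related this simply, a handful of target samples suffices; the paper's contribution is studying how far this holds between genuinely distinct code bases.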
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/02/introduction-to-the-tvm-open-source-deep-learning-compiler-stack-a-presentation-from-octoml/
Luis Ceze, Co-founder and CEO of OctoML, a Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, and Venture Partner at Madrona Venture Group, presents the “Introduction to the TVM Open Source Deep Learning Compiler Stack” tutorial at the September 2020 Embedded Vision Summit.
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms — such as mobile phones, embedded devices, and accelerators — requires significant manual effort.
In this talk, Ceze presents his work on the TVM stack, which exposes graph- and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of optimizations.
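Operator fusion, one of the optimizations mentioned above, can be illustrated in spirit with NumPy; a compiler like TVM fuses at the loop level so each element is read once, which NumPy can only approximate by reusing an output buffer instead of allocating an intermediate array:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 1_000_000)

def scale_then_relu(x):
    """Unfused: materializes an intermediate array between the two ops."""
    t = 2.0 * x
    return np.maximum(t, 0.0)

def fused(x):
    """Buffer-reusing approximation of fusing scale and ReLU:
    no intermediate allocation, though NumPy still makes two passes."""
    out = np.empty_like(x)
    np.multiply(x, 2.0, out=out)
    np.maximum(out, 0.0, out=out)
    return out

print(np.allclose(scale_then_relu(x), fused(x)))  # True
```

A true fused kernel would compute max(2*x[i], 0) in a single loop body, which is the kind of code TVM generates per hardware back-end.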
ACTRESS: Domain-Specific Modeling of Self-Adaptive Software ArchitecturesFilip Krikava
Presentation given at 29th Symposium On Applied Computing (SAC'14) - Dependable and Adaptive Distributed Systems track.
It is mainly based on the work done during my Ph.D.
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing centre in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to provide supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology to transfer to society. We are a Severo Ochoa Centre of Excellence, a top-tier member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Spanish Supercomputing Network (RES). As a research centre, we count on more than 456 experts from 45 countries, organized into four major research areas: Computer Sciences, Life Sciences, Earth Sciences, and computational applications in science and engineering.
Self-adaptive container monitoring with performance-aware Load-Shedding policies, by Rolando Brondolin, PhD student in System Architecture at Politecnico di Milano
Blue Waters and Resource Management - Now and in the Future, by inside-BigData.com
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
Transfer Learning for Performance Analysis of Configurable Systems: A Causal ..., by Pooyan Jamshidi
Modern systems (e.g., deep neural networks, big data analytics, and compilers) are highly configurable, which means they expose different performance behavior under different configurations. The fundamental challenge is that one cannot simply measure all configurations due to the sheer size of the configuration space. Transfer learning has been used to reduce the measurement efforts by transferring knowledge about performance behavior of systems across environments. Previously, research has shown that statistical models are indeed transferable across environments. In this work, we investigate identifiability and transportability of causal effects and statistical relations in highly-configurable systems. Our causal analysis agrees with a previous exploratory analysis (Jamshidi et al., 2017) and confirms that the causal effects of configuration options can be carried over across environments with high confidence. We expect that the ability to carry over causal relations will enable effective performance analysis of highly-configurable systems.
With the rise of containerization, as well as the established adoption of virtualization technologies, run-time power and energy management is becoming one of the key challenges in modern cloud computing. This is also fundamental because power consumption contributes about 20% of the Total Cost of Ownership of a datacenter, and energy costs will exceed hardware costs in the near future. In this context, several goals towards power optimization can be pursued. On the one hand, a power cap can be enforced and, on top of that, the system should maximize performance. On the other hand, when performance is critical, the system should provide a minimum SLA and optimize power consumption without violating it. Within this context, we propose a common autonomic methodology based on the ODA control loop for containers and virtual machines. The proposed methodology achieves 25% power savings for containers and can improve performance under a power cap for virtual machines.
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over..., by Ilham Amezzane
Support Vector Machines (SVMs) have proven to yield high accuracy and have seen widespread use in recent years. However, the standard versions of the SVM algorithm are very time-consuming and computationally intensive, which challenges engineers to explore hardware architectures other than the CPU that are capable of performing real-time training and classification while maintaining low power consumption in embedded systems. This paper offers an overview of works based on the two most popular parallel processing devices, GPU and FPGA, with a focus on the multiclass training process. Since different techniques have been evaluated using different experimentation platforms and methodologies, we focus only on the improvements reported in each study.
This is a presentation by Prof. Anne Elster at the International Workshop on Open Source Supercomputing held in conjunction with the 2017 ISC High Performance Computing Conference.
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi..., by inside-BigData.com
In this deck from PASC18, Robert Searles from the University of Delaware presents: Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures.
"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models, among other components, in order to migrate large-scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task, especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling.
We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif."
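The wavefront dependency pattern underlying the KBA sweep can be sketched on a 2-D grid; the real Minisweep kernel is 3-D with angles and energy groups, so this toy recurrence only shows why all cells on one anti-diagonal are mutually independent and can be processed in parallel:

```python
import numpy as np

def wavefront_sweep(n):
    """Sweep an n x n grid where cell (i, j) depends on its west (i, j-1)
    and north (i-1, j) neighbors. Cells sharing an anti-diagonal
    (constant i + j) have no mutual dependencies."""
    grid = np.zeros((n, n))
    grid[0, 0] = 1.0
    for d in range(1, 2 * n - 1):            # one wavefront per anti-diagonal
        # Every (i, j) with i + j == d could be computed in parallel here
        for i in range(max(0, d - n + 1), min(d + 1, n)):
            j = d - i
            west = grid[i, j - 1] if j > 0 else 0.0
            north = grid[i - 1, j] if i > 0 else 0.0
            grid[i, j] = west + north
    return grid

print(wavefront_sweep(4)[3, 3])   # 20.0 (counts of monotone lattice paths)
```

Mapping the inner loop over a diagonal onto GPU threads is, in essence, what the CUDA, OpenMP and OpenACC versions of Minisweep do at much larger scale.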
Watch the video: https://wp.me/p3RLHQ-iPU
Read the Full Paper: https://doi.org/10.1145/3218176.3218228
and
https://pasc18.pasc-conference.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Simulating Heterogeneous Resources in CloudLightning, by CloudLightning
In this presentation, Dr Christos Papadopoulos-Filelis (Democritus University of Thrace, Greece) discusses resource characterisation, simulation tools and the elements of simulation used in CloudLightning.
This presentation was given at the National Conference on Cloud Computing in Dublin City University on 12th April 2016.
GALE: Geometric active learning for Search-Based Software Engineering, by CS, NcState
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
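The Pareto-dominance relation that GALE's frontier approximation relies on can be sketched directly; this is the standard definition for minimization problems, not GALE's implementation:

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (minimization assumed)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the non-dominated points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Hypothetical two-objective solutions (e.g. cost vs. defects, both minimized)
pts = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]
print(pareto_front(pts))   # [(1, 5), (2, 2), (4, 1)]
```

GALE's contribution is avoiding the many evaluations a full dominance sort implies, by approximating this frontier piecewise and mutating solutions toward the better end of each piece.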
Application Profiling at the HPCAC High Performance Center, by inside-BigData.com
Pak Lui from the HPC Advisory Council presented this deck at the 2017 Stanford HPC Conference.
"Achieving good scalability on HPC scientific applications typically involves a good understanding of the workload through profile analysis, and comparing behavior on different hardware to pinpoint bottlenecks in different areas of the HPC cluster. In this session, a selection of HPC applications will be shown to demonstrate various methods of profiling and analysis to determine the bottleneck, and the effectiveness of tuning to improve application performance, based on tests conducted at the HPC Advisory Council High Performance Center."
Watch the video presentation: http://wp.me/p3RLHQ-gpY
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Exploring performance and energy consumption differences between recent Intel...Unai Lopez-Novoa
Slides for the talk "Exploring performance and energy consumption differences between recent Intel processors", presented at the Workshop on High Performance Big Data Computing (WHPBDC 2019) in Leicester, UK.
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...Unai Lopez-Novoa
Slides used for to present the paper "A Platform for Overcrowding Detection in Indoor Events using Scalable Technologies"
Presented at the 13th IEEE International Conference on Ubiquitous Intelligence and Computing (UIC), 18th July 2016, Toulouse, France
Charla realizada en el Master de Sistemas Informáticos Avanzados, UPV/EHU, 12/11/2014.
Presentación del hardware utilizado en la actualidad para cómputo masivo en clusters, librerias para programación HPC y varios retos en investigación en el área del paralelismo.
Courier management system project report.pdfKamal Acharya
It is now-a-days very important for the people to send or receive articles like imported furniture, electronic items, gifts, business goods and the like. People depend vastly on different transport systems which mostly use the manual way of receiving and delivering the articles. There is no way to track the articles till they are received and there is no way to let the customer know what happened in transit, once he booked some articles. In such a situation, we need a system which completely computerizes the cargo activities including time to time tracking of the articles sent. This need is fulfilled by Courier Management System software which is online software for the cargo management people that enables them to receive the goods from a source and send them to a required destination and track their status from time to time.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Vaccine management system project report documentation..pdfKamal Acharya
The Division of Vaccine and Immunization is facing increasing difficulty monitoring vaccines and other commodities distribution once they have been distributed from the national stores. With the introduction of new vaccines, more challenges have been anticipated with this additions posing serious threat to the already over strained vaccine supply chain system in Kenya.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
Automobile Management System Project Report.pdfKamal Acharya
The proposed project is developed to manage the automobile in the automobile dealer company. The main module in this project is login, automobile management, customer management, sales, complaints and reports. The first module is the login. The automobile showroom owner should login to the project for usage. The username and password are verified and if it is correct, next form opens. If the username and password are not correct, it shows the error message.
When a customer search for a automobile, if the automobile is available, they will be taken to a page that shows the details of the automobile including automobile name, automobile ID, quantity, price etc. “Automobile Management System” is useful for maintaining automobiles, customers effectively and hence helps for establishing good relation between customer and automobile organization. It contains various customized modules for effectively maintaining automobiles and stock information accurately and safely.
When the automobile is sold to the customer, stock will be reduced automatically. When a new purchase is made, stock will be increased automatically. While selecting automobiles for sale, the proposed software will automatically check for total number of available stock of that particular item, if the total stock of that particular item is less than 5, software will notify the user to purchase the particular item.
Also when the user tries to sale items which are not in stock, the system will prompt the user that the stock is not enough. Customers of this system can search for a automobile; can purchase a automobile easily by selecting fast. On the other hand the stock of automobiles can be maintained perfectly by the automobile shop manager overcoming the drawbacks of existing system.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two type of water scarcity. One is physical. The other is economic water scarcity.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]
1. Unai Lopez Novoa
19 June 2015
PhD Dissertation
Advisors: Jose Miguel-Alonso & Alexander Mendiburu
Contributions to the Efficient Use of
General Purpose Coprocessors:
Kernel Density Estimation as Case Study
2. Outline
• Introduction
• Contributions
1) A Survey of Performance Modeling and Simulation Techniques
2) S-KDE: An Efficient Algorithm for Kernel Density Estimation (and its implementation for multi- and many-cores)
3) Implementation of S-KDE in General Purpose Coprocessors
4) A Methodology for Environmental Model Evaluation based on S-KDE
• Conclusions
4. High Performance Computing
• Branch of computer science related to the use of parallel
architectures to solve complex computational problems
• Today’s fastest supercomputer: Tianhe-2 (China’s National University of Defense Technology, 33.86 PFLOP/s)
Introduction 4
5. HPC Environments
• Traditional HPC systems were homogeneous, built
around single or multi-core CPUs
• But supercomputers are becoming heterogeneous
Introduction 5
(Coprocessor number evolution in the Top500 list over time)
6. Compute platforms
Introduction 6

Device | Features | Peak D.P. Performance
Multi-core CPUs | Branch prediction, OoOE; “versatile” | Up to 250 GFLOP/s
Graphics Processing Units | Hundreds of cores; handle thousands of threads | Up to 1.8 TFLOP/s
Many-core Processors | Tens of x86 cores; HyperThreading | Up to 1 TFLOP/s
7. Motivation
• Examples of successful porting of applications to
accelerators (speedups compared against multi-core implementations):
• SAXPY: 11.8x
• Polynomial Equation Solver: 79x
• Image Treatment (MRI): 263x
• …
• … but this is not applicable for every HPC code
Introduction 7
Ryoo, Shane, et al. "Optimization principles and application performance evaluation of a
multithreaded GPU using CUDA." Proceedings of the 13th ACM SIGPLAN Symposium on Principles
and practice of parallel programming. ACM, 2008. (>700 citations on Google Scholar)
8. Difficulties using accelerators
• Suitable codes for accelerators should:
• Expose high levels of parallelism
• Have a good spatial/temporal data locality
• …
• Porting a code requires extensive program rewriting
• Development tools for accelerators are not as polished
as those for CPUs
Effectively exploiting the performance of a coprocessor remains a challenging task
Introduction 8
9. Structure of this thesis
Introduction 9
Motivation: discuss the issues of efficiently using general purpose coprocessors
Case study: Kernel Density Estimation applied to environmental model evaluation
• A survey of performance modeling and simulation techniques
• Design of a novel algorithm for Kernel Density Estimation: S-KDE
• S-KDE for multi- and many-cores
• S-KDE for accelerators
• A methodology for environmental model evaluation based on S-KDE
10. A Survey of Performance Modeling
and Simulation Techniques
11. Developing for accelerators
• Approaches/aids:
A Survey of Performance Modeling and Simulation Techniques 11
Trial and error
Profilers / Debuggers / …
Performance Models
12. A survey of models and simulators
• Accelerator & GPGPU trend began ~2005
• First performance models appeared ~2007
• Abundant literature
• No outstanding models or tools
A Survey of Performance Modeling and Simulation Techniques 12
13. Taxonomy
A Survey of Performance Modeling and Simulation Techniques 13
• Execution time estimation
• Bottleneck highlighting
• Power consumption estimation
• Simulators
14. Model analysis
• We analysed 29 relevant accelerator models
• For each of them we summarized and identified:
• Modeling method (Analytical, Machine Learning,…)
• Target platforms and test devices
• Input preprocessing requirements
• Limitations
• Highlights over other models
A Survey of Performance Modeling and Simulation Techniques 14
15. The MWP-CWP model
• Presented by Hong & Kim in 2009 (>360 citations on Google Scholar)
• Estimates the execution time of a GPU application
• Based on how Warps are scheduled in NVIDIA GPUs
A Survey of Performance Modeling and Simulation Techniques 15
Method | Test platform | Input requirements | Limitations | Highlights
Analytical | NVIDIA GPUs (8800GT, …) | Run µbenchmarks & parse PTX | Branches not modeled | Extendable to non-NVIDIA GPUs
16. The Roofline model
• Presented by Williams et al. in 2009 (>450 citations on Google Scholar)
• Outstanding model for bottleneck highlighting
• Visual model:
A Survey of Performance Modeling and Simulation Techniques 16
Method | Test platform | Input requirements | Limitations | Highlights
Analytical | Multi-core CPUs & accelerators | Run µbenchmarks & analyse application | Depends on architecture | Visual output to guide optimizations
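The Roofline bound is easy to state in code: attainable performance is the minimum of the compute ceiling and the memory ceiling (arithmetic intensity × peak bandwidth). A minimal sketch with illustrative hardware numbers (not taken from the survey):

```python
def roofline_gflops(peak_gflops, peak_bw_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the Roofline model.

    arithmetic_intensity is in FLOP per byte moved from memory;
    a kernel is memory-bound below the ridge point
    (peak_gflops / peak_bw_gbs) and compute-bound above it.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bw_gbs)

# Illustrative device: 1000 GFLOP/s peak, 200 GB/s bandwidth
# (ridge point at 5 FLOP/byte).
print(roofline_gflops(1000, 200, 2))   # memory-bound region
print(roofline_gflops(1000, 200, 8))   # compute-bound region
```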
17. Performance tools
• Some models require running performance tools
(µbenchmarks, profilers,…)
• We have reviewed them as well
A Survey of Performance Modeling and Simulation Techniques 17
18. Conclusions
1) There is no accurate model valid for a wide set of
architectures
2) Most models are tied to CUDA
3) There is a growing interest in analyzing power
4) A quantitative comparison of the models was not possible (lack of details, code, …)
A Survey of Performance Modeling and Simulation Techniques 18
19. S-KDE: An Efficient Algorithm for
Kernel Density Estimation
(and its implementation for Multi and Many-cores)
20. Case study
• Collaborative Work:
EOLO
UPV/EHU Climate and Meteorology Group
• Scenario:
Environmental Model Evaluation
• Problem:
Excessive execution times of KDE
S-KDE: An Efficient Algorithm for Kernel Density Estimation 20
21. Kernel Density Estimation
• Statistical technique used to estimate the Probability Density Function (PDF) of a random variable with unknown characteristics
f̂(x) = (1 / (n·H)) · Σᵢ K((x − xᵢ) / H)
• where:
• xᵢ are the samples from the random variable
• K is the kernel function
• H is the bandwidth value
S-KDE: An Efficient Algorithm for Kernel Density Estimation 21
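In the one-dimensional case with a scalar bandwidth, the estimator above fits in a few lines. A didactic sketch with a Gaussian kernel (not the thesis implementation):

```python
import math

def gaussian_kernel(u):
    # Standard normal density: symmetric and integrates to one.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, h):
    """Estimate the PDF at x from the samples with bandwidth h (1-D)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

# Density at the midpoint of two samples, one bandwidth away on each side:
print(round(kde(0.0, [-1.0, 1.0], 1.0), 5))
```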
22. Kernel function
• Symmetric function that integrates to one
• We classify them according to area of influence
S-KDE: An Efficient Algorithm for Kernel Density Estimation 22
[Plots of density vs. x: Epanechnikov kernel (bounded support, zero outside [−1, 1]) and Gaussian kernel (unbounded support)]
23. Bandwidth
• Parameter to control the smoothness of the estimation
• It must be carefully selected
• Common approaches for its selection:
• Heuristics, as in Silverman, 1986
• Iterative techniques, e.g., bootstrapping
S-KDE: An Efficient Algorithm for Kernel Density Estimation 23
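Silverman's heuristic has a closed form in the 1-D Gaussian-kernel case, h = 1.06·σ·n^(−1/5). The thesis works in several dimensions; this scalar version is only for illustration:

```python
import statistics

def silverman_bandwidth(samples):
    """Silverman's rule of thumb for a 1-D Gaussian kernel:
    h = 1.06 * sigma * n ** (-1/5)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)

h = silverman_bandwidth(list(range(100)))
print(h)
```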
24. Computing KDE
S-KDE: An Efficient Algorithm for Kernel Density Estimation 24

Naive approach (EP-KDE), complexity O(|E|·|S|):
  for each eval_point e in E
    for each sample s in S
      d = distance(e, s)
      e += density(d)

Our proposal (S-KDE), complexity O(|B|·|S|):
  for each sample s in S
    B = findInfluenceArea(s)
    for each eval_point e in B
      d = distance(e, s)
      e += density(d)
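For a bounded kernel such as Epanechnikov (zero outside |u| ≤ 1), the two loops above give identical densities, since evaluation points outside a sample's influence area contribute zero. A 1-D sketch of the idea on a regular grid (index arithmetic simplified; not the thesis code):

```python
def epanechnikov(u):
    # Bounded kernel: zero outside [-1, 1].
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def ep_kde(grid, samples, h):
    # Naive scan: every evaluation point visits every sample.
    n = len(samples)
    return [sum(epanechnikov((e - s) / h) for s in samples) / (n * h)
            for e in grid]

def s_kde(grid, samples, h, step):
    # S-KDE idea: each sample updates only the grid points inside
    # its influence area [s - h, s + h]; step is the grid spacing.
    dens = [0.0] * len(grid)
    x0 = grid[0]
    for s in samples:
        lo = max(0, int((s - h - x0) / step))
        hi = min(len(grid) - 1, int((s + h - x0) / step) + 1)
        for i in range(lo, hi + 1):
            dens[i] += epanechnikov((grid[i] - s) / h)
    n = len(samples)
    return [d / (n * h) for d in dens]

grid = [i * 0.1 for i in range(21)]          # 0.0 .. 2.0
naive = ep_kde(grid, [0.5, 1.5], 0.3)
fast = s_kde(grid, [0.5, 1.5], 0.3, 0.1)
assert all(abs(a - b) < 1e-12 for a, b in zip(naive, fast))
```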
25. Delimiting the influence area
S-KDE: An Efficient Algorithm for Kernel Density Estimation 25
• Depends on the Kernel
• Our case: Epanechnikov kernel
• Technique based on a method in Fukunaga, 1990
26. Chop & Crop
• In spaces of dimensionality 3 and higher, the number of
evaluation points outside the influence area increases
• We developed a technique to further reduce evaluations:
S-KDE: An Efficient Algorithm for Kernel Density Estimation 26
Step 1: Chop the box into slices
Step 2: Crop the slice
27. Example numbers
• Dataset: 500k samples, 3D
• Evaluation space: 194M evaluation points
• EP-KDE: 9.74 × 10^13 distance-density evaluations
• S-KDE: 102,461 evaluation points per bounding box (average) → 5.12 × 10^10 evaluations
• S-KDE + Chop & Crop: 53,511 evaluation points per bounding box (average) → 2.67 × 10^10 evaluations
S-KDE: An Efficient Algorithm for Kernel Density Estimation 27
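These counts are simply samples × average evaluation points visited per sample. A quick check of the slide's figures (the 194M grid size is rounded, so the naive count lands slightly below the quoted 9.74 × 10^13):

```python
samples = 500_000
eval_space = 194_000_000   # rounded size of the evaluation grid

ep_kde_evals = samples * eval_space   # naive: every (point, sample) pair
s_kde_evals = samples * 102_461       # avg. points per bounding box
cc_kde_evals = samples * 53_511       # after Chop & Crop

assert 9.6e13 < ep_kde_evals < 9.8e13                 # ~9.7e13
assert abs(s_kde_evals - 5.12e10) / 5.12e10 < 0.01
assert abs(cc_kde_evals - 2.67e10) / 2.67e10 < 0.01
print(f"reduction vs. naive: {ep_kde_evals / cc_kde_evals:.0f}x")
```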
28. S-KDE in OpenMP
S-KDE: An Efficient Algorithm for Kernel Density Estimation 28
• Initialization
• Distribute samples to threads (#pragma omp for)
• Fit bounding box
• Chop into slices
• Crop and compute density (#pragma simd)
• Accumulate density to evaluation space (#pragma omp atomic)
29. S-KDE in OpenMP
• Targeting Multi and Many core processors
• Tested platforms:
• Intel Core i7 3820 CPU (4 cores @ 3.6 GHz)
• Intel Xeon Phi 3120A (57 Cores @ 1.1 GHz, Native mode)
• Public KDE implementations used as yardsticks:
• Ks-kde (R Package)
• GPUML
• Several Python libraries
S-KDE: An Efficient Algorithm for Kernel Density Estimation 29
31. Conclusions
1) S-KDE + Chop & Crop reduces KDE complexity
2) Native, parallel implementation for multi- and many-core processors using OpenMP
3) We beat state-of-the-art alternatives
S-KDE: An Efficient Algorithm for Kernel Density Estimation 31
33. S-KDE in OpenCL
Implementation of S-KDE in General Purpose Coprocessors 33
(1) Initialization
(2) Fit box & chop
(3) Crop
(4) Offset calculation
(5) Density computation
(6) Density transfer
(7) Density accumulation
• Host code
• Accelerator code
35. Conclusions
1) OpenCL version of S-KDE provides good overall
performance
2) The consolidation stage is the main bottleneck
3) The code is close to the limits of the accelerators
4) Further performance gains using pipelined execution
Implementation of S-KDE in General Purpose Coprocessors 35
37. Climate models
• Mathematical representations of a climate system, based
on physical, chemical and biological principles
• They predict long-term climate trends
• Recently used to assess the impact of greenhouse gases
A Methodology for Environmental Model Evaluation based on S-KDE 37
38. Climate model evaluation
• Models must be validated against actual observations
• There is no universally accepted validation strategy
• Popular approaches:
• Averaged values per estimated variable
• Evaluating the per-variable Probability Density Functions (PDFs)
A Methodology for Environmental Model Evaluation based on S-KDE 38
39. PDF-based model evaluation
• Current approaches:
1) Compute the PDF per estimated variable
2) Calculate similarity score per-variable against observations
3) Combine the scores to get global performance of the model
• Lack of a universally accepted way to combine the scores
• Our proposal:
• An extension of the score by [1] to multiple dimensions
• A methodology to evaluate multiple variables in a single step
A Methodology for Environmental Model Evaluation based on S-KDE 39
[1]: Perkins, S. E., et al. "Evaluation of the AR4 climate models' simulated daily maximum
temperature, minimum temperature, and precipitation over Australia using probability density
functions." Journal of climate 20.17 (2007): 4356-4376.
40. Methodology
A Methodology for Environmental Model Evaluation based on S-KDE 40
1) Estimate the optimal bandwidth (iterative use of KDE)
   e.g., estimations (MIROC3.2-MR model): h = 0.6; observations: h = 0.65
2) Compute the PDFs with the optimal bandwidths (single use of KDE)
   PDF (estimations), h = 0.6; PDF (observations), h = 0.65
3) Compute the score: S = 0.74
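The Perkins et al. score that step 3 extends is the overlap of the two PDFs: the sum, over the bins of a common grid, of the minimum of the modelled and observed probabilities (1 for identical PDFs, 0 for disjoint ones). A minimal discrete sketch (the binning is mine, not the thesis code):

```python
def perkins_score(pdf_model, pdf_obs):
    """Overlap of two discrete PDFs defined on the same bins.

    Both inputs are per-bin probabilities summing to one;
    the score is the summed per-bin minimum.
    """
    return sum(min(m, o) for m, o in zip(pdf_model, pdf_obs))

identical = [0.2, 0.3, 0.5]
print(perkins_score(identical, identical))        # identical PDFs: score 1
print(perkins_score([1.0, 0.0], [0.0, 1.0]))      # disjoint PDFs: score 0
```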
41. Evaluation
• Models: 7 from CMIP3 experiment (with different configurations)
• Dataset: 20C3M (1961 to 1998 on a daily basis)
• Variables:
• Global average of surface temperature
• Difference in temperature between N and S hemispheres
• Difference in temperature between Equator and the poles
• Scores for the models:
A Methodology for Environmental Model Evaluation based on S-KDE 41

Model | Score
NCEP | 0.82
MIROC3.2-MR-2 | 0.74
MIROC3.2-MR-3 | 0.73
HADGEM1 | 0.71
MIROC3.2-MR-1 | 0.70
MIROC3.2-HR | 0.67
GFDL-CM2.1 | 0.62
GFDL-CM2.0 | 0.60
BCM2.0 | 0.51
ECHAM5 | 0.48
MRI-RUN03 | 0.30
MRI-RUN04 | 0.29
MRI-RUN01 | 0.29
MRI-RUN02 | 0.29
MRI-RUN05 | 0.28
42. Evaluation
A Methodology for Environmental Model Evaluation based on S-KDE 42
[Figure: joint PDFs over C0 (global average surface temperature) and C1 (difference in temperature between hemispheres); surface: observations, contour: model. MIROC3.2-MR-RUN02: score = 0.74; MRI-RUN01: score = 0.29]
43. Conclusions
1) We have presented a methodology based on the extension to multiple dimensions of the index by Perkins et al.
2) It allows evaluating multiple variables of an environmental model in a single step
3) It is feasible in time thanks to the use of a fast implementation of KDE: S-KDE
A Methodology for Environmental Model Evaluation based on S-KDE 43
45. Summary of contributions
•We have conducted an extensive survey on performance
models using a proposed taxonomy
•We have designed S-KDE, a technique that reduces the
complexity of Kernel Density Estimation computations
•We have implemented S-KDE for Multi and Many-cores
using OpenMP
• Outperforming the state-of-the-art parallel codes for KDE
Conclusions 45
46. Summary of contributions
• We have presented an OpenCL implementation of S-KDE
for general purpose coprocessors.
• It reaches the limits of the devices and acceptable performance,
but requires further work
• We have designed a methodology for environmental model evaluation based on KDE that allows evaluating multiple variables from a model accurately and simply
• S-KDE is a key, enabling element
Conclusions 46
47. Future work
• We intend to develop a methodology for the performance evaluation of accelerator-based applications, based on the survey presented as the first contribution
• We need to improve S-KDE in both multi-cores and
coprocessors
• In particular, the consolidation stage
• We intend to design a technique to analyse new climate
data from the CMIP Project, with dimensionality up to ten
Conclusions 47
48. Publications
Conclusions 48
Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso.
A survey of performance modeling and simulation techniques for
accelerator-based computing. IEEE Transactions on Parallel and
Distributed Systems, 26(1):272–281, Jan 2015
Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, and Jose
Miguel-Alonso. An efficient implementation of kernel density
estimation for multi-core & many-core architectures. International
Journal of High Performance Computing Applications, Accepted,
2015, DOI: 10.1177/1094342015576813
49. Publications
Conclusions 49
Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-
Alonso. Kernel density estimation in accelerators: Implementation
and performance evaluation. Parallel Computing. To be
submitted.
Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, Jose
Miguel-Alonso, Iñigo Errasti, Ganix Esnaola, Agustín Ezcurra, and
Gabriel Ibarra-Berastegi. Multi-objective environmental model
evaluation by means of multidimensional kernel density
estimators: Efficient and multi-core implementations.
Environmental Modelling & Software, 63:123 – 136, 2015
50. Unai Lopez Novoa
19 June 2015
PhD Dissertation
Advisors: Jose Miguel-Alonso & Alexander Mendiburu
Contributions to the Efficient Use of
General Purpose Coprocessors:
Kernel Density Estimation as Case Study