FPGAs can compete with GPUs for some applications but with some key differences:
1) FPGAs are configured to create custom hardware for an algorithm rather than using predefined hardware like GPUs. This allows high efficiency but is more difficult to program.
2) While OpenCL provides a common language, FPGAs and GPUs have very different architectures and optimizing algorithms requires different approaches for each.
3) For applications with high bandwidth I/O or flexibility requirements, FPGAs may have advantages over GPUs, but GPUs typically have higher performance for compute-heavy applications and better energy efficiency. Overall, FPGAs have become more accessible but still require more programming effort than GPUs.
Parallel Application Performance Prediction Using Analysis Based Modeling - Jason Liu
Parallel Application Performance Prediction Using Analysis Based Models and HPC Simulations, Mohammad Abu Obaida, Jason Liu, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2018 SIGSIM Principles of Advanced Discrete Simulation (SIGSIM-PADS’18), May 2018.
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.
Data Analytics and Simulation in Parallel with MATLAB* - Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Declare Your Language: Virtual Machines & Code Generation - Eelco Visser
The document summarizes virtual machines and code generation. It discusses how high-level programming languages are abstracted from low-level machine details through virtual machines. The Java Virtual Machine architecture and bytecode instructions are described, including its stack-based design, threads, heap, and method area. Code generation mechanics like string operations are also covered.
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 - Unity Technologies
This session addresses how we are expanding the scope of the Burst Compiler to enable even the most demanding, hand-coded engine and gameplay problems to be expressed in HPC# via direct CPU intrinsics. Andreas shares the reasoning and use cases; as well as discussing implementation challenges, debugging, and performance along with comparisons to C++ code.
Speaker: Andreas Fredriksson - Unity
Watch the session on YouTube: https://youtu.be/BpwvXkoFcp8
spaGO: A self-contained ML & NLP library in Go - Matteo Grella
Introduction to spaGO, a beautiful and maintainable machine learning library written in Go designed to support relevant neural network architectures in natural language processing tasks.
Github: https://github.com/nlpodyssey/spago
There are many reasons to convert managed languages to native code: performance first of all, but also protection from reverse engineering and support for hardware technologies or certain specialized platforms. In this talk we look at an example of building a C#-to-C++ converter and the nuances that come up when solving this problem.
Introduction to Chainer, a framework for neural networks, v1.11. Slides used for the student seminar on July 20, 2016, at the Sugiyama-Sato lab at the University of Tokyo.
With the arrival of the new C++ standards, a noticeable revival of developer interest in the language is under way. That interest is also fueled by the opportunities that new architectures and technologies bring. The need for efficient parallel algorithms and applications entails a need for libraries and frameworks that can properly adapt and scale a workload across the whole computing system. In this talk Anton presents HPX, a library that addresses these problems: it is based on the new ParalleX execution model, offers an interface compatible with the C++11/14/17 thread library, and greatly extends it in several directions. The treats of the parallel world from C++17, the Concurrency and Parallelism Technical Specifications, will not be left out either!
This document summarizes a reconfigurable system with Linux load on an FPGA. It discusses the software architecture including socket communication, device communication, and architectural layers/details. It then describes the reconfiguration of the FPGA with Linux, including module loading/unloading. Performance results are provided for startup, driver setup, and module/data loading. Finally, future work is discussed around improving algorithms, adding Ethernet support, and a distributed scenario.
1) The document presents various algorithms for efficiently transposing matrices while minimizing memory accesses and cache misses.
2) It analyzes the algorithms under different memory models: RAM, I/O, cache, and cache-oblivious. The block transpose, half/full copying, and Morton layout algorithms improve performance by reusing data blocks.
3) Experimental results on a 300MHz system show the Morton layout and half copying algorithms have the fastest runtimes due to minimizing data references, L1 misses, and TLB misses. The relative performance of algorithms depends on cache miss latency.
The document discusses three sanitizers - AddressSanitizer, ThreadSanitizer, and MemorySanitizer - that detect bugs in C/C++ programs. AddressSanitizer detects memory errors like buffer overflows and use-after-frees. ThreadSanitizer finds data races between threads. MemorySanitizer identifies uses of uninitialized memory. The sanitizers work by instrumenting code at compile-time and providing a run-time library for error detection and reporting. They have found thousands of bugs in major software projects with reasonable overhead. Future work includes supporting more platforms and detecting additional classes of bugs.
1) Template metaprogramming allows performing computations at compile time using templates.
2) In 1994, Erwin Unruh discovered template metaprogramming accidentally when his program calculated the first 30 prime numbers as part of a compiler error message.
3) Template metaprogramming is Turing complete and can be used to implement recursive functions and algorithms that execute at compile time rather than run time.
HPX: a C++11 runtime system for parallel and distributed computing - Platonov Sergey
The document discusses HPX, a C++ runtime system for parallel and distributed computing. It provides asynchronous and remote operations through futures. Futures allow transparent synchronization between producers and consumers of asynchronous operations. The document provides examples of using futures to parallelize recursive filters by futurizing the algorithms. This allows overlapping computation and hiding latencies. Futures can also be used to execute actions on remote localities in a distributed system.
Hadoop classes in Mumbai
Best Android classes in Mumbai with job assistance.
Our features are:
expert guidance by IT industry professionals
lowest fees of 5000
practical exposure to handle projects
well-equipped lab
resume-writing guidance after the course
The slides from my parallel programming talk at LCA 2011. It is an overview of several languages that offer parallel programming paradigms, with a strong bias towards functional programming.
This is a survey of the HPCS languages (Chapel, X10, and Fortress), comparing the idioms with which each supports parallel programming. The paper is available at http://grids.ucs.indiana.edu/ptliupages/publications/Survey_on_HPCS_Languages_formatted_v2.pdf
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
DUSK - Develop at Userland, Install into Kernel - Alexey Smirnov
DUSK is a framework that allows kernel modules to be developed at the user level by compiling them into a user-level program while still maintaining the performance of running in the kernel. It uses helper functions to connect the user-level component to the kernel-level component, allowing things like debugging and testing to be done at the user level. DUSK supports Netfilter modules initially and aims to provide an easier development process for kernel modules.
This document discusses coding style guidelines for logic synthesis. It begins with basic concepts of logic synthesis such as converting a high-level design to a gate-level representation using a standard cell library. It then discusses synthesizable Verilog constructs and coding techniques to improve synthesis like using non-blocking assignments in sequential logic blocks. The document also provides guidelines for coding constructs like if-else statements, case statements, always blocks and loops to make the design easily synthesizable. Memory synthesis approaches and techniques for designing clocks and resets are also covered.
The document discusses using GCC's auto-vectorizer to optimize loops. It provides flags and options for enabling vectorization, checking which loops were vectorized, and tips for writing vectorizable code. Examples are given of vectorized NEON code for improved performance. The Linaro Toolchain group works on vectorization and related optimizations, and examples from users can help with vectorization efforts.
This document discusses dynamic memory allocation in C using four library functions: malloc(), calloc(), realloc(), and free(). It explains that malloc() allocates memory and returns a pointer, calloc() allocates memory and initializes it to zero, realloc() changes the size of previously allocated memory, and free() frees memory allocated by the other functions. Code examples are provided to illustrate usage of each function.
This document introduces Mahout Scala and Spark bindings, which aim to provide an R-like environment for machine learning on Spark. The bindings define algebraic expressions for distributed linear algebra using Spark and provide optimizations. They define data types for scalars, vectors, matrices and distributed row matrices. Features include common linear algebra operations, decompositions, construction/collection functions, HDFS persistence, and optimization strategies. The goal is a high-level semantic environment that can run interactively on Spark.
The document discusses techniques for optimizing memory usage through bit packing and value type polymorphism. It describes:
1. Bit packing techniques like storing multiple values in a single integer using bitwise operations to reduce memory usage. This includes examples of packing booleans and enums.
2. Using a "tagged union" approach to represent different value types polymorphically by storing a type tag and common data in a single value.
3. The concept of "value type polymorphism" where subtypes all fit within a size budget by using a tag to differentiate them while presenting a common API. This allows efficiently representing types in a compiler.
This document summarizes a lecture on using GPU compute languages for advanced graphics processing beyond traditional programmable shading. It discusses using GPU compute APIs for tasks like building histograms, deferred rendering, and custom graphics pipelines. It provides definitions for key concepts in GPU execution like tasks, parallelism, and synchronization. Examples are given of using compute shaders for building histograms from pixel data and implementing a tiled particle rasterization pipeline. Optimizations like processing multiple tiles per workgroup are discussed to improve performance.
The document provides an overview of sanitizers, which are dynamic testing tools that detect bugs like buffer overflows and uninitialized memory reads. It focuses on Address Sanitizer (ASan), which detects invalid address usage bugs, and Undefined Behavior Sanitizer (UBSan), which finds unspecified code semantic bugs. ASan works by dividing memory into main and shadow spaces and instruments code to check shadow values for poisoning. UBSan detects issues like integer overflow and out-of-bounds memory access. Both tools are compiler-instrumented to add checks and generate detailed reports of encountered bugs.
Spark 4th Meetup London - Building a Product with Spark - samthemonad
This document discusses common technical problems encountered when building products with Spark and provides solutions. It covers Spark exceptions like out of memory errors and shuffle file problems. It recommends increasing partitions and memory configurations. The document also discusses optimizing Spark code using functional programming principles like strong and weak pipelining, and leveraging monoid structures to reduce shuffling. Overall it provides tips to debug issues, optimize performance, and productize Spark applications.
H2O Design and Infrastructure with Matt Dowle - Sri Ambati
This document provides an overview of H2O, an open source machine learning platform that allows for distributed, in-memory analytics of large datasets. It discusses how H2O works, including how it uses a map-reduce style to parallelize machine learning algorithms across multiple nodes. The document demonstrates starting an 8-node H2O cluster on Amazon EC2 and importing a 23GB dataset in under a minute, significantly faster than with other tools. It also summarizes how H2O's distributed fork-join framework executes tasks across nodes and shares data through its distributed data structures.
pg_proctab: Accessing System Stats in PostgreSQL - Mark Wong
pg_proctab is a collection of PostgreSQL stored functions that provide access to the operating system process table using SQL. We'll show you which functions are available and where they collect the data, and give examples of their use to collect processor and I/O statistics on SQL queries.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi - Databricks
The document discusses using CNTK (Microsoft Cognitive Toolkit) for natural language processing and deep learning within Spark pipelines. It provides information on mmlspark, which allows embedding CNTK models into Spark. It also discusses using CNTK to analyze data from GitHub commits and relate code changes to natural language comments through sequence-to-sequence models.
The lecture discusses manycore GPU architectures and programming using OpenMP and HOMP. It introduces OpenMP directives for offloading computation to accelerators and covers data mapping between the host and device. It also discusses HOMP for automated distribution of parallel loops and data across multiple accelerators to improve load balancing and performance. The document provides examples of using OpenMP target directives and data mapping for problems like AXPY and Jacobi iteration on a GPU. It evaluates performance of different loop scheduling algorithms in HOMP on a system with CPUs, GPUs and MICs.
Exploiting GPUs for Columnar DataFrames by Kiran Lonikar - Spark Summit
Kiran Lonikar proposes extending Project Tungsten in Spark SQL to enable parallel execution of DataFrame operations on GPUs. The proposal involves refactoring DataFrames to use a columnar layout and generating OpenCL code for batched execution across columns. Initial results show speedups from GPU execution. Future work includes supporting multi-GPU execution and adapting additional systems like Impala that may be better suited than Spark for GPU integration.
Week 1: Electronic System-level (ESL) Design and SystemC Begin - 敬倫 林
This document provides an introduction and overview of electronic system level (ESL) design using SystemC. It begins with background on ESL design basics, system on chip design flows, and SystemC. It then provides 3 examples of SystemC code: a counter, traffic light, and simple bus. The counter example shows a basic module with clocked process. The traffic light demonstrates a finite state machine. The bus example illustrates an interface, master/slave devices, and memory mapped components communicating over a bus. Overall, the document serves as an introductory tutorial for designing and modeling electronic systems using the SystemC language.
Cray XT Porting, Scaling, and Optimization Best Practices - Jeff Larkin
The document discusses optimization best practices for Cray XT systems. It covers choosing compilers and compiler flags, profiling and debugging codes at scale with hardware performance counters and CrayPAT tools, optimizing communication with MPI by using techniques like pre-posting receives and reducing collectives, and optimizing I/O. The document emphasizes testing optimizations on the number of nodes the application will actually run on.
Published on 11 May 2018
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Hybrid parallel programming uses both message passing (e.g. MPI) and shared memory parallelism (e.g. OpenMP). MPI is used to distribute work across multiple computers while OpenMP parallelizes work within each computer across multiple cores. This approach can improve performance over MPI-only for problems where communication between computers is expensive compared to synchronization within a computer. However, for matrix multiplication experiments, a hybrid MPI-OpenMP approach did not show better performance than MPI-only. Larger problem sizes or different algorithms may be needed to realize benefits of the hybrid approach.
This document provides an introduction and overview of MPI (Message Passing Interface). It discusses:
- MPI is a standard for message passing parallel programming that allows processes to communicate in distributed memory systems.
- MPI programs use function calls to perform all operations. Basic definitions are included in mpi.h header file.
- The basic model in MPI includes communicators, groups, and ranks to identify processes. MPI_COMM_WORLD identifies all processes.
- Sample MPI programs are provided to demonstrate point-to-point communication, collective communication, and matrix multiplication using multiple processes.
- Classification of common MPI functions like initialization, communication, and information queries are discussed.
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
The document discusses distributed computing and the MapReduce programming model. It provides examples of how Folding@home and PS3s contribute significantly to distributed computing projects. It then explains challenges with inter-machine parallelism like communication overhead and load balancing. The document outlines Google's MapReduce model which handles these issues and makes programming distributed systems easier through its map and reduce functions.
The document provides security tips and best practices for building web applications in Go. It discusses Go's type system, concurrency model, and standard library features. It also summarizes common vulnerabilities like SQL injection and XSS, and recommends using parameterized queries and HTML escaping to prevent them. Finally, it highlights tools like Gorilla and Gin web frameworks, and techniques like rate limiting and secure cookies to build secure Go applications.
Compiler Construction | Lecture 12 | Virtual MachinesEelco Visser
The document discusses the architecture of the Java Virtual Machine (JVM). It describes how the JVM uses threads, a stack, heap, and method area. It explains JVM control flow through bytecode instructions like goto, and how the operand stack is used to perform operations and hold method arguments and return values.
The document discusses how scripting languages like Python, R, and MATLAB can be used to script CUDA and leverage GPUs for parallel processing. It provides examples of libraries like pyCUDA, rGPU, and MATLAB's gpuArray that allow these scripting languages to interface with CUDA and run code on GPUs. The document also compares different parallelization approaches like SMP, MPI, and GPGPU and levels of parallelism from nodes to vectors that can be exploited.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
HPC Essentials 0
1. HPC Essentials Prequel: From 0 to HPC in One Hour
OR: five ways to do Kriging
Bill Brouwer
Research Computing and Cyberinfrastructure (RCC), PSU
wjb19@psu.edu
3. Step 0
● Get an account on our systems
● Check out the system details, or let us help pick one for you
● They are Linux systems; you'll need some basic command-line knowledge
– You may want to check out the HPC Essentials I seminar, a Unix/C overview
● We use the modules system for software; you'll need to load what you use. E.g., to see a list of everything available:
module av
E.g., to load Octave:
module load octave
To see which modules you have in your environment:
module list
4. Step 0
● There are two main types of systems:
– Interactive: share a single machine with one or more users, including memory and CPUs; used for
● Debugging
● Benchmarking
● Using a program with a graphical user interface
● Running for short periods of time
– You'll need to log in using Exceed onDemand
5. Step 0
● Batch systems
– Get dedicated memory and CPUs for a period of time
● Maximum time is generally 24 hours
● Maximum memory and CPUs depend on the cluster
– You log in to a head node, from which you submit a request, e.g., an interactive session for 1 node, 1 processor per node (ppn), and 4 GB total memory:
qsub -I -l walltime=24:00:00 -l mem=4gb -l nodes=1:ppn=1
● To check the status of your request:
qstat -u <your_psu_id>
6. Step 0
● Other notes on clusters:
– Please never run anything significant on head nodes; use PBS to submit a job instead
– If you request more than 1 CPU, remember your code/workflow needs to be able to either
● Use multiple CPUs on a single node (set the ppn parameter) using some form of shared memory parallelism
● Use multiple CPUs on multiple nodes (set a combination of node & ppn parameters) using some form of distributed memory parallelism
● A combination of the above
● Parallelism applied in an optimal way is high performance computing
7. High Performance Computing
● Using one or more forms of parallelism to improve the performance and scaling of your code
– Vector architecture, e.g., SSE/AVX in Intel CPUs
– Shared memory parallelism, e.g., using multiple cores of a CPU
– Distributed memory parallelism, e.g., using the Message Passing Interface (MPI) to communicate between CPUs or GPUs
– Accelerators, e.g., Graphics Processing Units
8. Typical Compute Node
[Block diagram of a typical compute node: the CPU connects to RAM over the memory bus, to the IOH over the QuickPath Interconnect, and via PCI-express to the GPU and other PCI-e cards; the ICH, reached over the Direct Media Interface, serves SATA/USB non-volatile storage, the BIOS, and ethernet to the network.]
9. CPU Architecture
● Composed of several complex processing cores, control elements, and high-speed memory areas (e.g., registers, L3 cache), as well as vector elements including special registers
[Diagram: a multicore CPU with four cores sharing a cache, a memory controller, and I/O/PCIe interfaces.]
10. Shared + Distributed Memory Parallelism
● Shared memory parallelism is:
– usually implemented with pThreads or directive-based programming (OpenMP)
– uses one or more cores in a CPU
● Distributed memory parallelism is:
– one or more nodes (composed of CPUs + possibly GPUs) communicating with each other using a high-speed network, e.g., Infiniband
– network topology and fabric are critical to ensuring optimal communication
11. Nvidia GPU Streaming Multiprocessor
[Diagram: streaming multiprocessor with 32 CUDA cores (16 x 2), 32768 x 32-bit registers, 64 kB shared memory/L1 cache, two warp schedulers with dispatch units, 16 load/store units, and 4 special function units; each CUDA core contains a dispatch port, operand collector, FPU, integer unit, and result queue.]
● GPUs run many lightweight threads at once; the device is composed of many more (simpler) cores than a CPU
12. Step 1: Prototype your problem
● Pick a numerical scripting language, e.g., Octave, a free alternative to Matlab
– Solid, well established, linear algebra based
● Code up a solution (e.g., we'll consider ordinary kriging)
● Time all scopes/sections of your code to get a feel for bottlenecks
● You can use the keyboard statement to set breakpoints in your code for debugging purposes
13. Step 1: Prototype your problem
● Kriging is a geospatial statistical method, e.g., predicting rainfall for locations where no measurements exist, based on surrounding measurements
● The solution involves:
– constructing the Gamma matrix
– solving a system of equations for every desired prediction location
14. Step 1: Prototype your problem
function [w,G,g,pred] = krige()
% load input data & output prediction grid
load input.csv; load output.csv;
% init
…
% Gamma; m is size of input space, x,y are coordinates for available data z
for i=1:m
  for j=1:m
    G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))^2+(y(i)-y(j))^2)/3.33));
  end
end
% matrix inversion
Ginv = inv(G);
% predictions; n is size of output space, xp,yp are prediction coordinates
% z is available data for x,y coordinates
for i=1:n
  g(1:m) = 10.*(1-exp(-sqrt((xp(i)-x).^2+(yp(i)-y).^2)/3.33));
  w = Ginv * g';
  pred(i) = sum(w(1:m).*z);
end
15. Results 1
● Use tic/toc statements around code blocks for timing; the following times are for:
– Initialization
– Gamma construction
– Matrix inversion
– Solution
octave:1> [a b c d]=krige();
Elapsed time is 0.079224 seconds.
Elapsed time is 40.9722 seconds.
Elapsed time is 0.742576 seconds.
Elapsed time is 10.6134 seconds.
● 80% of the time is spent constructing the matrix → need to vectorize
● Interpreted languages like Octave benefit from removing loops and replacing them with array operations
– Loops are parsed every iteration by the interpreter
– Vectorizing code by using array operations may take advantage of the vector architecture in the CPU
16. Step 2: Vectorize your Prototype
function [w,G,g,pred] = krige()
% load input data & output prediction grid
load input.csv; load output.csv;
% init
…
% Gamma
XI = (ones(m,1)*x)'; YI = (ones(m,1)*y)';
G(1:m,1:m) = 10.*(1-exp(-sqrt((XI-XI').^2+(YI-YI').^2)/3.33));
% matrix inversion
Ginv = inv(G);
% predictions
XP = (ones(m,1)*xp); YP = (ones(m,1)*yp);
XI = (ones(n,1)*x)'; YI = (ones(n,1)*y)';
ZI = (ones(n,1)*z)';
g(1:m,:) = 10.*(1-exp(-sqrt((XP-XI).^2+(YP-YI).^2)/3.33));
w = Ginv * g;
pred = sum(w(1:m,:).*ZI);
17. Results 2
octave:2> [a b c d]=krige();
Elapsed time is 0.0765891 seconds.
Elapsed time is 0.195605 seconds.
Elapsed time is 0.758174 seconds.
Elapsed time is 3.24861 seconds.
● Code is more than 15x faster, for a relatively small investment
● Vectorized code will have a higher memory overhead, due to the creation of temporary arrays; it's harder to read too :)
● When memory or compute time become unacceptable, there's no choice but to move to compiled code
● C/C++ are logical choices in a Linux environment
– Very stable, heavily used; the Linux OS itself is written in C
– Expressive languages containing many innovations, algorithms, and data structures
– C++ is object oriented, and allows for the design of large, sophisticated projects
18. Step 3 : Compiled Code
● Unlike a scripted language, C/C++ must be compiled to run on the CPU, converting a human-readable language into machine code
● Several compilers are available on the clusters, including Intel, PGI, and the GNU compiler collection
● In the compilation and linking steps we must specify headers (with interfaces) and libraries (with functions) needed by our application
● Try to avoid reinventing the wheel; always use available libraries if you can instead of reimplementing algorithms and data structures
● As opposed to scripting, you are now responsible for memory management, e.g., allocating on the heap (dynamically at runtime) or on the stack (statically at compile time)
19. Step 3 : Compiled Code
● In porting Octave/Matlab code to C/C++ you should always consider using at least these libraries:
– Armadillo, C++ wrappers for BLAS/LAPACK, with syntax very similar to Octave/Matlab
– BLAS/LAPACK itself
● BLAS == Basic Linear Algebra Subprograms
● LAPACK == Linear Algebra PACKage
● Both come in many optimized flavors, e.g., Intel MKL
● If you want to know more about Linux basics, including writing/compiling C code, you could check out HPC Essentials I
● If you want to know more about C++, you could check out HPC Essentials V
20. Step 3 : Compiled Code
#include <armadillo>
#include <mkl.h>
#include <iostream>
using namespace std;
using namespace arma;

int main(){
  mat G; vec g;
  // load data, initialize variables, calculate Gamma
  for (int i=0; i<m; i++)
    for (int j=0; j<m; j++){
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                               +(y(i)-y(j))*(y(i)-y(j)))/3.33));
    }
  char uplo = 'U'; int N = m+1; int info;
  int * ipiv = new int[N]; double * work = new double[3*N];
  // factorize using the LU decomp. routine from LAPACK
  dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
  // solve
  int nrhs = 1; char trans = 'N';
  for (int i=0; i<n; i++){
    g.rows(0,m-1) = ...
    dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
    pred(i,0) = dot(z,g.rows(0,m-1));
    …
  }
}
21. Results 3
● Compiled code is comparable in speed to the vectorized code, although we could make some algorithmic changes to improve further:
– The Gamma matrix is symmetric, so there's no need to calculate values for j > i (i.e., just calculate/store a triangular matrix)
– Calculating the inverse is expensive and inaccurate; it's better to factorize the matrix and use a direct solve, e.g., using forward/backward substitution (we did do this, but using the full matrix/LU decomp.)
– Armadillo uses operator overloading & expression templates to allow a vectorized approach to programming, although we leave loops in for the moment, to allow parallelization later
● If you have bugs in your code, use gdb to debug
● Always profile completely in order to solve all issues and get a complete handle on your code
22. Important Code Profiling Methods
● Solving memory leaks: use valgrind
● Poor memory access patterns/cache usage
– Use valgrind --tool=cachegrind to assess cache hits + misses
● Heap memory usage
– Memory management has a performance impact; assess with valgrind --tool=massif
● And before you consider moving to parallel, develop a call profile for your code, e.g., in terms of total instructions executed for each scope, using valgrind --tool=callgrind
23. Amdahl's Law
● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (e.g., quantum chemistry) or up (e.g., astrophysics)
● As a natural consequence, we seek both performance and scaling in our scientific applications; thus we parallelize as we run out of resources using a single processor
● We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:
speedup = 1 / ((1 - P) + P/N)
where P is the portion of application code we parallelize, and N is the number of processors; i.e., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
24. Amdahl's Law
● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
25. Step 4 : Accelerate
● In general not all algorithms are amenable, and there is the communication bottleneck between CPU and GPU to overcome
● However, linear algebra operations are extremely efficient on the GPU; you can expect 2-10x over a whole CPU socket (i.e., running all cores) for many operations
● The language for programming Nvidia series GPUs is CUDA; much like C, but you need to know the architecture well and/or:
– Use libraries like cuBLAS (what we'll try)
– Use directive-based programming in the form of OpenACC
– Use the OpenCL language (cross platform, but not as heavily supported by Nvidia as CUDA)
26. Step 4 : Accelerate
#include <armadillo>
#include <mkl.h>
#include <iostream>
#include <cuda.h>
#include <cublas.h>
using namespace std;
using namespace arma;

int main(){
  mat G; vec g;
  // load data, initialize variables, calculate Gamma as before
  // factorize using the LU decomp. routine from LAPACK, as before
  // allocate memory on GPU and transfer data
  // solve on gpu; two steps, solve two triangular systems
  cublasDtrsm(...);
  cublasDtrsm(...);
  // free memory on GPU and transfer data back
}
27. Results 4
● Minimal code changes; recompilation using the nvcc compiler, available by loading any CUDA module on lion-GA (where you'll also need to run)
● We still perform the matrix factorization on the CPU side, and move data to the GPU to perform the solve in two steps
● This overall solution is roughly 6x the single-CPU-thread solution presented previously, for larger data sizes
● General rule of thumb → minimize communication between CPU + GPU; use the GPU when you can occupy all SMPs per device; don't bother for small problems, where the cost of communication outweighs the benefits
● There is ongoing work in porting LAPACK routines to the GPU; e.g., check out our LU/QR work, or the significant MAGMA project from UT/ORNL
● If you're interested in trying CUDA and GPUs further, you could check out HPC Essentials IV
28. Step 5: Shared memory
● We've determined through profiling that it's worthwhile parallelizing our loops
● By linking against Intel MKL we also have access to threaded functions
● We'll simply use OpenMP directive-based programming for this example
● We are generally responsible for deciding which variables need to be shared by threads, and which variables should be privately owned by threads
● If we fail to make these distinctions where needed, we end up with race conditions
– Threads operate on data in an uncoordinated fashion, and data elements have unpredictable/erroneous values
● Outside the scope of this talk, but just as pernicious, is deadlock, when threads (and indeed whole programs) hang due to improper coordination
29. Step 5 : Shared Memory
#include <armadillo>
#include <mkl.h>
#include <iostream>
#include <omp.h>
...
int main(){
  ...
  // load data, initialize variables, calculate Gamma
  #pragma omp parallel for
  for (int i=0; i<m; i++)
    for (int j=0; j<m; j++){
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                               +(y(i)-y(j))*(y(i)-y(j)))/3.33));
    }
  // factorize using the LU decomp. routine from LAPACK
  dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
  // initialize data for solve, for all right hand sides
  #pragma omp parallel for
  for (int i=0; i<n; i++)
    for (int j=0; j<m; j++)
      g(i,j) = ...
  // multithreaded solve for all RHS
  dgetrs(&trans,&N,&n,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
  // assemble predictions
30. Results 5
● In linking, you must specify -fopenmp if using the GNU compiler, or -openmp for Intel
● At runtime, you need to export the environment variable OMP_NUM_THREADS set to the desired number of threads
● Setting this number beyond the total number of cores you have access to will result in severe performance degradation
● Outside the scope of this talk, but you often need to tune CPU affinity for best performance
● For more information, please check out HPC Essentials II
31. Step 6 : Distributed Memory
● A good motivation for moving to distributed memory is, in a simple case, a shortage of memory on a single node
● From a practical perspective, scheduling distributed CPU cores is easier than shared memory cores, i.e., your PBS queuing time is shorter :)
● We will use the Message Passing Interface (MPI), a venerable standard developed over the last 20 years or so, with language bindings for C and Fortran
● On the clusters, we use OpenMPI (not to be confused with OpenMP); once you load the module, by using the wrapper compilers, compilation and linking paths are taken care of for you
● Aside from needing to link with other libraries like Intel MKL, compiling and linking a C++ MPI program can be as simple as:
module load openmpi
mpic++ my_program.cpp
32. Step 6 : Distributed Memory
#include <armadillo>
#include <mkl.h>
#include <iostream>
#include <mpi.h>

int main(int argc, char * argv[]){
  int rank, size;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  // size == total processes in this MPI_COMM_WORLD pool
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  // rank == my identifier in pool
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // load data, initialize variables, calculate Gamma, perform factorization
  // solve just for my portion of predictions
  int lower = (rank * n) / size;
  int upper = ((rank+1) * n) / size;  // exclusive upper bound
  for (int i=lower; i<upper; i++){
    g.rows(0,m-1) = ...
    dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
    pred(i,0) = dot(z,g.rows(0,m-1));
    …
  }
  // gather results back to root process
}
33. Results 6
● When you run an MPI job using PBS, you need to use the mpirun script to set up your environment and spawn processes on the different CPUs allocated to you by the scheduler:
mpirun my_application.x
● Here we simply divided the output space between the different processors, i.e., each processor in the pool calculated a portion of the predictions
● However, a collective call was needed (not shown) after the solve steps: a gather statement to bring all the results to the root process (with rank 0)
● This was the only communication between the different processes throughout the calculation, i.e., this was close to embarrassingly parallel → no communication, great scaling with processors
● Despite the high bandwidths available on modern networks, the cost of latency is generally the limiting factor in using distributed memory parallelism
● For more on MPI you could check out HPC Essentials III
34. Review
● Let's review some of the things we've discussed
● I'll splash up several scenarios and we'll attempt to score them
35. Score Card
Score What this feels like Your HPC vehicle
+5 Civilized society Something German
+4 Evening with friends American Muscle
+3 Favorite show renewed A Honda
+2 Twinkies are back Sonata
+1 A fairy gets its wings Camry
0 meh Corolla
-1 A fairy dies Neon
-2 Twinkies are gone Pinto
-3 Favorite show canceled LeBaron
-4 Evening with Facebook Yugo
-5 Zombie Apocalypse Abrams tank
36. Scenario 1
● You get an account for hammer, maybe install
and use Exceed onDemand, load and use the
Matlab/Octave module after logging in
37. Scenario 1
● Score : 0
● Meh. You'll run a little faster and probably have
more memory, but this isn't HPC and you could
almost do this on your laptop. You're driving a
Corolla, doing 45 mph in the fast lane.
38. Scenario 2
● You vectorize your loops and/or create a
compiled MEX (Matlab) or OCT (Octave)
function
39. Scenario 2
● Score : +1
● A fairy gets its wings! You move up to the Camry!
● By vectorizing loops you use internal functions that are
interpreted once at runtime, and under the hood may even
get to utilize the vector architecture of the CPU.
● Tricky loops, e.g. those with conditionals, are best converted
to MEX/OCT functions; for Octave you want the
mkoctfile utility
● If compiling new functions, don't forget to link with HPC
libraries, e.g. Intel MKL or AMD ACML, where possible.
40. Scenario 3
● Instead of submitting a PBS job you do all this
on the head node of a batch cluster
41. Scenario 3
● Score : -1
● A fairy dies! You drive a Neon at 35 mph in the HPC fast lane!
● Things could be worse for you, but using memory and CPU on head
nodes can grind processes like parallel filesystems to a halt, making
other users and sys admins feel downright melancholy. Screens
freeze, commands return at the speed of pitchblende.
● If you need dedicated resources and/or to run for more than a few
minutes, please use an interactive cluster or PBS :
https://rcc.its.psu.edu/user_guides/system_utilities/pbs/
42. Scenario 4
● You use Armadillo to port your Matlab/Octave
code to C++, and use version control to
manage your project (e.g. SVN, Git/GitHub)
43. Scenario 4
● Score : +2
● Twinkies are back! You think Hyundai finally has it
together and splash out on the Sonata!
● Vectorized Octave/Matlab code is hard to beat.
However, you may wish to scale outside the node
someday, integrate into an existing C++ project, or
perhaps use rich C++ objects (found in Boost, for
example), so this is the way to go. Actually, there are
myriad reasons.
● Don't forget to compile first with '-Wall -g' options;
then, when it's working and you get the right answer,
optimize!
44. Scenario 5
● You port your Matlab/Octave code to C++
without use of libraries or version control
45. Scenario 5
● Score : -2
● No Twinkies! You drive a Pinto that bursts into flames
immediately!
● Reinventing the wheel is a very bad, time-consuming
idea. Armadillo uses expression templates to create
very efficient code at compile time; without it you
could end up with an inefficient mess.
● Neglect to use version control and you will surely
regret it, probably right around a publication
deadline too. And while we're on the topic, please
back up your data.
46. Scenario 6
● You target sections of your version-controlled
C++ code for acceleration, after understanding
it better by profiling with valgrind --tool=callgrind
47. Scenario 6
● Score : +3
● Futurama is back! You get a new Civic!
● Believe the hype: GPUs are here to stay and will
accelerate many algorithms, especially linear algebra.
● Take advantage of libraries like CUBLAS before rolling
your own code, and check in at the CUDA Zone to see what
applications and code examples already exist. Get familiar
with CUDA; we are an Nvidia CUDA Research Center :
https://research.nvidia.com/content/penn-state-crc-summary
48. Scenario 7
● Your non-version-controlled C++ code has bad
memory access patterns, leaks memory, and creates
many temporaries.
● Score : -3
● Bye-bye Futurama! Hello LeBaron!
● Ignore good memory and cache access patterns at
your peril
● Use valgrind (default) and valgrind --tool=cachegrind
to learn more. Avoid temporaries by using libraries
like Armadillo, or by learning and using expression
templates.
49. Scenario 8
● Scenario 6 and you introduce shared memory parallelism using
OpenMP. You look into and tune CPU affinity.
● Score : +4
● You provide Babette's feast for your friends and elicit a
penchant for the Ford Mustang.
● OpenMP is relatively easy, e.g. a pragma around a for loop.
● Don't forget to check thread performance with valgrind
--tool=helgrind
● Now your code is a thing of beauty: properly version controlled,
profiled completely (well, you could run massif as well), and
you're able to use all the compute hardware in a single
heterogeneous node.
50. Scenario 9
● Scenario 7 AND you decide to thrash disk. Plus you try to
write >= 1M files
● Score : -4
● The Yugo is only cool in that Portlandia bit, and Facebook was
only good for a brief period in 2006.
● Disk I/O kills in an HPC context; plus, the maximum file limit at
the time of writing is 1M
– You give control to the kernel and your application ceases to
execute for some time (a voluntary context switch)
– You might be contending for disk with other processes
– You introduce the lower memory bandwidth (BW) and higher
latency (Delta) of disk versus system memory
– Parallel filesystems → all of the above, plus network BW and Delta
51. Scenario 10
● Scenario 8 AND you decide to scale outside the
node with MPI. You look into design patterns; the
GoF book is on the nightstand.
● Score : +5
● You are a cultured individual and you drive a
German vehicle. You care about engineering.
● Don't forget Amdahl's law
● Even with InfiniBand networks, minimize communication,
and consider new paradigms in distributed memory
parallelism (check out MPI revision 3).
52. Scenario 11
● Scenario 9 and you do it all on the head node,
including OpenMP for 1% of your loops. You also
export OMP_NUM_THREADS=20 when you have
only 10 cores. There's no coordination between
threads: races all over the place. You have
about 40 MPI processes trying to read the same
file as well, without parallel file I/O.
53. Scenario 11
● Score : -5
● The end is nigh and you're taking out zombies and
HPC infrastructure in your Abrams tank, moving at
1 mph, getting 0.2 miles to the gallon
● You ignored all the other advice, and now you throw
out Amdahl's law too.
● AND you have no coordination between any of your
threads or processes.
● AND you're trying to run more threads and processes
than the system can support concurrently, so context
switching takes place furiously.
● Expect a not-so-rosy email from sys admin :-)
54. Summary
● High performance computing is leveraging one or more
forms of parallelism in a performant way
● Often the best gains come from writing vectorized Octave
code, or from making algorithmic changes
● Before you parallelize, fully profile your code and keep
Amdahl's law in mind
● All forms of parallelism have their limitations, but in
general:
– GPU accelerators are excellent for linear algebra
– Shared memory using OpenMP works well for simple, nested
loops
– Consider using MPI (distributed memory) for 'big data', but
limit communication