This document discusses using source-to-source compilers to address high performance computing challenges. It presents a source-to-source approach using the ROSE compiler infrastructure to generate programming models for heterogeneous computing via directives, enable performance optimization through autotuning, and improve resilience by adding source-level redundancy. Preliminary results are shown for accelerating common kernels like AXPY, matrix multiplication, and Jacobi iteration using a prototype source-to-source OpenMP compiler called HOMP to target GPUs.
Survey of Program Transformation Technologies - Chunhua Liao
The first workshop on the conceptualization of a Software Institute for Abstractions and Methodologies for HPC Simulation Codes on Future Architectures. Dec. 10, 2012, Chicago, IL, USA
A hands-on introduction to the ELF object file format - rety61
In our sixth semester we developed miASMa, a two-pass macro assembler for an x86 machine. miASMa generates relocatable object files that conform to the ELF format.
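The two-pass structure mentioned above is easy to sketch. The following toy assembler (an illustration of the general technique, not miASMa's actual code or instruction set) builds a symbol table in pass one and resolves label references, including forward references, in pass two:

```python
# Minimal two-pass assembler sketch for a toy ISA: pass 1 records label
# addresses, pass 2 resolves them and emits "machine code" bytes.

OPCODES = {"jmp": 0x01, "nop": 0x02}

def assemble(lines):
    # Pass 1: build the symbol table by walking the source once.
    symbols, addr = {}, 0
    for line in lines:
        line = line.strip()
        if line.endswith(":"):            # label definition
            symbols[line[:-1]] = addr
        elif line:
            addr += 2                     # every instruction is 2 bytes here
    # Pass 2: emit code, resolving forward references via the symbol table.
    code = []
    for line in lines:
        line = line.strip()
        if not line or line.endswith(":"):
            continue
        parts = line.split()
        op = OPCODES[parts[0]]
        operand = symbols[parts[1]] if len(parts) > 1 else 0
        code += [op, operand]
    return code

print(assemble(["start:", "jmp end", "nop", "end:", "nop"]))  # [1, 4, 2, 0, 2, 0]
```

The two passes are exactly what makes forward references like `jmp end` resolvable: `end` is unknown when the jump is first seen, but known by the time code is emitted.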
**Return-oriented programming** (ROP) is a sophisticated attack technique that is essentially a generalization of *return-to-libc* attacks, which in turn belong to the family of *stack buffer overflow exploits*.
If none of that means anything to you, don't worry: the talk first explains the basics of buffer overflows and their attack potential, illustrated with a few historical examples, before building the bridge to **ROP** step by step. It closes with a brief look at several countermeasures, evaluated for practicality and effectiveness.
If the demo gods are willing, a sample program will be cracked live using **ROP** tools.
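The starting point of such ROP tools can be sketched in a few lines. The snippet below is a simplified illustration only; real gadget finders such as ROPgadget additionally disassemble the candidates to keep only valid instruction sequences. It scans a code segment for byte sequences ending in the x86 `ret` opcode (0xC3):

```python
# Toy first step of ROP: scan an executable segment for short byte
# sequences ending in `ret` (0xC3 on x86), the raw material for gadgets.

def find_gadget_offsets(code: bytes, depth: int = 3):
    """Return (offset, gadget_bytes) pairs for every suffix ending in ret."""
    gadgets = []
    for i, b in enumerate(code):
        if b == 0xC3:                       # ret
            for back in range(1, depth + 1):
                start = i - back
                if start >= 0:
                    gadgets.append((start, code[start:i + 1]))
    return gadgets

# pop rax; pop rdi; ret; nop; ret  (hand-assembled example bytes)
segment = bytes([0x58, 0x5F, 0xC3, 0x90, 0xC3])
for off, gadget in find_gadget_offsets(segment):
    print(hex(off), gadget.hex())
```

An attacker chains the addresses of such gadgets on the stack so that each `ret` transfers control to the next gadget, executing attacker-chosen logic without injecting any code.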
Combining Phase Identification and Statistic Modeling for Automated Parallel ... - Mingliang Liu
Parallel application benchmarks are indispensable for evaluating and optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, and reconfigure, and often plainly inaccessible due to security or ownership concerns. This work contributes APPrime, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication and I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPrime benchmarks: they retain the original applications' performance characteristics, in particular their relative performance across platforms. The resulting benchmarks, already released online, are also much more compact and easier to port than the original applications.
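The two core ideas, phase identification and statistical regeneration of event parameters, can be illustrated in miniature. The sketch below is an assumed simplification for illustration only, not APPrime's actual algorithm: it collapses a repetitive event trace into phases, then replays a phase with synthetic parameters drawn from a statistical fit of the observed ones instead of storing every recorded value.

```python
# (1) Collapse a repetitive trace into (event, repeat_count) phases.
# (2) Regenerate per-event parameters from summary statistics.
import random
from itertools import groupby
from statistics import mean, stdev

def identify_phases(trace):
    """Collapse consecutive identical events into (event, count) phases."""
    return [(ev, len(list(grp))) for ev, grp in groupby(trace)]

def regenerate(sizes, n, rng):
    """Draw n synthetic message sizes from a normal fit of the observed ones."""
    mu, sigma = mean(sizes), stdev(sizes)
    return [max(0, rng.gauss(mu, sigma)) for _ in range(n)]

trace = ["send"] * 4 + ["write"] * 2 + ["send"] * 4
print(identify_phases(trace))        # [('send', 4), ('write', 2), ('send', 4)]

observed_sizes = [1024, 980, 1100, 1010]
print(regenerate(observed_sizes, 3, random.Random(0)))
```

The payoff is the compactness claimed in the abstract: a benchmark needs only the phase structure plus a few distribution parameters, not the full trace.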
http://dl.acm.org/citation.cfm?id=2745876
eBPF Debugging Infrastructure - Current Techniques - Netronome
eBPF (extended Berkeley Packet Filter), in particular with its driver-level hook XDP (eXpress Data Path), has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This talk will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the use of disassembly to inspect generated assembly code and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
The talk will also explore where the current gaps in debugging infrastructure are and suggest some of the next steps to improve this, for example, integrations with tools such as strace, valgrind or even the LLDB debugger.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors - Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
FISL XIV - The ELF File Format and the Linux Loader - John Tortugo
These are the slides from a lecture I gave at the XIV International Board on Free Software. The lecture gave a brief overview of the ELF specification (a document describing the format of executables, shared libraries, and relocatable object files used in Linux and many other operating systems) and of the Linux dynamic loader (a program that works together with the OS to create and initialize a program's address space, among other tasks).
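As a taste of what the ELF specification defines, the sketch below parses the start of the `e_ident` array that opens every ELF file; the field layout follows the specification, while the helper function itself is illustrative:

```python
# Every ELF file begins with a 16-byte e_ident array: the magic \x7fELF,
# then class (32/64-bit) and data-encoding (endianness) flags that tell
# the loader how to interpret the rest of the header.

def parse_ident(header: bytes):
    magic, ei_class, ei_data = header[:4], header[4], header[5]
    if magic != b"\x7fELF":
        raise ValueError("not an ELF file")
    return {
        "class": {1: "ELF32", 2: "ELF64"}[ei_class],
        "endian": {1: "little", 2: "big"}[ei_data],
    }

# Fake e_ident bytes for a 64-bit little-endian object (as on x86-64 Linux).
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + b"\x00" * 8
print(parse_ident(ident))   # {'class': 'ELF64', 'endian': 'little'}
```

The same two flags are what the dynamic loader consults first when mapping a shared library into a process's address space.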
This slide deck focuses on the eBPF JIT compilation infrastructure and the important role it plays in the eBPF life cycle inside the Linux kernel. The kernel first runs a number of control-flow checks to reject unsafe programs, and then JIT-compiles the eBPF program to host or offload-target instructions, which boosts performance. There is little documentation on this topic, which this slide deck dives into.
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine - Yuuki Takano
http://ytakano.github.io/
http://ieeexplore.ieee.org/document/7740332/
Uniform resource locator (URL) filtering is a fundamental technology for intrusion detection, HTTP proxies, content distribution networks, content-centric networks, and many other application areas. Some applications adopt URL filtering to protect user privacy from malicious or insecure websites. Some web browser extensions, such as AdBlock Plus, provide a URL-filtering mechanism against sites that intend to steal sensitive information.
Unfortunately, these extensions are implemented inefficiently, resulting in slow applications that consume a great deal of memory. Although AdBlock Plus provides a domain-specific language (DSL) to represent URLs, it internally uses regular expressions and does not take advantage of the DSL's benefits. In addition, the number of filter rules has become large, which makes matters worse.
In this paper, we propose the fast uniform resource identifier-specific filter, a domain-specific pseudo-machine for the DSL, to dramatically improve the performance of such browser extensions. Compared with a conventional implementation that internally relies on regular expressions, our proof-of-concept implementation is fast and has a small memory footprint.
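To illustrate why a domain-specific matcher can outperform a pile of regular expressions, the sketch below (an assumed simplification, not FARIS's actual pseudo-machine) stores blocked domains in a reversed-label trie, so a lookup touches at most one node per host-name label instead of trying every rule in turn:

```python
# Blocked domains go into a trie keyed by reversed labels
# (ads.example.com is stored as com -> example -> ads), so matching a host
# is O(number of labels) regardless of how many rules are loaded.

def build_trie(domains):
    root = {}
    for d in domains:
        node = root
        for label in reversed(d.split(".")):
            node = node.setdefault(label, {})
        node["$"] = True                        # marks a blocked suffix
    return root

def is_blocked(host, trie):
    node = trie
    for label in reversed(host.split(".")):
        if "$" in node:
            return True                         # a parent domain is blocked
        node = node.get(label)
        if node is None:
            return False
    return "$" in node

trie = build_trie(["ads.example.com", "tracker.net"])
print(is_blocked("img.ads.example.com", trie))  # True
print(is_blocked("example.com", trie))          # False
```

A regex-based implementation would instead scan the URL against each compiled pattern, so its cost grows with the rule count, which is exactly the problem described above.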
This presentation discusses the concept of the dynamic linker as implemented in Linux and its porting to the Solaris ARM platform. It starts from the very basics of the linking process.
eBPF Tooling and Debugging Infrastructure - Netronome
eBPF, in particular with its driver-level hook XDP, has increased in importance over the past few years. As a result, the ability to rapidly debug and diagnose problems is becoming more relevant. This session will cover common issues faced and techniques to diagnose them, including the use of bpftool for map and program introspection, the disassembling of programs to inspect generated eBPF instructions and other methods such as using debug prints and how to apply these techniques when eBPF programs are offloaded to the hardware.
Measuring the time spent in small individual fractions of program code is a common technique for analysing performance behavior and detecting performance bottlenecks. The benefits of the approach include detailed individual attribution of performance and understandable feedback loops when experimenting with different code versions. There are, however, severe pitfalls when following this approach that can lead to vastly misleading results. Modern dynamic compilers use complex optimisation techniques that take a large part of the program into account, so there can be unexpected side effects when combining different code snippets or even when running a presumably unrelated part of the code. This talk will present performance paradoxes with examples from the domain of dynamic compilation of Java programs. Furthermore, it will discuss an alternative approach to modelling code performance characteristics that takes the challenges of complex optimising compilers into account.
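The pitfall can be shown in miniature. The talk's examples come from Java's dynamic compiler; the Python sketch below transfers the lesson to interpreter and cache warm-up effects: a single naive measurement folds one-time costs into the result, which is why benchmark harnesses repeat the measurement and report a distribution rather than a single number.

```python
# Naive one-shot timing vs. a repeated measurement of the same statement.
# The one-shot number includes warm-up noise (allocator, caches, branch
# predictors); timeit.repeat amortizes it and exposes the spread.
import timeit

setup = "data = list(range(10_000))"
stmt = "sorted(data)"

one_shot = timeit.timeit(stmt, setup=setup, number=1)
best_of_repeats = min(timeit.repeat(stmt, setup=setup, number=100, repeat=5)) / 100

print(f"one-shot: {one_shot:.2e}s, best-of-repeats: {best_of_repeats:.2e}s")
```

Under a JIT the effect is even stronger than in this sketch, since the measured code may be recompiled, inlined, or deoptimized mid-measurement depending on what else has run.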
A short introduction to ROP, a fundamental technique in computer security exploitation, and to the defenses deployed against it. Useful links for further research are included at the end.
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Today's HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am currently on a brief research visit in the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
OpenPOWER Acceleration of HPCC Systems - HPCC Systems
JT Kellington of IBM and Allan Cantle of Nallatech present at the 2015 HPCC Systems Engineering Summit Community Day on porting HPCC Systems to the POWER8-based ppc64el architecture.
In this video from the 2015 Stanford HPC Conference, Pavel Shamis from ORNL presents: Preparing OpenSHMEM for Exascale.
"OpenSHMEM is a partitioned global address space (PGAS) one-sided communications library that enables remote memory access (RMA) across processing elements (PEs). Its API allows data to be transferred from one PE's memory space to another PE's symmetric memory space, decoupling data transfers from synchronization. OpenSHMEM is useful for applications that are latency-driven or that have irregular communication patterns, because its one-sided API can be mapped very efficiently to hardware (e.g. RDMA interconnects), and its one-sided programming model helps overlap communication with computation. Summit is Oak Ridge National Laboratory's next high-performance supercomputer, which will be based on a many-core/GPU hybrid architecture. To prepare OpenSHMEM for future systems, it is important to enhance its programming model to enable efficient utilization of new hardware capabilities (e.g. massively multithreaded systems, access to different types of memory, and next-generation interconnects). This session will present recent advances in the area of OpenSHMEM extensions, implementations, and tools."
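The one-sided semantics described above can be sketched in miniature. The toy below fakes a symmetric region with a single-process shared buffer; it illustrates put semantics only and is not the OpenSHMEM API, which maps such operations onto RDMA hardware across real processing elements:

```python
# A "put" writes directly into the target PE's symmetric region without
# the target issuing any receive call: that is the essence of one-sided
# communication, and why it overlaps so well with computation.
from multiprocessing import shared_memory

symmetric = shared_memory.SharedMemory(create=True, size=8)   # stand-in for PE 1's symmetric heap

def shmem_put(dest, offset, data):
    """One-sided write: the 'remote' side takes no action at all."""
    dest.buf[offset:offset + len(data)] = data

shmem_put(symmetric, 0, b"exascale")
print(bytes(symmetric.buf[:8]))    # b'exascale'

symmetric.close()
symmetric.unlink()
```

The synchronization that OpenSHMEM deliberately decouples from the transfer (barriers, fences, quiet operations) would be issued separately, only when ordering actually matters.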
Watch the video: http://insidehpc.com/2015/02/video-preparing-openshmem-for-exascale/
See more talks in the Stanford HPC Conference Video Gallery: http://wp.me/P3RLHQ-dOO
This presentation by Stanislav Donets (Lead Software Engineer, Consultant, GlobalLogic, Kharkiv) was delivered at GlobalLogic Kharkiv C++ Workshop #1 on September 14, 2019.
In this talk were covered:
- Graphics Processing Units: Architecture and Programming (theory).
- Scratch Example: Barnes-Hut n-Body Algorithm (practice).
Conference materials: https://www.globallogic.com/ua/events/kharkiv-cpp-workshop/
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. That scale alone can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools for successfully porting applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. Its true strengths are:
Scalability – it is affordable because it is open source and uses commodity hardware for reliable distribution.
Schema on read – you can afford to save everything in raw form.
Data is better than algorithms – more data with a simple algorithm can be much more meaningful than less data with a complex algorithm.
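The schema-on-read idea above is easy to demonstrate in miniature: store records raw, and impose (and coerce) structure only at query time, so differently-shaped records can coexist in the same store.

```python
# Schema on read: the store keeps raw lines untouched; a query applies
# its own schema, coercing messy fields at read time.
import json

raw_store = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "7", "referrer": "mail"}',   # messy: clicks as a string
]

def read_with_schema(lines):
    for line in lines:
        rec = json.loads(line)
        # The schema lives here, not in the store: pick fields, coerce types.
        yield {"user": rec["user"], "clicks": int(rec["clicks"])}

print(sum(r["clicks"] for r in read_with_schema(raw_store)))   # 10
```

Because nothing was discarded at write time, a later query can define a different schema (say, one that keeps `referrer`) over the very same raw data.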
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 - Deanna Kosaraju
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30:00 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately applicable to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimal number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, chosen from the job profile so that the bottleneck resource is maximally utilized.
The problem at hand is thus to define a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
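The provisioning idea can be sketched with an assumed toy model (not the talk's actual framework): from a job profile of per-resource demand, choose the number of resource sets so that the job's bottleneck resource, rather than any other, is what determines the allocation.

```python
# Toy resource-set sizing: the resource needing the most sets to cover
# its demand is the bottleneck, and it dictates the allocation.
import math

def resource_sets_needed(job_demand, per_set_capacity):
    """Ceil(demand / capacity) per resource; the maximum is the answer."""
    return max(
        math.ceil(job_demand[r] / per_set_capacity[r])
        for r in job_demand
    )

demand = {"cpu_core_hours": 400, "disk_gb": 900, "network_gb": 120}
capacity_per_set = {"cpu_core_hours": 64, "disk_gb": 100, "network_gb": 40}

print(resource_sets_needed(demand, capacity_per_set))   # 9: disk is the bottleneck
```

With 9 sets, disk is fully utilized while CPU and network run below capacity, which is exactly why a framework would next tune MapReduce parameters around the bottleneck rather than provision uniformly.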
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... - Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy-driven demands for storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its approaches to managing, curating, sharing, delivering, and preserving large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership is the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed, and report on project progress.
Understanding Globus Data Transfers with NetSage - Globus
NetSage is an open, privacy-aware network measurement, analysis, and visualization service designed to help end users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks worldwide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to that of the Globus users?
Enhancing Research Orchestration Capabilities at ORNL - Globus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
How to Position Your Globus Data Portal for Success: Ten Good Practices - Globus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Navigating the Metaverse: A Journey into Virtual Evolution - Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
Prosigns: Transforming Business with Tailored Technology Solutions - Prosigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Enhancing Project Management Efficiency: Leveraging AI Tools like ChatGPT - Jay Das
With the advent of artificial intelligence (AI) tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT and Bard, organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Troubleshooting 9 Types of OutOfMemoryError - Tier1 app
Although 'java.lang.OutOfMemoryError' appears at surface level to be one single error, there are in fact nine underlying types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
1. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

A Source-To-Source Approach to HPC Challenges
LLNL-PRES-668361
2. Lawrence Livermore National Laboratory 2
Agenda
Background
• HPC challenges
• Source-to-source compilers
Source-to-source compilers to address HPC
challenges
• Programming models for heterogeneous computing
• Performance via autotuning
• Resilience via source level redundancy
Conclusion
Future Interests
3. HPC Challenges

HPC: High Performance Computing
• DOE applications: climate
modeling, combustion, fusion
energy, biology, material
science, etc.
• Network connected (clustered)
computation nodes
• Increasing node complexity
Challenges
• Parallelism, performance,
productivity, portability
• Heterogeneity, power
efficiency, resilience
J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (10^18 FLOPS) computing. Co-HPC '14
4. Compilers: traditional vs. source-to-source

Traditional: Source code → Frontend (parsing) → Middle End (analysis & transformation & optimizations) → Backend (code generation) → Machine code

Source-to-source: Source code → Frontend (parsing) → Middle End (analysis & transformation & optimizations) → Backend (unparsing) → Source code
5. Why Source-to-Source?

Traditional compilers (GCC, ICC, LLVM, ...) vs. source-to-source compilers (ROSE, Cetus, ...):

• Goal — Traditional: use compiler technology to generate fast, efficient executables for a target platform. Source-to-source: make compiler technology accessible by enabling users to build customized source code analysis and transformation compilers/tools.
• Users — Traditional: programmers. Source-to-source: researchers and developers in the compiler, tools, and application communities.
• Input — Both: programs in multiple languages.
• Internal representation — Traditional: multiple levels of intermediate representation, losing source code structure and information. Source-to-source: abstract syntax tree, keeping all source-level structure and information (comments, preprocessing info, etc.), extensible.
• Analyses & optimizations — Traditional: internal use only, a black box to users. Source-to-source: individually exposed to users, composable and customizable.
• Output — Traditional: machine executables, hard to read. Source-to-source: human-readable, compilable source code.
• Portability — Traditional: multiple code generators for multiple platforms. Source-to-source: a single code generator for all platforms, leveraging traditional backend compilers.
6. ROSE Compiler Infrastructure

(Figure: C/C++/Fortran and OpenMP/UPC source code is parsed by the EDG front-end or the Open Fortran Parser into the IR (AST); analyses such as control flow, data dependence, and system dependence operate on the AST; the unparser emits analyzed/transformed source code, which a backend compiler turns into an executable. ROSE-based source-to-source tools are built on this infrastructure. 2009 award winner. www.roseCompiler.org)
7. ROSE IR = AST + symbol tables + interface

(Figure: simplified AST of the code below, shown without symbol tables.)
int main()
{
int a[10];
for(int i=0;i<10;i++)
a[i]=i*i;
return 0;
}
ROSE AST (abstract syntax tree)
• High level: preserves all source-level details
• Easy to use: rich set of interface functions for analysis and transformation
• Extensible: user-defined persistent attributes
8. Using ROSE to help address HPC challenges
Parallel programming model research
• MPI, OpenMP, OpenACC, UPC, Co-Array Fortran, etc.
• Domain-specific languages, etc.
Program optimization/translation tools
• Automatic parallelization to migrate legacy codes
• Outlining and parameterized code generation to support autotuning
• Resilience via code redundancy
• Application skeleton extraction for Co-Design
• Reverse computation for discrete event simulation
• Bug seeding for testing
Program analysis tools
• Programming styles and coding standard enforcement
• Error checking for serial, MPI and OpenMP programs
• Worst execution time analysis
9. Case 1: high-level programming models for heterogeneous computing
Problem
• Heterogeneity challenge of exascale
• Naïve approach: CPU code + accelerator code + data transfers in between
Approach
• Develop high-level, directive-based programming models
— Automatically generate low-level accelerator code
• Prototype and extend the OpenMP 4.0 accelerator model
— Result: HOMP (Heterogeneous OpenMP) compiler
J. A. Ang, et al. Abstract machine models and proxy architectures for exascale (10^18 FLOPS) computing. Co-HPC '14
10. Programming models bridge algorithms and machines, and are implemented through components of the software stack
Measures of success:
• Expressiveness
• Performance
• Programmability
• Portability
• Efficiency
•…
(Figure: an algorithm is implemented as an application, expressed in a programming model; the software stack — language, compiler, library — compiles/links it against an abstract machine into an executable that runs on a real machine.)
11. Parallel programming models are built on top of sequential ones and use a combination of language/compiler/library support
(Figure: abstract machines, overly simplified, and the corresponding software stacks.)
• Sequential — abstract machine: one CPU with memory; stack: a general-purpose language (GPL, e.g. C/C++/Fortran), a sequential compiler, optional sequential libraries.
• Shared memory (e.g. OpenMP) — abstract machine: multiple CPUs sharing memory; stack: GPL + directives, a sequential compiler with OpenMP support, an OpenMP runtime library.
• Distributed memory (e.g. MPI) — abstract machine: nodes (CPU + memory) connected by an interconnect; stack: GPL + calls to MPI libraries, a sequential compiler, an MPI library.
12. OpenMP Code
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main(void)
{
#pragma omp parallel
printf("Hello, world!\n");
return 0;
}
…
//PI calculation
#pragma omp parallel for reduction (+:sum) private (x)
for (i=1;i<=num_steps;i++)
{
x=(i-0.5)*step;
sum=sum+ 4.0/(1.0+x*x);
}
pi=step*sum;
..
(Figure: executing an OpenMP program involves the programmer, the compiler, the runtime library, and OS threads.)
13. OpenMP 4.0's accelerator model
Device: a logical execution engine
• Host device: where OpenMP program
begins, one only
• Target devices: 1 or more accelerators
Memory model
• Host data environment: one
• Device data environment: one or more
• Allow shared host and device memory
Execution model: host-centric
• Host device: "offloads" code regions and data to target devices
• Target devices: fork-join model
• Host waits until devices finish
• Host executes device regions if no accelerators are available/supported
(Figure: the OpenMP 4.0 abstract machine — a host device and one or more target devices, each with processors and its own memory, with optional shared storage between host and device memories.)
14. Computation and data offloading
Directive: #pragma omp target device(id) map() if()
• target: create a device data environment and offload the enclosed computation to run on that device (sequentially, unless parallelized by nested directives)
• device(int_exp): specify a target device
• map(to|from|tofrom|alloc: var_list): map data between the current task's data environment and a device data environment
Directive: #pragma omp target data device(id) map() if()
• Create a device data environment only
(Figure: a CPU thread encounters omp target; accelerator threads execute the offloaded omp parallel region in fork-join style; control then returns to the CPU thread.)
15. Example code: Jacobi
while ((k<=mits)&&(error>tol))
{
// a loop copying u[][] to uold[][] is omitted here
…
#pragma omp parallel for private(resid,j,i) reduction(+:error)
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++)
{
resid = (ax*(uold[i-1][j] + uold[i+1][j])
+ ay*(uold[i][j-1] + uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid ;
} // the rest code omitted ...
}
16. Building the accelerator support
Language: OpenMP 4.0 directives and clauses
• target, device, map, collapse, target data, …
Compiler:
• CUDA kernel generation: outliner
• CPU code generation: kernel launching
• Loop mapping (1-D vs. 2-D), loop transformation (normalize, collapse)
• Array linearization
Runtime:
• Device probing, configuring execution settings
• Loop scheduler: static-even, round-robin
• Memory management: data allocation, copying, deallocation, reuse
• Reduction: contiguous reduction pattern (folding)
17. Example code: Jacobi using OpenMP accelerator directives
#pragma omp target data device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m]) map(tofrom: u[0:n][0:m]) map(alloc: uold[0:n][0:m])
while ((k<=mits)&&(error>tol))
{
// a loop copying u[][] to uold[][] is omitted here
…
#pragma omp target device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) map(tofrom: u[0:n][0:m])
#pragma omp parallel for private(resid,j,i) reduction(+:error)
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++)
{
resid = (ax*(uold[i-1][j] + uold[i+1][j])
+ ay*(uold[i][j-1] + uold[i][j+1]) + b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid;
} // the rest of the code omitted ...
}
(The target data region sits at the while-loop level so device buffers are allocated once and reused across iterations; each target region then offloads one sweep.)
19. CPU code: data management, execution configuration, kernel launching, reduction
xomp_deviceDataEnvironmentEnter(); /* Initialize a new data environment, push it onto a stack */
float *_dev_u = (float*) xomp_deviceDataEnvironmentGetInheritedVariable((void*)u, _dev_u_size);
/* If not inheritable, allocate and register the mapped variable */
if (_dev_u == NULL)
{ _dev_u = (float *) xomp_deviceMalloc(_dev_u_size);
  /* Register this mapped variable: original address, device address, size, and a copy-back flag */
  xomp_deviceDataEnvironmentAddVariable((void*)u, _dev_u_size, (void*)_dev_u, true);
}
/* Execution configuration: threads per block and total number of blocks */
int _threads_per_block_ = xomp_get_maxThreadsPerBlock();
int _num_blocks_ = xomp_get_max1DBlock((n - 1) - 1);
/* Launch the CUDA kernel ... */
OUT__1__10117__<<<_num_blocks_, _threads_per_block_, (_threads_per_block_ * sizeof(float))>>>
  (n, m, omega, ax, ay, b, _dev_per_block_error, _dev_u, _dev_f, _dev_uold);
error = xomp_beyond_block_reduction_float(_dev_per_block_error, _num_blocks_, 6);
/* Copy back (optionally) and deallocate variables mapped within this environment, pop the stack */
xomp_deviceDataEnvironmentExit();
20. Preliminary results: AXPY (Y = a*X + Y)
Hardware configuration:
• 4 quad-core Intel Xeon processors (16 cores)
2.27GHz with 32GB DRAM.
• NVIDIA Tesla K20c GPU (Kepler architecture)
(Chart: AXPY execution time in seconds for vector sizes from 5,000 to 500,000,000 floats, comparing Sequential, OpenMP FOR (16 threads), HOMP ACC, PGI OpenACC, and HMPP OpenACC.)
Software configuration:
• HOMP (GCC 4.4.7 and CUDA 5.0)
• PGI OpenACC compiler version 13.4
• HMPP OpenACC compiler version 3.3.3
21. Matrix multiplication

(Chart: matrix multiplication execution time in seconds for matrix sizes 128x128 to 4096x4096 (float), comparing Sequential, OpenMP (16 threads), PGI, HMPP, HOMP, HMPP Collapse, and HOMP Collapse.)
*HOMP-collapse slightly outperforms HMPP-collapse when linearization with round-robin scheduling is used.
22. Jacobi

(Chart: Jacobi execution time in seconds for matrix sizes 128x128 to 2048x2048 (float), comparing Sequential, HMPP, PGI, HOMP, OpenMP, HMPP Collapse, and HOMP Collapse.)
23. HOMP: multiple versions of Jacobi

(Chart: Jacobi execution time in seconds for matrix sizes 128x128 to 2048x2048 (float), comparing HOMP-generated versions:)
• first version
• target-data (reuses device data across iterations)
• loop collapse using linearization with static-even scheduling
• loop collapse using 2-D mapping (16x16 block)
• loop collapse using 2-D mapping (8x32 block)
• loop collapse using linearization with round-robin scheduling
24. Case 1: conclusion

Many fundamental source-to-source translations
• CUDA kernel generation, loop transformation, data management, reduction, loop scheduling, etc.
• Reusable across CPU and accelerator
Quick & competitive results:
• 1st implementation in the world in 2013
— Interactions with IBM, TI and GCC compiler teams
• Early Feedback to OpenMP committee
— Multiple device support
— Per device default loop scheduling policy
— Productivity: combined directives
— New clause: no-middle-sync
— Global barrier: #pragma omp barrier [league|team]
• Extensions to support multiple GPUs in 2015
25. Case 2: performance via whole-program autotuning
Problem: it is increasingly difficult to conduct performance optimization for large-scale applications
• Hardware: multicore, GPU, FPGA, Cell, Blue Gene
• Empirical optimization (autotuning) has so far been limited to small kernels and domain-specific libraries
Our objectives:
• Research into autotuning for optimizing large scale
applications
— sequential and parallel
• Develop tools to automate the life-cycle of autotuning
— Outlining, parameterized transformation (loop tiling, unrolling)
26. An end-to-end whole-program autotuning system
(Figure: a performance tool, via a tool interface, feeds performance info to code triage, which focuses kernel extraction on hot kernels; parameterized translation generates kernel variants over a search space that a search engine explores. Kernel extraction and parameterized translation are the ROSE-based components.)
27. Kernel extraction: outlining
Outlining:
• Form a function from a code segment (the outlining target)
• Replace the code segment with a call to the new function (the outlined function)
A critical step for the success of an end-to-end autotuning system
• Feasibility:
— whole-program tuning -> kernel tuning
• Relevance:
— Ensure semantic correctness
— Preserve performance characteristics
28. Example: Jacobi using a classic algorithm
After outlining (classic algorithm):
// .... some code is omitted
void OUT__1__4027__(int *ip__, int *jp__, double omega,
double *errorp__, double *residp__, double ax, double ay,
double b)
{
for (*ip__=1;*ip__<(n-1);(*ip__)++)
for (*jp__=1;*jp__<(m-1);(*jp__)++)
{
*residp__ = (ax * (uold[*ip__-1][*jp__] +
uold[*ip__+1][*jp__]) + ay * (uold[*ip__][*jp__-1] +
uold[*ip__][*jp__+1]) + b * uold[*ip__][*jp__] -
f[*ip__][*jp__])/b;
u[*ip__][*jp__] = uold[*ip__][*jp__] - omega * (*residp__);
*errorp__ = *errorp__ + (*residp__) * (*residp__);
}
}
void main()
{
int i,j;
double omega,error,resid,ax,ay,b;
//...
OUT__1__4027__(&i,&j,omega,&error,&resid,ax,ay,b);
error = sqrt(error)/(n*m);
//...
}
Before outlining (original code):
int n,m;
double u[MSIZE][MSIZE], f[MSIZE][MSIZE], uold[MSIZE][MSIZE];
void main()
{
int i,j;
double omega,error,resid,ax,ay,b;
// initialization code omitted here
// ...
for (i=1;i<(n-1);i++)
for (j=1;j<(m-1);j++) {
resid = (ax * (uold[i-1][j] + uold[i+1][j])+ ay * (uold[i][j-1]
+ uold[i][j+1])+ b * uold[i][j] - f[i][j])/b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid*resid;
}
error = sqrt(error)/(n*m);
// verification code omitted here ...
}
• Algorithm: all variables are made parameters; written variables are accessed via pointer dereferencing
• Problem: 8 parameters, plus pointer dereferences for the written variables in C
• Disturbs performance
• Impedes compiler analysis
29. Our effective outlining algorithm
Rely on source-level code analysis
• Scope information: global, local, or in-between
• Liveness analysis: whether, at a given point, a variable's value will be used in the future
• Side effects: variables being read or written
Decide two things about parameters
• Minimum number of parameters to be passed
• Minimum pass-by-reference parameters
Parameters = (AllVars − InnerVars − GlobalVars − NamespaceVars) ∩ (LiveInVars ∪ LiveOutVars)
PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars)
30. Eliminating pointer dereferences
Classic algorithms handle pass-by-reference variables with pointer dereferences.
We use a novel method: variable cloning
• Check whether such a variable is used by address:
— C: &x; C++: T &y = x; or foo(x) when foo(T&)
• Use a clone variable if x is NOT used by address
CloneCandidates = PassByRefParameters ∩ PointerDereferencedVars
CloneVars = (CloneCandidates − UseByAddressVars) ∩ AssignableVars
CloneVarsToInit = CloneVars ∩ LiveInVars
CloneVarsToSave = CloneVars ∩ LiveOutVars
31. Classic vs. effective outlining

Classic algorithm:
void OUT__1__4027__(int *ip__, int *jp__, double omega, double *errorp__, double *residp__, double ax, double ay, double b)
{
for (*ip__=1;*ip__<(n-1);(*ip__)++)
for (*jp__=1;*jp__<(m-1);(*jp__)++)
{
*residp__ = (ax * (uold[*ip__-1][*jp__] +
uold[*ip__+1][*jp__]) + ay * (uold[*ip__][*jp__-1] +
uold[*ip__][*jp__+1]) + b * uold[*ip__][*jp__] -
f[*ip__][*jp__])/b;
u[*ip__][*jp__] = uold[*ip__][*jp__] - omega *
(*residp__);
*errorp__ = *errorp__ + (*residp__) *
(*residp__);
}
}
Our effective algorithm:
void OUT__1__5058__(double omega, double *errorp__, double ax, double ay, double b)
{
int i, j; /* neither live-in nor live-out*/
double resid ; /* neither live-in nor live-out */
double error ; /* clone for a live-in and live-out
parameter */
error = * errorp__; /* Initialize the clone*/
for (i = 1; i < (n - 1); i++)
for (j = 1; j < (m - 1); j++) {
resid = (ax * (uold[i - 1][j] + uold[i + 1][j]) +
ay * (uold[i][j - 1] + uold[i][j + 1]) +
b * uold[i][j] - f[i][j]) / b;
u[i][j] = uold[i][j] - omega * resid;
error = error + resid * resid;
}
*errorp__ = error; /* Save value of the clone*/
}
33. Whole program autotuning example

The outlined SMG2000 kernel (simplified version)
for (si = 0; si < stencil_size; si++)
for (k = 0; k < hypre__mz; k++)
for (j = 0; j < hypre__my; j++)
for (i = 0; i < hypre__mx; i++)
rp[ri + i + j * hypre__sy3 + k * hypre__sz3] -=
Ap_0[i + j * hypre__sy1 + k * hypre__sz1 + A->data_indices[m][si]] *
xp_0[i + j * hypre__sy2 + k * hypre__sz2 + dxp_s[si]];
SMG2000 (a semicoarsening multigrid solver)
• 28k lines of C code, stencil computation
• one kernel accounts for ~45% of execution time on a 120x120x129 data set
Tools: HPCToolkit (Rice) and the GCO search engine (UTK)
35. Case 2: conclusion

Performance optimization via whole-program tuning
• Key components: outliner, parameterized transformation (skipped in this talk)
• ROSE outliner: improved algorithm based on source code analysis
— Preserves performance characteristics
— Facilitates other tools in further processing the kernels
Difficult to automate
• Code triage: The right scope of a kernel, specification
• Search space: Optimization/configuration for each kernel
36. Case 3: resilience via source-level redundancy
Exascale systems are expected to use a large number of components running at low power
• Exa = 10^18; at ~2 GHz (10^9) per core, that implies on the order of 10^9 cores
• Increased sensitivity to external events
• Transient faults caused by external events
— Faults that do not stop execution but silently corrupt results
— As opposed to permanent faults
Streamlined and simple processors
• Cannot afford pure hardware-based resilience
• Solution: software-implemented hardware fault tolerance
37. Approach: compiler-based transformations to add resilience
Source code annotations (pragmas)
• What to protect
• What to do when things go wrong
Source-to-source translator
• Fault detection: triple-modular redundancy (TMR)
• Implements fault-handling policies
• Vendor compiler: produces the binary executable
• Intel PIN: fault injection
Challenges
Overhead
Vendor compiler optimizations
(Figure: annotated source code → ROSE::FTTransform → transformed source code → vendor compiler → binary executable → fault injection with Intel PIN.)
39. Source code redundancy: reducing overhead
Relying on a baseline double modular redundancy (DMR) scheme to help reduce overhead. (Figure: the statement to be protected is duplicated.)
40. Source code redundancy: bypassing compiler optimizations
Using an extra pointer to help preserve source-code redundancy. (Figure: the statement to be protected.)
41. Results: necessity and effectiveness of optimizer-proof code redundancy
Transformation method         | PAPI_FP_INS (O1) | PAPI_FP_INS (O2) | PAPI_FP_INS (O3)
Our method of using pointers  | Doubled          | Doubled          | Doubled
Naïve duplication             | No change        | No change        | No change
A hardware counter checks whether the redundant computation survives compiler optimizations
• PAPI (PAPI_FP_INS) collects the number of floating-point instructions for the protected kernel
• GCC 4.3.4, O1 to O3
42. Results: overhead
Jacobi kernel:
• Overhead: 0% to 30%
• Minimum overhead at stride = 8, array size 16Kx16K, double precision
• The more latency the original code has, the less relative overhead the added redundancy incurs
Livermore kernels:
— Kernel 1 (hydro fragment): 20%
— Kernel 4 (banded linear equations): 40%
— Kernel 5 (tri-diagonal elimination): 26%
— Kernel 11 (first sum): 2%
43. Conclusion
HPC challenges
• Traditional: performance, productivity, portability
• Newer: heterogeneity, resilience, power efficiency
• Made much worse by the increasing complexity of compute-node architectures
Source-to-source compilation techniques
• Complement traditional compilers aimed at generating machine code
— Source code analysis, transformation, optimization, generation, ...
• High-level programming models for GPUs
• Performance tuning: code extraction and parameterized code translation/generation
• Resilience via source-level redundancy
44. Future directions
Accelerator optimizations
• Newer accelerator features: direct GPU-to-GPU communication, peer-to-peer execution model, unified virtual memory
• CPU threads vs. accelerator threads vs. vectorization
Knowledge-driven compilation and runtime adaptation
• Explicitly represent optimization knowledge and automatically apply it in HPC
A uniform directive-based programming model
• A single X, instead of MPI+X, to cover both on-node and across-node parallelism
Thread-level resilience
• Most resilience work focuses on MPI, not threading
45. Acknowledgement
Sponsors
• LLNL (LDRD, ASC), DOE (SC), DOD
Colleagues
• Dan Quinlan (ROSE project lead)
• Pei-Hung Lin (Postdoc)
Collaborators
• Mary Hall (University of Utah)
• Qing Yi (University of Colorado at Colorado Springs)
• Robert Lucas (University of Southern California-ISI)
• Sally A. McKee (Chalmers University of Technology)
• Stephen Guzik (Colorado State University)
• Xipeng Shen (North Carolina State University)
• Yonghong Yan (Oakland University)
• ….
Editor's Notes
Parallel scientific computing in climate modeling, weather prediction, energy, material science, etc.
Traditional compilers: GCC, ICC, Open64
Aimed to generate machine executables
Lose high level source code information
Black-box to application developers
Multiple backends for multiple platforms
A source-to-source compiler infrastructure
Generate analyzed and/or transformed source code
Maintain high-level source structure and information (comments, preprocessing info., etc.)
Friendly to application developers: easy to understand
Portable across multiple platforms by leveraging backend compilers
ROSE: making compiler techniques more accessible
Source-to-source compiler infrastructure
High level intermediate representation (Abstract Syntax Tree -AST) with rich traverse/transformation interfaces
Supports C/C++, Fortran, OpenMP, UPC
Encapsulates sophisticated compiler analyses/optimizations into easy-to-use function calls
Intended users: building customized program analysis and translation tools
The ROSE IR consists of the AST tree, a set of associated symbol tables, and a rich set of interface functions.
As a source-to-source compiler framework, the ROSE AST is a high level representation which is designed to preserves source code details, such as token stream, source comments, C preprocessor info and C++ templates.
Rich interface for: AST traversal, query, creation, copy, symbol lookup, Generic analyses, transformations,
It is also extensible, which means users can attach arbitrary information to AST, via a mechanism called persistent attributes.
As we can see in this slide, a program’s source code has a very straightforward and faithful tree representation in ROSE. As a result, ROSE is ideal for quickly building customized analysis and translation tools to support source code analysis and restructuring.
Discover/design/develop building blocks in the software stack
Language level
target, device(), map(), …
num_devices, device_type(), no-middle-sync, …
Compiler level
Pragma parsing
Parser building blocks
AST extensions
CUDA-specific IR nodes
AST translation: SageBuilder and SageInterface functions
Loop normalization, multi-dimensional array linearization
Outliner: CUDA kernel generation
Runtime level
Device feature probe: threads per block, block number…
Memory management: allocation, copy, data tracking/reuse, etc.
Loop scheduling: within and across thread blocks
Reduction: a first two-level algorithm
Use the latest official release for key concepts : one slide primer
Language features
Target regions
Parallel regions and teams
Data mapping
Loop and distribute directives
C bindings: interoperate with C/C++ and Fortran
Conditional bodies: support both CUDA and OpenCL(ongoing working)
* C. Liao, D. J. Quinlan, T. Panas, and B. R. de Supinski, “A ROSE-based OpenMP 3.0 research compiler supporting multiple runtime libraries,” in Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More (IWOMP’10), Springer, 2010, pp. 15–28.
Three points
Start with the traditional parallel for version
Add the first target directive with map clauses
To avoid repetitive data allocation and copying, place target data at the while-loop level
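The hoisted data environment can be sketched as follows. This is a minimal 1-D Jacobi-style skeleton, assuming `REAL` is a double typedef; the pragmas are ignored by a host-only compiler, and in generated code the swap step would also run on the device.

```c
#define N 64
typedef double REAL;   /* assumed typedef, as in the AXPY example */

/* The target data region wraps the whole iteration loop, so u and u_new
 * are allocated on and copied to the device once, instead of once per
 * iteration of the while loop. */
void jacobi(REAL *u, REAL *u_new, int iters) {
    #pragma omp target data map(tofrom: u[0:N]) map(alloc: u_new[0:N])
    {
        int it = 0;
        while (it < iters) {
            #pragma omp target
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
            for (int i = 1; i < N - 1; i++)   /* swap step, kept on host here */
                u[i] = u_new[i];
            it++;
        }
    }
}
```

Without the enclosing target data region, each target construct would re-map both arrays on every iteration.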
4 points:
Outliner extended to generate CUDA kernels
Invoke the runtime scheduler
Array linearization
Inner-block reduction
For the call site: data preparation, execution configuration, launch, and cleanup
Runtime support for tracking the data environment: a stack of data structures storing the data allocated at each level
Runtime data allocation and registration
Execution configuration
CPU-side reduction
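The generated call-site sequence can be sketched like this. All `xomp_*` names and the host-side "launch" are illustrative stand-ins, not the actual runtime entry points; the point is the fixed order of the four steps.

```c
#include <stdlib.h>
#include <string.h>

/* --- illustrative runtime stubs (host-side simulation of device memory) --- */
static void *xomp_device_malloc(size_t n)                   { return malloc(n); }
static void  xomp_memcpy_to(void *d, void *h, size_t n)     { memcpy(d, h, n); }
static void  xomp_memcpy_back(void *h, void *d, size_t n)   { memcpy(h, d, n); }
static void  xomp_device_free(void *d)                      { free(d); }

/* Stand-in for launching the outlined CUDA kernel: y[i] += a * x[i]. */
static void launch_axpy(double *dx, double *dy, int n, double a) {
    for (int i = 0; i < n; i++) dy[i] += a * dx[i];
}

void axpy_callsite(double *x, double *y, int n, double a) {
    /* 1. data preparation */
    double *dx = xomp_device_malloc(n * sizeof(double));
    double *dy = xomp_device_malloc(n * sizeof(double));
    xomp_memcpy_to(dx, x, n * sizeof(double));
    xomp_memcpy_to(dy, y, n * sizeof(double));
    /* 2. execution configuration + 3. launch */
    launch_axpy(dx, dy, n, a);
    /* 4. copy-back and cleanup */
    xomp_memcpy_back(y, dy, n * sizeof(double));
    xomp_device_free(dx);
    xomp_device_free(dy);
}
```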
Points
* Software/hardware configuration
* PGI OpenACC, HMPP OpenACC
An accelerator is not always the better choice
/* REAL is assumed to be a float or double typedef */
void axpy_ompacc(REAL* x, REAL* y, int n, REAL a) {
  int i;
  #pragma omp target device (gpu0) map(tofrom: y[0:n]) map(to: x[0:n],a,n)
  #pragma omp parallel for
  for (i = 0; i < n; ++i)
    y[i] += a * x[i];
}
The PGI collapse version does not give any performance improvement
i-j-k loop order, scalar replacement
PGI collapse does not work with reduction
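The i-j-k ordering with scalar replacement mentioned above looks roughly like this in plain C. The sketch is illustrative (fixed small N, no blocking): the accumulator is scalar-replaced, so C[i][j] is written once per (i,j) instead of once per k iteration.

```c
#define N 4

/* C = A * B with i-j-k loop order and scalar replacement of C[i][j]. */
void matmul_ijk(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double c = 0.0;              /* scalar replacement of C[i][j] */
            for (int k = 0; k < N; k++)
                c += A[i][k] * B[k][j];
            C[i][j] = c;                 /* single store per element */
        }
}
```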
The Heterogeneous OpenMP (HOMP) programming model developed using the building blocks of this project has delivered competitive performance for Jacobi, an important stencil computation kernel. This figure shows the improvement in execution time from incrementally using more advanced compiler and runtime building blocks. The original version uses the minimum set of building blocks; the target-data version adds data reuse; the remaining versions use loop collapsing, 2-D mapping, and different scheduling policies.
Current: device(id), a portability concern
Desired: device_type(), num_devices(), dist_data()
Avoid cumbersome explicit teams
May need a per-device default scheduling policy
omp target parallel: avoid the initial implicit sequential task
Our work is motivated by a grand challenge problem today:
with radically different and diverse platforms such as multicore, GPU, FPGA, etc.,
building conventional compiler optimizations for each of them is no longer an effective solution.
On the other hand, empirical tuning seems to be a very promising approach, since it usually does not rely on accurate machine models.
Unfortunately, most existing autotuning work only handles small kernels or domain-specific libraries.
Thus, the goal of this work is to study what it takes to extend autotuning to general large-scale applications.
In particular, we want to build automated tools to support empirical tuning of whole applications.
------------------
It is difficult to accurately model each of these complex platforms.
Static compiler optimizations: limited success
Software: C/C++/Fortran, MPI, OpenMP, UPC …
source-to-source
A classic algorithm will generate an outlined kernel like this:
Written variables: pass their addresses and use them via pointer dereferences
There are at least two serious problems with the classic outlining algorithm.
One is that it may change the performance characteristics of the transformed application: pointer dereferencing is often more expensive than accessing straightforward non-pointer-typed variables.
The other is that it significantly impedes static analysis, due to the well-known limitations of accurate pointer analysis.
Consequently, kernels with excessive pointer references are hard to further analyze and transform in order to generate multiple variants.
In our experience, none of the SciDAC PERI tools can handle the kernel mentioned above, so this is a serious problem.
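To make the problem concrete, here is what a classically outlined loop looks like; this is a hand-written sketch, not actual ROSE output. Every written variable arrives by address and is accessed through a dereference, which both costs cycles and defeats later static analysis.

```c
/* Classic outlining: written variables (i, sum) are passed by address
 * and dereferenced on every access inside the outlined kernel. */
static void outlined_kernel(int *ip, long *sump, const int *a, int n) {
    for (*ip = 0; *ip < n; (*ip)++)
        *sump += a[*ip];
}

long sum_array(const int *a, int n) {
    int i;
    long sum = 0;
    outlined_kernel(&i, &sum, a, n);   /* call site passes addresses */
    return sum;
}
```

The parameter-minimization scheme described next exists precisely to avoid this pattern where the analysis allows it.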
------------------
Hard to optimize, and difficult for other tools to handle (parameterized translation).
This slide shows the Jacobi program after its loop computation kernel is outlined using the classic algorithm.
As to the parameters, we use variable scope, liveness, and side-effect analysis to minimize the number of parameters to be passed.
Further, we decide how to pass each parameter, either by value or by reference, based on the formulas here.
Modified and live-out variables need their modified values passed back; arrays are added because of language limitations, and class/structure variables for efficiency:
Parameters = ((AllVars − InnerVars − GlobalVars − NamespaceVars − ClassVars) ∩ (LiveInVars ∪ LiveOutVars)) ∪ ClassPointers
PassByRefParameters = Parameters ∩ ((ModifiedVars ∩ LiveOutVars) ∪ ArrayVars ∪ ClassVars)
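As a worked instance of the two formulas, the set algebra can be encoded with bitmasks; the variable names and the example target region are hypothetical, chosen only to exercise each term.

```c
/* Each bit names one variable of a hypothetical target region:
 * i is loop-local (inner), n is read-only (live-in),
 * sum is written and live-out, g is a global. */
enum { V_I = 1, V_N = 2, V_SUM = 4, V_G = 8 };

/* Parameters = ((All - Inner - Global - Namespace - Class)
 *               ∩ (LiveIn ∪ LiveOut)) ∪ ClassPointers */
int parameters(int all, int inner, int global, int ns, int cls,
               int livein, int liveout, int clsptr) {
    return ((all & ~inner & ~global & ~ns & ~cls) & (livein | liveout)) | clsptr;
}

/* PassByRef = Parameters ∩ ((Modified ∩ LiveOut) ∪ Arrays ∪ Class) */
int pass_by_ref(int params, int modified, int liveout, int arrays, int cls) {
    return params & ((modified & liveout) | arrays | cls);
}
```

For the example region, the formulas keep n and sum as parameters (i is inner, g is global) and mark only sum as pass-by-reference, since it is both modified and live-out.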
The outliner first ensures that a target can be correctly outlined.
In conclusion,
manual code specification/translation may be needed
Second-chance: retry until reaching a predefined threshold, then use the next policy
Final-wish: only one chance to handle the fault using the next policy; a statement can be injected before the policy is called
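A minimal sketch of the two policies; the names, signatures, and demo helpers below are illustrative, not the actual framework API.

```c
#include <stdbool.h>

typedef bool (*try_fn)(void);   /* returns true when the region succeeds */

/* Second-chance: retry the faulting region up to a predefined threshold,
 * then fall through to the next policy. */
bool second_chance(try_fn attempt, int threshold, try_fn next_policy) {
    for (int tries = 0; tries < threshold; tries++)
        if (attempt())
            return true;
    return next_policy();
}

/* Final-wish: a single attempt; an injected statement (prologue) may run
 * before the next policy is invoked, e.g. to restore state. */
bool final_wish(try_fn attempt, void (*prologue)(void), try_fn next_policy) {
    if (attempt())
        return true;
    if (prologue)
        prologue();              /* the injected statement */
    return next_policy();
}

/* demo helpers: a region that fails twice, then succeeds */
static int attempts_made = 0;
static bool flaky(void)    { return ++attempts_made >= 3; }
static bool fallback(void) { return true; }
```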
* Jacob Lidman, Daniel J. Quinlan, Chunhua Liao, and Sally A. McKee, “ROSE::FTTransform: A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research,” Fault-Tolerance for HPC at Extreme Scale (FTXS 2012), Boston, June 25–28, 2012.
This overhead is clearly heavily influenced by the memory latency of the original code: the more latency, the better hidden the cost of the duplicated computation. The minimum overhead is observed when the Jacobi iteration uses a non-consecutive stride, a large data set, and a large element size.
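The duplicated computation being measured can be sketched as follows. This is a schematic in the spirit of the source-level redundancy discussed above, not the framework's actual transformation: each computation is performed twice and the results compared, and the extra arithmetic can hide behind the memory latency of the original loads.

```c
/* Source-level redundancy for one assignment: compute twice, compare.
 * A mismatch signals a transient fault in the computation. */
int redundant_add(int a, int b, int *fault) {
    int r1 = a + b;        /* original computation */
    int r2 = a + b;        /* duplicated computation */
    *fault = (r1 != r2);   /* detection check */
    return r1;
}
```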