Automatic Task-based Code Generation for High Performance DSEL
Joel Falcou
Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques like Domain Specific Embedded Languages (DSEL) try to solve. In previous work, we investigated the design of such a DSEL, NT2, which provides a Matlab-like syntax for parallel numerical computations inside a C++ library.
The main issue addressed here is how the limitations of classical DSEL generation and multithreaded code generation can be overcome.
Specifying function specializations over an arbitrary set of type constraints is a daunting task in C++ as soon as those constraints become more complex and/or grow in number. Various idioms are traditionally used to solve this problem, for example SFINAE or Tag Dispatching.
This talk introduces Boost.Dispatch, an infrastructure library that makes Tag Dispatching easier to use and maintain by providing a protocol to define Tags and the relationships between them, to map an arbitrary set of tags to a given function implementation, and to extend said list of specializations in an open, modular way. The main new asset of Boost.Dispatch is the ability to use categorization of function properties and/or architectural information to guide the dispatch, in addition to the more traditional use of type properties.
The talk will quickly sketch what SFINAE, overloading and Tag Dispatching mean in C++ and what their limitations are. We'll introduce Boost.Dispatch through examples ranging from simple library design to actual high performance computing code that uses the library to select the best implementation of a function based on non-trivial, architecture-dependent information. Then we'll dive into the implementation of the library and try to sketch the upcoming challenges yet to be solved.
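As a rough illustration of the Tag Dispatching idiom this abstract refers to, here is a minimal sketch in plain standard C++17. It does not use Boost.Dispatch and does not reflect its API; the tag names (`scalar_tag`, `integer_tag`, `floating_tag`) and the functions `describe` and `describe_impl` are hypothetical names chosen for the example.

```cpp
#include <string>
#include <type_traits>

// Hypothetical tag hierarchy (illustrative names, not Boost.Dispatch tags).
// Inheritance encodes "is a more specific kind of".
struct scalar_tag {};
struct integer_tag : scalar_tag {};
struct floating_tag : scalar_tag {};

// Map a type to its most specific tag.
template <typename T>
using tag_of = std::conditional_t<
    std::is_integral_v<T>, integer_tag,
    std::conditional_t<std::is_floating_point_v<T>, floating_tag, scalar_tag>>;

// Generic fallback: selected for any tag that derives from scalar_tag
// when no more specific overload exists.
template <typename T>
std::string describe_impl(T, scalar_tag) { return "generic scalar"; }

// Specialization for integers only.
template <typename T>
std::string describe_impl(T, integer_tag) { return "integer"; }

// Public entry point: compute the tag, then let overload resolution
// pick the best implementation.
template <typename T>
std::string describe(T value) { return describe_impl(value, tag_of<T>{}); }
```

With these definitions, `describe(42)` selects the `integer_tag` overload, while `describe(1.5)` falls back to the `scalar_tag` overload because `floating_tag` converts to its base class. Adding a new specialization only requires adding an overload, which is the open, modular extension the abstract describes.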
(Costless) Software Abstractions for Parallel Architectures
Joel Falcou
Performing large, intensive or non-trivial computations on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is borne out by the large number of tools, languages and libraries available for such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to template meta-programming based libraries such as Blitz++ or Eigen. While each of these libraries provides good performance or good abstraction, none of them seems to fit the needs of so many different types of users.
Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This talk explores various software design techniques, like Generative Programming, MetaProgramming and Generic Programming, and their application to the implementation of a parallel computing library in such a way that:
- abstraction and expressiveness are maximized
- efficiency overhead is minimized
We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.
HDR Defence - Software Abstractions for Parallel Architectures
Joel Falcou
Performing large, intensive or non-trivial computations on array-like data
structures is one of the most common tasks in scientific computing, video game
development and other fields. This is borne out by the large number
of tools, languages and libraries available for such tasks. If we restrict ourselves to
C++ based solutions, more than a dozen such libraries exist, from BLAS/LAPACK
C++ bindings to template meta-programming based libraries such as Blitz++ or Eigen.
While each of these libraries provides good performance or good abstraction, none of
them seems to fit the needs of so many different types of users. Moreover, as parallel
system complexity grows, maintaining all those components quickly
becomes unwieldy. This thesis explores various software design techniques, like
Generative Programming, MetaProgramming and Generic Programming, and their
application to the implementation of various parallel computing libraries in such a
way that abstraction and expressiveness are maximized while efficiency overhead is
minimized.
The Goal and The Journey - Turning back on one year of C++14 Migration
Joel Falcou
C++14 has been announced as the best thing since sliced bread in terms of simplicity, performance and overall elegance of C++ code. This talk is the story of why and how we decided to migrate one of our old 'modern C++' software libraries -- BSP++, a C++ implementation of the BSP parallel programming model -- to C++14.
More than just a collection of 'use this' or 'do that' mottos, this talk will reflect on:
- why one should consider migrating to C++14 now
- which features actually helped and which ones did not
- the traps and pitfalls compilers tried to pull on us
1. Software Abstractions for Parallel Architectures
Joel Falcou
LRI - CNRS - INRIA
HDR Thesis Defense
12/01/2014
2. The Paradigm Change in Science
From Experiments to Simulations
Simulation is now an integral part of the Scientific Method
Scientific Computing enables larger, faster, more accurate Research
Fast Simulation is Time Travel as scientific results are now more readily available
Local Galaxy Cluster Simulation - Illustris project
Computing is first and foremost a mainstream science tool
3. The Paradigm Change in Science
The Parallel Hell
Heat Wall: Growing cores instead of GHz
Hierarchical and heterogeneous parallel systems are the norm
The Free Lunch is over as hardware complexity rises faster than the average developer's skills
Local Galaxy Cluster Simulation - Illustris project
The real challenge in HPC is the Expressiveness/Efficiency War
4. The Expressiveness/Efficiency War
[Figure: three Performance vs. Expressiveness plots. Single Core Era: C/Fortran, C++, Java. Multi-Core/SIMD Era: Sequential, SIMD, Threads. Heterogeneous Era: Sequential, GPU, Phi, SIMD, Threads, Distributed.]
As parallel system complexity grows, the expressiveness gap turns into an ocean
5. Designing tools for Scientific Computing
Objectives
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
6. Designing tools for Scientific Computing
Our Approach
Design tools as C++ libraries (1)
Design these libraries as Domain Specific Embedded Languages (DSEL) (2+3)
Use Parallel Programming Abstractions as parallel components (4)
Use Generative Programming to deliver performance (5)
8. Why Parallel Programming Models?
Limits of regular tools
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Contribute to the Expressiveness Gap
9. Why Parallel Programming Models?
Available Models
Performance centric: P-RAM, LOG-P, BSP
Pattern centric: Futures, Skeletons
Data centric: HTA, PGAS
11. Bulk Synchronous Parallelism [Valiant, McColl 90]
Principles
Machine Model
Execution Model
Analytic Cost Model
[Figure: BSP Execution Model. Processors P0 to P3 run through Superstep T and Superstep T+1; each superstep is made of a Compute phase (Wmax), a Comm phase (h.g) and a Barrier (L).]
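The analytic cost model named on this slide can be made concrete with a small sketch. It assumes the usual BSP reading of the figure's labels: a superstep costs Wmax + h.g + L, where Wmax is the largest local computation time, h the largest number of words sent or received by any processor, g the communication cost per word and L the barrier cost. The type and function names (`bsp_machine`, `superstep`, `superstep_cost`) are illustrative and are not taken from BSP++.

```cpp
#include <algorithm>
#include <vector>

// Machine parameters of the BSP model (illustrative struct).
struct bsp_machine {
  double g;  // communication cost per word
  double L;  // cost of the barrier synchronization
};

// What each processor does during one superstep.
struct superstep {
  std::vector<double> work;   // local computation time per processor
  std::vector<double> words;  // words sent or received per processor
};

// Largest element of a vector, or 0 if it is empty.
inline double max_or_zero(const std::vector<double>& v) {
  return v.empty() ? 0.0 : *std::max_element(v.begin(), v.end());
}

// Cost of one superstep: Wmax + h * g + L.
inline double superstep_cost(const bsp_machine& m, const superstep& s) {
  const double w_max = max_or_zero(s.work);
  const double h = max_or_zero(s.words);
  return w_max + h * m.g + m.L;
}
```

The cost of a whole BSP program is then the sum of `superstep_cost` over its supersteps, which is what makes it possible to reason about performance before running anything.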
12. Bulk Synchronous Parallelism [Valiant, McColl 90]
Advantages
Simple set of primitives
Implementable on any kind of hardware
Possibility to reason about BSP programs
[Figure: BSP Execution Model, repeated from the previous slide.]
13. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support a compositionnal semantic
Applications become composition of state-less functions
14. Parallel Skeletons [Cole 89]
Classical Skeletons
Data parallel: map, fold, scan
Task parallel: par, pipe, farm
More complex: Distributable Homomorphism, Divide and Conquer, …
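As a reminder of what the three data parallel skeletons compute, here is a sequential sketch built on standard algorithms. The function names are illustrative; a parallel implementation would keep the same semantics and split the range across workers.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// map: applies a function to every element independently.
inline std::vector<int> map_square(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::transform(v.begin(), v.end(), out.begin(), [](int x) { return x * x; });
    return out;
}

// fold: reduces all elements to one value with an associative operation.
inline int fold_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// scan: computes every partial result of the fold (inclusive prefix sum).
inline std::vector<int> scan_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}
```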
15. Relevance to our Objectives
Why use Parallel Skeletons?
Write code independent of parallel programming minutiae
Composability supports hierarchical architectures
Code is scalable and easy to maintain
Why use BSP?
Cost model guides development
Few primitives mean that intellectual burden is low
Good medium for developing skeletons
How to ensure performance of those models’ implementations?
16. Domain Specific Embedded Languages
Domain Specific Languages
Non-Turing complete declarative languages
Solve a single type of problems
Express what to do instead of how to do it
E.g: SQL, M, M, …
From DSL to DSEL [Abrahams 2004]
A DSL incorporates domain-specific notation, constructs, and abstractions as fundamental design considerations.
A Domain Specific Embedded Language (DSEL) is simply a library that meets the same criteria
Generative Programming is one way to design such libraries
17. Generative Programming [Eisenecker 97]
[Figure: a Domain Specific Application Description is fed to a Generative Component, made of a Translator and Parametric Sub-components, which produces a Concrete Application.]
18. Meta-programming as a Tool
Definition
Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.
Meta-programmable Languages
MetaOCaml : runtime code generation via code quoting
Template Haskell : compile-time code generation via templates
C++ : compile-time code generation via templates
C++ meta-programming
Relies on the Turing-complete C++ sub-language
Handles types and integral constants at compile-time
Classes and functions act as code quoting
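A minimal illustration of the points above, assuming C++11: templates let the compiler compute with integral constants and with types. `factorial` and `if_` are textbook examples written for this sketch, not part of any library discussed here.

```cpp
#include <type_traits>

// Integral constant computed at compile time through recursive instantiation.
template <unsigned N>
struct factorial {
    static constexpr unsigned long long value = N * factorial<N - 1>::value;
};

// Base case of the recursion.
template <>
struct factorial<0> {
    static constexpr unsigned long long value = 1;
};

// Types are compile-time values too: pick one of two types from a condition.
template <bool Condition, class Then, class Else>
struct if_ { using type = Then; };

template <class Then, class Else>
struct if_<false, Then, Else> { using type = Else; };

// Both checks are evaluated by the compiler; no code runs at execution time.
static_assert(factorial<5>::value == 120, "5! is computed at compile time");
static_assert(std::is_same<if_<(sizeof(char) == 1), float, double>::type, float>::value,
              "type selected at compile time");
```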
19. The Expression Templates Idiom
Principles
Relies on extensive operator overloading
Carries semantic information around code fragments
Introduces DSLs without disrupting dev. chain
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);
expr<assign
    ,expr<matrix>
    ,expr<plus
         ,expr<cos
              ,expr<matrix>
              >
         ,expr<multiplies
              ,expr<matrix>
              ,expr<matrix>
              >
         >
    >(x,a,b);
[Figure: AST of the statement. The root = has children x and +; + has children cos (applied to a) and * (applied to b and a).]
#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i))
           + ( b(j,i)
             * a(j,i)
             );
  }
}
Arbitrary transforms applied on the meta-AST
General Principles of Expression Templates
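The idiom can be sketched in a few dozen lines. This is a deliberately small, one-dimensional version written for illustration: the names `et::vec`, `et::binary` and `et::cos_node` are assumptions of this sketch, and the code is neither the NT2 nor the Boost.Proto implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

namespace et {

// CRTP base: tags a type as a node of the expression tree.
template <class E>
struct expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Terminal node: owns its data. Assigning an expression runs one fused loop.
struct vec : expr<vec> {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    template <class E>
    vec& operator=(const expr<E>& e) {
        const E& tree = e.self();
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = tree[i];
        return *this;
    }
};

struct plus_       { static double apply(double a, double b) { return a + b; } };
struct multiplies_ { static double apply(double a, double b) { return a * b; } };

// Binary node: keeps references to its children and evaluates lazily.
// The references are only valid inside the full expression that built them.
template <class Op, class L, class R>
struct binary : expr<binary<Op, L, R>> {
    const L& lhs;
    const R& rhs;
    binary(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return Op::apply(lhs[i], rhs[i]); }
};

// Unary node for the cosine function.
template <class E>
struct cos_node : expr<cos_node<E>> {
    const E& arg;
    explicit cos_node(const E& a) : arg(a) {}
    double operator[](std::size_t i) const { return std::cos(arg[i]); }
};

// Operators build tree nodes instead of computing values.
template <class L, class R>
binary<plus_, L, R> operator+(const expr<L>& l, const expr<R>& r) {
    return binary<plus_, L, R>(l.self(), r.self());
}

template <class L, class R>
binary<multiplies_, L, R> operator*(const expr<L>& l, const expr<R>& r) {
    return binary<multiplies_, L, R>(l.self(), r.self());
}

template <class E>
cos_node<E> cos(const expr<E>& e) { return cos_node<E>(e.self()); }

}  // namespace et
```

With `et::vec x(n), a(n), b(n);`, the statement `x = cos(a) + (b * a);` first builds a value of type `binary<plus_, cos_node<vec>, binary<multiplies_, vec, vec>>`, then walks it once per element inside `operator=`. No temporary vector is allocated, and the assignment loop is the single place where a parallel or SIMD code generator would plug in.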
20. The Expression Templates Idiom
Advantages
Generic implementation becomes self-aware of optimizations
API abstraction level is arbitrarily high
Accessible through high-level tools like Boost.Proto
21. Our Contributions
Our Strategy
Applies DSEL generation techniques to parallel programming
Maintains low cost of abstractions through meta-programming
Maintains abstraction level via modern library design
Our contributions
Tools     Pub.       Scope                  Applications
Quaff     ParCo’06   MPI Skeletons          Real-time 3D reconstruction
SkellPU   PACT’08    Skeleton on Cell BE    Real-time Image processing
BSP++     IJPP’12    MPI/OpenMP BSP         Bioinformatics, Model Checking
NT2       JPDC’14    Data Parallel Matlab   Fluid Dynamics, Vision
22. Example of BSP++ Application
Khaled Hamidouche PhD 2008-2011, in collaboration with Univ. Brasilia
BSP Smith-Waterman
SW computes DNA sequence alignments
BSP++ implementation was written once and ran on 7 different hardware platforms
Efficiency of 95+% even on a 6000-core supercomputer
Platform               MaxSize      # Elements       Speedup   GCUPs
cluster (MPI)          1,072,950    128 cores        73x       6.53
cluster (MPI/OpenMP)   1,072,950    128 cores        116x      10.41
OpenMP                 1,072,950    16 cores         16x       0.40
CellBE                 85,603       8 SPEs           —         0.14
cluster of CellBEs     85,603       24 SPEs (8:24)   2.8x      0.37
Hopper (MPI)           5,303,436    3072 cores       260x      3.09
Hopper (MPI+OpenMP)    24,894,269   6144 cores       5664x     15.5
23. Second Look at our Contributions
Development Limitations
DSELs are mostly tied to the domain model
Architecture support is often an afterthought
Extensibility is difficult as many refactorings are required per architecture
Example : No proper way to support GPUs with those implementation techniques
Proposed Method
Extends Generative Programming to take the target architecture into account
Provides an architecture description DSEL
Integrates this description in the code generation process
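One way to picture an architecture description taking part in code generation is tag dispatching over a hierarchy of architecture tags. The sketch below is an assumption-laden illustration: the tags `cpu_`, `sse2_`, `avx_` and the function `select_kernel` are made up for this example and are not the actual NT2 or Boost.Dispatch interface.

```cpp
#include <string>

// Hypothetical architecture tags. Inheritance encodes "is a refinement of",
// so a specific architecture can fall back to a more generic implementation.
struct cpu_ {};
struct sse2_ : cpu_ {};
struct avx_ : sse2_ {};

// One implementation per architecture that deserves a special version.
inline std::string kernel(cpu_)  { return "scalar loop"; }
inline std::string kernel(sse2_) { return "128-bit SIMD loop"; }

// Generic entry point: overload resolution picks the closest match at
// compile time, without any runtime test on the architecture.
template <class Arch>
std::string select_kernel(Arch arch = Arch()) {
    return kernel(arch);
}
```

Here `select_kernel(avx_())` resolves to the `sse2_` version because no AVX-specific overload exists and `sse2_` is the nearest base of `avx_`. Supporting a new architecture then means adding a tag and, where useful, an overload, rather than refactoring existing code.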
26. Software refactoring
Tools        Issues                      Changes
Quaff        Raw skeletons API           Re-engineered as part of NT2
SkellPU      Too architecture specific   Re-engineered as part of NT2
BSP++        Integration issues          Integrate hybrid code generation
NT2          Not easily extendable       Integrate Quaff Skeleton models
Boost.SIMD   -                           Side product of NT2 restructuring
Conclusion
Skeletons are fine as parallel middleware
Model based abstractions are not high level enough
For low level architectures, the simplest model is often the best
27. Boost.SIMD
Pierre Estérie PhD 2010-2014
Principles
Provides a simple C++ API over SIMD extensions
Supports every Intel, PPC and ARM instruction set
Fully integrates with modern C++ idioms
Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. Wang
29. The Numerical Template Toolbox
Pierre Estérie PhD 2010-2014
NT2 as a Scientific Computing Library
Provides a simple, Matlab-like interface for users
Provides high-performance computing entities and primitives
Is easily extendable
Components
Uses Boost.SIMD for in-core optimizations
Uses recursive parallel skeletons
Supports task parallelism through Futures
30. The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities
500+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work
Take a .m file, copy to a .cpp file
Add #include <nt2/nt2.hpp> and make cosmetic changes
Compile the file and link with libnt2.a
34. NT2 - From Matlab to C++
Matlab code
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1');
rms = sqrt(sum(sqr(A1(:) - A2(:))) / numel(A1));
NT2 code
table<double> A1 = _(1.,1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X = lu(mtimes(A1, trans(A1)));
double rms = sqrt(sum(sqr(A1(_) - A2(_))) / numel(A1));
36. Parallel Skeletons extraction process
A = B / sum(C+D);
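[Figure: the AST of the statement, in which the sum(C+D) sub-tree is tagged as a fold and the enclosing assignment as a transform, is split into two statements: tmp = sum(C+D), executed as a fold, and A = B / tmp, executed as a transform.]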
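Written by hand, the result of the extraction would look like the sketch below. It assumes one-dimensional arrays, so that `sum` reduces to a scalar, and it uses standard algorithms as sequential stand-ins for the fold and transform skeletons; `divide_by_sum` is a name chosen for this example.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Hand-written equivalent of: A = B / sum(C + D)
inline std::vector<double> divide_by_sum(const std::vector<double>& B,
                                         const std::vector<double>& C,
                                         const std::vector<double>& D)
{
    // fold: tmp = sum(C + D). The element-wise addition is fused into the
    // reduction, so no temporary array holding C + D is ever built.
    const double tmp = std::inner_product(C.begin(), C.end(), D.begin(), 0.0,
                                          std::plus<double>(),   // reduction
                                          std::plus<double>());  // C[i] + D[i]

    // transform: A = B / tmp, one independent operation per element.
    std::vector<double> A(B.size());
    std::transform(B.begin(), B.end(), A.begin(),
                   [tmp](double b) { return b / tmp; });
    return A;
}
```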
37. From data to task parallelism
Antoine Tran Tan PhD, 2012-2015
Limits of the fork-join model
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism
Poor data locality and no inter-statement optimization
Skeletons from the Future
Adapt current skeletons for taskification
Use Futures (or HPX) to automatically pipeline
Derive a dependency graph between statements
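The dependency graph idea can be sketched with standard C++11 futures, used here as a stand-in for whatever runtime the generated code targets. The statement A = B / sum(C + D) becomes two tasks, and the only synchronization is the data dependency on `tmp`. The function name is illustrative and one-dimensional arrays are assumed.

```cpp
#include <algorithm>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

inline std::vector<double> divide_by_sum_tasks(const std::vector<double>& B,
                                               const std::vector<double>& C,
                                               const std::vector<double>& D)
{
    // Task 1 (fold): tmp = sum(C + D), started asynchronously.
    std::future<double> tmp = std::async(std::launch::async, [&C, &D] {
        return std::inner_product(C.begin(), C.end(), D.begin(), 0.0,
                                  std::plus<double>(), std::plus<double>());
    });

    // Task 2 (transform): A = B / tmp. Calling tmp.get() is the dependency
    // edge; this task blocks only until task 1 has produced its value.
    std::future<std::vector<double>> A =
        std::async(std::launch::async, [&B, &tmp] {
            const double s = tmp.get();
            std::vector<double> out(B.size());
            std::transform(B.begin(), B.end(), out.begin(),
                           [s](double b) { return b / s; });
            return out;
        });

    // The caller waits only when it actually needs the result.
    return A.get();
}
```

Compared with a fork-join version, no global barrier separates the two statements, so independent statements elsewhere in the program could run while these two tasks are in flight.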
39. Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
[Figure: the statement is split into two tasks linked by a dependency: tmp = sum(C+D), a fold, followed by A = B / tmp, a transform.]
41. Motion Detection
Lacassagne et al., ICIP 2009
Sigma-Delta algorithm based on background subtraction
Uses a local Gaussian model of lightness variation to detect motion
Challenge: Very low arithmetic density
Challenge: Integer-based implementation with small range
42. Motion Detection
table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance
                       )
{
  // Estimate Raw Movement
  background = selinc( background < frame
                     , seldec( background > frame, background )
                     );
  table<char> diff = dist(background, frame);
  // Compute Local Variance
  table<char> sig3 = muls(diff,3);
  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec( variance > sig3, variance )
                            )
                    , variance
                    );
  // Generate Movement Label
  return if_zero_else_one( diff < variance );
}
44. Black and Scholes Option Pricing
NT2 Code
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  // d1 = (ln(S/X) + (r + v^2/2) T) / (v sqrt(T))
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta) / (va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
45. Black and Scholes Option Pricing
NT2 Code with loop fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges loop nests and increases cache locality
  tie(da,d1,d2,R) = tie( sqrt(Ta)
                       , (log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta) / (va*da)
                       , d1 - va*da
                       , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                       );
  return R;
}
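What loop fusion buys can be seen by writing the fused computation by hand in scalar C++. This is a sketch under assumptions, not the code NT2 generates: it uses `std::vector<float>` instead of `table<float>`, the textbook form of d1 (logarithm inside the numerator), and a `normcdf` helper defined here from `std::erfc`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal cumulative distribution function, expressed with erfc.
inline float normcdf(float x) {
    return 0.5f * std::erfc(-x / std::sqrt(2.0f));
}

// All four statements are evaluated in a single pass over the data, so the
// intermediates da, d1 and d2 live in registers instead of temporary arrays.
inline std::vector<float> blackscholes_fused(const std::vector<float>& Sa,
                                             const std::vector<float>& Xa,
                                             const std::vector<float>& Ta,
                                             const std::vector<float>& ra,
                                             const std::vector<float>& va)
{
    std::vector<float> R(Sa.size());
    for (std::size_t i = 0; i < Sa.size(); ++i) {
        const float da = std::sqrt(Ta[i]);
        const float d1 = (std::log(Sa[i] / Xa[i])
                          + (va[i] * va[i] * 0.5f + ra[i]) * Ta[i])
                         / (va[i] * da);
        const float d2 = d1 - va[i] * da;
        R[i] = Sa[i] * normcdf(d1)
             - Xa[i] * std::exp(-ra[i] * Ta[i]) * normcdf(d2);
    }
    return R;
}
```

The unfused version runs one loop per statement and streams every intermediate table through memory; the fused loop reads each input once and writes only the result, which is where the gain in cache locality comes from.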
46. Black and Scholes Option Pricing
Performance
[Figure: bar chart of cycles/value for a problem size of 1,000,000, comparing scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores. Speedup labels on the plot: x1.89, x2.91, x5.58, x6.30.]
47. Black and Scholes Option Pricing
Performance with loop fusion/futurisation
[Figure: bar chart of cycles/value for a problem size of 1,000,000, comparing scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores. Speedup labels on the plot: x2.27, x4.13, x8.05, x11.12.]
51. Conclusion
Parallel Computing for Scientists
Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.
Like regular languages, DSELs need information about the hardware system
Integrating hardware descriptions as Generic components increases tool portability and re-targetability
Our Achievements
A new method for parallel software development
Efficient libraries working on a large subset of hardware
High level of performance across a wide application spectrum
53. Works in Progress
Application to Accelerators
Exploration of proper skeleton implementation on GPUs
Adaptation of Future based code generator
In progress with Ian Masliah’s PhD thesis
Parallelism within C++
SIMD as part of the standard library
Proposal N3571 for standard SIMD computation
Interoperability with current parallel model of C++
54. Perspectives
DSEL as C++ first class idiom
Build partial evaluation into the language
Ease transition between regular and meta C++
Mid-term Prospect: MetaOCaml-like quoting for C++
DSEL and compilers relationship
C++ DSELs hit a limit on their applicability
Compilers often lack high-level information for proper optimization
Mid-term Prospect: Hybrid library/compiler approaches for DSEL