Parallel Pixie Dust (PPD) is a C++ library that aims to make parallel programming easier and safer by avoiding deadlocks and race conditions. It uses futures and thread pools to schedule work across multiple threads. PPD allows work to be distributed across threads in different models and provides various thread pool types. It handles exceptions thrown across threads to improve debugging of multithreaded code. The library was designed based on an examination of threading patterns in SPEC2006 benchmarks.
Introducing Parallel Pixie Dust: Advanced Library-Based Support for Parallelism in C++
1. Introducing Parallel Pixie Dust:
Advanced Library-Based Support for Parallelism in C++
BCS-DSG April 2011
Jason Mc Guiness, Colin Egan
ppd-bcs-dsg-11@hussar.demon.co.uk
c.egan@herts.ac.uk
http://libjmmcg.sf.net/
April 7, 2011
2. Sequence of Presentation.
An introduction: why I am here.
Why do we need multiple threads?
Multiple processors, multiple cores....
How do we manage all these threads:
libraries, languages and compilers.
My problem statement, an experiment if you will...
An outline of Parallel Pixie Dust (PPD).
Important theorems.
Examples, and their motivation.
Conclusion.
Future directions.
3. Introduction.
Why yet another thread-library presentation?
Because we still find it hard to write multi-threaded programs
correctly.
According to programming folklore.
We haven’t successfully replaced the von Neumann
architecture:
Stored program architectures are still prevalent.
The memory wall still affects us:
The CPU-instruction retirement rate, i.e. the rate at which
programs require and generate data, exceeds the memory
bandwidth - a by-product of Moore's Law.
Modern architectures add extra cores to CPUs, in this
instance, extra memory buses which feed into those cores.
4. How many Threads?
The future appears to be laid out: inexorably more execution units.
For example quad-core processors are becoming more popular.
4-way blade frames with quad cores provide 16 execution units
in close proximity.
picoChip (from Bath, U.K.) has hundreds of execution units
per chip, with up to 16 in a network.
Future supercomputers will have many millions of execution
units, e.g. IBM BlueGene/C Cyclops64.
5. A Quick Review of Some Threading Models:
Restrict ourselves to library-based techniques.
Raw library calls.
The “thread as an active class”.
Co-routines.
The “thread pool” containing those objects.
Producer-consumer models are a sub-set of this model.
The issue: classes used to implement business logic also
implement the operations to be run on the threads. This often
means that the locking is intimately entangled with those
objects, and possibly even the threading logic.
This makes it much harder to debug these applications. How to
implement multi-threaded debuggers correctly is still an open
question.
6. What Is Parallel Pixie Dust (PPD)?
I decided to set myself a challenge, an experiment if you will, which
may be summarized in this problem statement:
Is it possible to create a cross-platform thread library that can
guarantee a schedule that is deadlock & race-condition free, is
efficient, and also assists in debugging the business logic, in a
general-purpose, imperative language that has little or no
native threading support? Moreover the library should offer
considerable flexibility, e.g. reflect the various types of
thread-related costs and techniques.
7. Part 1: Design Features of PPD.
The main design features of PPD are:
It targets general purpose threading using a data-flow model of
parallelism:
This type of scheduling may be viewed as dynamic scheduling
(run-time) as opposed to static scheduling (potentially
compile-time), where the operations are statically assigned to
execution pipelines, with relatively fixed timing constraints.
DSEL implemented as futures and thread pools (of many
different types using traits).
Can be used to implement a tree-like thread schedule.
“thread-as-an-active-class” exists.
8. Part 2: Design Features of PPD.
Extensions to the basic DSEL have been added:
Certain binary functors: each operand is executed in parallel.
Adaptors for the STL collections to assist with thread-safety.
Combined with thread pools allows replacement of the STL
algorithms.
Implemented GSS(k), or baker’s scheduling:
May reduce synchronisation costs on thread-limited systems or
pools in which the synchronisation costs are high.
Amongst other influences, PPD was born out of discussions
with Richard Harris and motivated by Kevlin Henney’s
presentation to ACCU’04 regarding threads.
9. More Details Regarding the Futures.
The use of futures, termed execution contexts, within PPD is
crucial:
This hides the data-related locks from the user, as the future
wraps the retrieval of the data with a hidden lock, the
resultant future-like object behaving like a proxy.
This is an implementation of the split-phase constraint.
They are only stack-based, cannot be heap-allocated and only
const references may be taken.
This guarantees that there can be no aliasing of these objects.
This provides extremely useful guarantees that will be used
later.
They can only be created from the returned object when work
is transferred into the pool.
10. Part 1: The Thread Pools.
Primary method of controlling threads is thread pools, available in
many different types:
Work-distribution models: e.g. master-slave, work-stealing.
Size models: e.g. fixed size, unlimited.....
Threading implementations: e.g. sequential, PThreads, Win32:
The sequential trait means all threading within the library can
be removed, but maintaining the interface. One simple
recompilation and all the bugs left are yours, i.e. your business
model code.
A separation of design constraints. Also side-steps the
debugger issue!
11. Part 2: The Thread Pools.
Contains a special queue that contains the work in strict FIFO
order or user-defined priority order.
Threading models: e.g. the cost of performing threading, i.e.
heavyweight like PThreads or Win32.
Further assist in managing OS threads and exceptions across
the thread boundaries via futures.
i.e. a PRAM programming model.
12. Exceptions.
Passing exceptions across the thread boundary is a thorny issue:
Futures are used to receive any exceptions that may be thrown
by the mutation when it is executed within the thread pool.
If the data passed back from a thread-pool is not retrieved
then the future may throw that exception from its destructor!
The design idea expressed is that the exception is important,
and if something might throw, then the execution context
must be examined to verify that the mutation executed in the
pool succeeded.
13. Ugly Details.
All the traits make the objects look quite complex.
typedefs are used so that the thread pools and execution
contexts are aliased to more convenient names.
It has to contend with undefined behaviour and broken
compilers, for example:
Exceptions and multiple threads are unspecified in the current
standard. Fortunately most compilers “do the right thing” and
implement an exception stack per thread, but this is not
standardized.
Currently, most optimizers in compilers are broken with regard
to code hoisting. This affects the use of atomic counters and
potentially intrinsic locks having code that they are guarding
hoisted around them.
volatile does not solve this, because volatile has been too
poorly specified to rely upon. It does not specify that
operations on a set of objects must come after an operation
on an unrelated object.
14. An Examination of SPEC2006 for STL Usage.
SPEC2006 was examined for usage of STL algorithms used within
it to guide implementation of algorithms within PPD.
[Two histograms: the distribution of STL algorithm calls in
483.xalancbmk and in 477.dealII (number of occurrences per
algorithm). Member functions such as set::count, map::count,
string::find, set::find and map::find dominate; the global
algorithms observed include copy, fill, sort, find, find_if,
count, count_if, accumulate, transform, for_each, swap,
lower_bound, upper_bound, binary_search, min_element,
max_element, inner_product, inplace_merge, stable_sort,
reverse, remove, remove_if, replace, fill_n, unique,
nth_element, equal and adjacent_find.]
15. The STL-style Parallel Algorithms.
That analysis, amongst other reasons, led me to implement:
1. for_each()
2. find_if() and find()
3. count_if() and count()
4. transform() (both overloads)
5. copy()
6. accumulate() (both overloads)
16. Technical Details, Theorems.
Due to the restricted properties of the execution contexts and the
thread pools a few important results arise:
1. The thread schedule created is only an acyclic, directed graph.
In fact it is a tree.
2. From this property I have proved that the schedule PPD
generates is deadlock and race-condition free.
3. Moreover in implementing the STL-style algorithms those
implementations are efficient, i.e. there are provable bounds
on both the execution time and minimum number of
processors required to achieve that time.
17. The Key Theorems: Deadlock & Race-Condition Free.
1. Deadlock Free:
Theorem
That if the user refrains from using any other threading-related
items or atomic objects other than those provided by PPD, for
example those in 19 and 21, then they can be guaranteed to have a
schedule free of deadlocks.
2. Race-condition Free:
Theorem
That if the user refrains from using any other threading-related
items or atomic objects other than those provided by PPD, for
example those in 19 and 21, and that the work they wish to mutate
may not be aliased by another object, then the user can be
guaranteed to have a schedule free of race conditions.
18. The Key Theorems: Efficient Schedule with Futures.
Theorem
In summary, if the user only uses the threading-related items
provided by the execution contexts, then the schedule of work
transferred to PPD will be executed in an algorithmic order that is
O(n/p + log(p)), where n is the number of tasks and p the size of
the thread pool. In those cases PPD will add at most a constant
order to the execution time of the schedule.
Note that the user might implement code that orders the
transfers of work in a sub-optimal manner; PPD has very
limited abilities to automatically improve such schedules.
This theorem states that once the work has been transferred,
PPD will not add any further sub-efficient algorithmic delays,
but might add some constant-order delay.
19. Basic Example using execution_contexts.
Listing 1: General-Purpose use of a Thread Pool and Future.
struct res_t {
    int i;
};
struct work_type {
    void process(res_t &) {}
};
pool_type pool(2);
execution_context context(pool << joinable() << time_critical() << work_type());
pool.erase(context);
context->i;
To assist in making the code appear to be readable, a number
of typedefs have been omitted.
The execution_context is created from adding the wrapped
work to the pool.
process(res_t &) is the only invasive artifact of the library
for this use-case.
Alternative member functions may be specified, if a more
complex form is used.
Compile-time check: the type of the transferred work is the
same as that declared in the execution_context.
20. The typedefs used.
Listing 2: typedefs in the execution context example.
typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work, pool_traits::fixed_size,
    pool_adaptor<
        generic_traits::joinable, platform_api, heavyweight_threading,
        pool_traits::normal_fifo, std::less, 1
    >
> pool_type;
typedef pool_type::create1<work_type>::creator_t creator_t;
typedef creator_t::execution_context execution_context;
typedef creator_t::joinable joinable;
typedef pool_type::priority<api_params_type::time_critical> time_critical;
Expresses the flexibility & optimizations available:
pool_traits::prioritised_queue: specifies work should
be ordered by the trait std::less.
Could be used to order work by an estimate of time taken to
complete - very interesting, but not implemented.
GSS(k) scheduling implemented: number specifies batch size.
execution_context checks the argument-type of
process(...) at compile-time,
also if const, i.e. pure, this could be used to implement
memoizing - also very interesting, but not implemented.
21. Example using for_each().
Listing 3: For Each with a Thread Pool and Future.
typedef ppd::safe_colln<
    vector<int>, lock_traits::recursive_critical_section_lock_type
> vtr_colln_t;
typedef pool_type::void_exec_cxt execution_context;
struct accumulate {
    static int last; // for_each needs some kind of side-effect...
    void operator()(int t) { last += t; }
};
vtr_colln_t v;
v.push_back(1); v.push_back(2);
execution_context context(pool << joinable() << pool.for_each(v, accumulate()));
*context;
The safe_colln adaptor provides the lock for the collection,
in this case a simple EREW lock,
only locks the collection, not the elements in it.
The for_each() returns an execution_context:
The input collection has a read-lock taken on it.
Released when all of the asynchronous applications of the
functor have completed and the read-lock has been dropped.
It makes no sense to return a random copy of the input functor.
22. Example using transform().
Listing 4: Transform with a Thread Pool and Future.
typedef ppd::safe_colln<
    vector<long>, lock::rw<lock_traits>::decaying_write,
    no_signalling, lock::rw<lock_traits>::read
> vtr_out_colln_t;
vtr_colln_t v; vtr_out_colln_t v_out;
v.push_back(1); v.push_back(2);
pool_type::void_exec_cxt context(
    pool << joinable() << pool.transform(
        v, v_out, std::negate<vtr_colln_t::value_type>()
    )
);
*context;
vtr_out_colln_t provides a CREW lock on v_out: it is
re-sized, then the write-lock atomically decays to a read-lock.
Note how the collection is locked, but not the data elements
themselves.
The transform() returns an execution_context:
Released when all of the asynchronous applications of the
unary operation have completed & the read-locks have been
dropped.
Similarly, it makes no sense to return a random iterator.
23. Example using accumulate().
Listing 5: Accumulate with a Thread Pool and Future.
typedef pool_type::accumulate_t<
    vtr_colln_t
>::execution_context execution_context;
vtr_colln_t v;
v.push_back(1); v.push_back(2);
execution_context context(
    pool.accumulate(v, 1, std::plus<vtr_colln_t::value_type>())
);
*context;
The accumulate() returns an execution_context:
Released when all of the asynchronous applications of the
binary operation have completed & the read-lock is dropped.
Note the use of the accumulate_t type: it contains a
specialized counter that atomically accrues the result using
suitable locking according to the API and type.
This is effectively a map-reduce operation.
find(), find_if(), count() and count_if() are also
implemented, but omitted, for brevity.
24. Example using logical_or().
Listing 6: Logical Or with a Thread Pool and a Future.
struct bool_work_type {
    void process(bool &) const {}
};
pool_type pool(2);
pool_type::bool_exec_cxt context(
    pool.logical_or(bool_work_type(), bool_work_type())
);
*context;
The logical_or() returns an execution_context:
Each argument is asynchronously computed, independent of
each other:
the compiler attempts to verify that they are pure, with no
side-effects (they could be memoized),
no short-circuiting, so it deviates from the built-in
operator semantics required by the C++ Standard.
Once the results are available, the execution_context is
released.
logical_and() has been equivalently implemented.
25. Example using boost::bind.
Listing 7: Boost Bind with a Thread Pool and a Future.
struct res_t {
    int i;
};
struct work_type {
    res_t exec() { return res_t(); }
};
pool_type pool(2);
execution_context context(
    pool << joinable() << boost::bind(&work_type::exec, work_type())
);
context->i;
Boost is a very important C++ library: transferring
boost::bind() is supported, returning execution_context:
No unbound arguments nor placeholders in the call to
boost::bind().
Result-type of the function call should be copy-constructible.
Contrast with the reference argument result-type, earlier.
The cost: stack-space for the function-pointer & bound
arguments.
Completion of mutation releases the execution_context.
26. Some Thoughts Arising from the Examples.
The read-locks on the collections assist the user in avoiding
operations that might invalidate iterators on those collections.
Internally PPD creates tasks that recursively divide the
collections into p equal-length sub-sequences, where p is the
number of threads in the pool, at most O (log (min (p, n))).
The operations on the elements should be thread-safe & must
not affect the validity of iterators on those collections.
The user may call many STL-style algorithms in succession,
each one placing work into the thread pool to be operated
upon, and later recall the results or note the completion of
those operations via the returned execution_contexts.
27. Sequential Operation.
Recall the thread_pool typedef that contained the API and
threading model:
Listing 8: Definitions of the typedef for the thread pool.
typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work, pool_traits::fixed_size,
    generic_traits::joinable, platform_api, heavyweight_threading,
    pool_traits::normal_fifo, std::less, 1
> pool_type;
To make the thread_pool and therefore all operations on it
sequential, and remove all threading operations, one merely needs
to replace the trait heavyweight_threading with the trait
sequential.
It is that simple, no other changes need be made to the client
code, so all threading-related bugs have been removed.
This makes debugging significantly easier by separating the
business logic from the threading algorithms.
28. Other Features.
Other parallel algorithms implemented:
unary_fun() and binary_fun() which would be used to
execute the operands in parallel, in a similar manner to the
similarly-named STL binary functions.
Examples omitted for brevity.
Able to provide estimates of:
The optimal number of threads required to complete the
current work-load.
The minimum time to complete the current work-load given
the current number of threads.
This could provide:
An indication of soft-real-time performance.
Furthermore this could be used to implement dynamic tuning
of the thread pools.
29. Conclusions.
Recollecting my original problem statement, what has been
achieved?
A library that assists with creating deadlock and race-condition
free schedules.
Given the typedefs, a procedural way to automatically convert
STL algorithms into efficient threaded ones.
The ability to add to the business logic parallel algorithms that
are sufficiently separated to allow relatively simple debugging
and testing of the business logic independently of the parallel
algorithms.
All this in a traditionally hard-to-thread imperative language,
such as C++, indicating that it is possible to re-implement
such an example library in many other programming languages.
These are, in my opinion, rather useful features of the library!