A programming model that lets users program with high productivity while producing high-performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that the scheduling of a program can be decoupled from its core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve the best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling are also described.
An evaluation on a well-studied benchmark suite reveals that factoring programs into operators, schedulers, and data structures can produce significant performance improvements over unfactored approaches. A comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often by orders of magnitude, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.
The document discusses new features in Java 8 including default methods in interfaces, lambda expressions, and streams. Default methods allow interfaces to include method implementations. Lambda expressions allow treating functionality as a method argument and code as data. Streams provide utilities for functional-style operations on collections of values and include intermediate and terminal operations.
This document discusses the author's first experience using lambda expressions in Java 8. It provides instructions on setting up Java 8 and examples of using lambda expressions for threads, collections, and mapping/reducing streams. Lambda expressions allow passing functionality as arguments to replace anonymous inner classes. The examples demonstrate using lambda expressions for runnables, foreach loops, and mapping/reducing operations. References for further reading on Java 8 lambdas are also included.
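The three uses the summary mentions (a lambda as a Runnable, forEach over a collection, and map/reduce on a stream) can be sketched in a few lines; the class and method names below are illustrative, not from the original slides:

```java
import java.util.Arrays;
import java.util.List;

public class LambdaBasics {
    // A lambda replaces the anonymous inner class that Runnable used to require.
    static String runGreeting() {
        StringBuilder out = new StringBuilder();
        Runnable r = () -> out.append("hello");
        r.run();
        return out.toString();
    }

    // A map/reduce pipeline over a collection's stream.
    static int sumOfSquares(List<Integer> xs) {
        return xs.stream().map(x -> x * x).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(runGreeting());                         // hello
        Arrays.asList("a", "b").forEach(s -> System.out.println(s));
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3)));  // 14
    }
}
```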
Quicksort Algorithm, simply defined through animations (Mahesh Tibrewal)
I've seen many presentations on the quicksort algorithm that are very good but fail to maintain simplicity and clarity. So, I've prepared a very simple presentation to help students understand the operation better.
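For reference, the operation those animations walk through is the partition-and-recurse step; a minimal sketch (Lomuto partitioning, last element as pivot):

```java
import java.util.Arrays;

public class QuickSort {
    // Lomuto partition: move elements smaller than the pivot to the left,
    // place the pivot at its final position, then recurse on both halves.
    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        sort(a, lo, i - 1);
        sort(a, i + 1, hi);
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 9, 1, 5, 6};
        sort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [1, 2, 5, 5, 6, 9]
    }
}
```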
This document summarizes and compares various scheduling algorithms for real-time systems, including:
1) Non-real-time algorithms like First Come First Served (FCFS), Shortest Job First (SJF), Round Robin (RR), and Priority Scheduling.
2) Real-time aperiodic task algorithms like Earliest Deadline First (EDF) and Jackson's Earliest Due Date (EDD).
3) Real-time periodic task algorithms like Rate Monotonic (RM), Deadline Monotonic (DM), and Earliest Deadline First Scheduling (EDF).
It discusses properties, constraints, implementation details, and comparisons between static-priority and dynamic-priority scheduling.
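As an illustration of the dynamic-priority policies listed above, EDF can be sketched as a priority queue keyed on absolute deadlines; the task names and deadline values here are invented for the example:

```java
import java.util.PriorityQueue;

public class EdfScheduler {
    static class Task {
        final String name;
        final int deadline; // absolute deadline, e.g. in ms
        Task(String name, int deadline) { this.name = name; this.deadline = deadline; }
    }

    // EDF: always dispatch the ready task whose absolute deadline is earliest.
    final PriorityQueue<Task> ready =
            new PriorityQueue<>((a, b) -> Integer.compare(a.deadline, b.deadline));

    void release(Task t) { ready.add(t); }
    Task dispatch() { return ready.poll(); }

    public static void main(String[] args) {
        EdfScheduler s = new EdfScheduler();
        s.release(new Task("logger", 30));
        s.release(new Task("sensor", 10));
        s.release(new Task("radio", 20));
        System.out.println(s.dispatch().name); // sensor (earliest deadline)
    }
}
```

Static-priority schemes like RM differ only in the key: the comparator would order by period (fixed at design time) rather than by the current absolute deadline.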
The document proposes translating rules expressed in Spatio-Temporal Reach and Escape Logic (STREL) to streaming-based monitoring applications. STREL allows expressing properties over attributes that vary in space and time. The contribution is defining streaming operators whose semantics enforce STREL rules by composing base streaming operators. An evaluation on Apache Flink shows the approach can achieve throughput between 500 and 1000 tuples per second and sub-millisecond latency, depending on the spatial and temporal resolution of the data. Future work includes further evaluation, additional temporal operators, path-based spatial analysis, and compilation optimizations.
This document discusses Java 8 streams and how they are implemented under the hood. Streams use a pipeline concept where the output of one unit becomes the input of the next. Streams are built on functional interfaces and lambdas introduced in Java 8. Spliterators play an important role in enabling parallel processing of streams by splitting data sources into multiple chunks that can be processed independently in parallel. The reference pipeline implementation uses a linked list of processing stages to represent the stream operations and evaluates whether operations can be parallelized to take advantage of multiple threads or cores.
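The splitting behavior described above is easiest to see with a range source: its spliterator partitions the index space into chunks that the common fork/join pool processes independently and then combines. A minimal sketch:

```java
import java.util.stream.LongStream;

public class ParallelPipeline {
    public static void main(String[] args) {
        // Each pipeline stage feeds the next; .parallel() asks the stream to
        // split its source via its Spliterator and merge partial results.
        long sequential = LongStream.rangeClosed(1, 1_000_000).sum();
        long parallel   = LongStream.rangeClosed(1, 1_000_000).parallel().sum();
        System.out.println(sequential == parallel); // true
        System.out.println(parallel);               // 500000500000
    }
}
```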
Transformer Mods for Document Length Inputs (Sujit Pal)
The Transformer architecture is responsible for many state-of-the-art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, the superior performance comes at a cost: an O(n²) time and memory complexity, where n is the size of the input sequence. Because of this, it is computationally infeasible to feed large documents to the standard transformer. To overcome this limitation, a number of approaches have been proposed, which involve modifying the self-attention mechanism in interesting ways.
In this presentation, I will describe the transformer architecture, and specifically the self-attention mechanism, and then describe some of the approaches proposed to address the O(n²) complexity. Some of these approaches have also been implemented in the HuggingFace transformers library, and I will demonstrate some code for doing document-level operations using one of these approaches.
Reactive System
The intuition is that a transition system consists of a set of possible states for the system and a set of transitions - or state changes - which the system can effect.
When a state change is the result of an external event or of an action made by the system, then that transition is labeled with that event or action.
Particular states or transitions in a transition system can be distinguished.
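The intuition above can be sketched as a map from (state, label) pairs to successor states; the states and labels below are invented purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class TransitionSystem {
    // transitions.get(state).get(label) -> next state
    final Map<String, Map<String, String>> transitions = new HashMap<>();
    String state; // the distinguished current state

    TransitionSystem(String initial) { this.state = initial; }

    void addTransition(String from, String label, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
    }

    // A labeled transition fires only if it is defined in the current state.
    boolean fire(String label) {
        Map<String, String> row = transitions.get(state);
        String next = (row == null) ? null : row.get(label);
        if (next == null) return false;
        state = next;
        return true;
    }

    public static void main(String[] args) {
        TransitionSystem door = new TransitionSystem("closed");
        door.addTransition("closed", "open", "opened");
        door.addTransition("opened", "close", "closed");
        System.out.println(door.fire("open"));  // true
        System.out.println(door.state);         // opened
        System.out.println(door.fire("open"));  // false: not defined here
    }
}
```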
Tech Days 2015: User Presentation, Vermont Technical College (AdaCore)
The document discusses lessons learned from NASA's ELaNa IV CubeSat launch initiative. It notes that of the 12 university CubeSats launched, only 4 were heard from and most were non-functional due to issues like software errors. The document advocates for using the SPARK/Ada programming language and static analysis tools for CubeSat software to improve reliability, noting their CubeSat was the only ELaNa IV satellite still operating. It also describes new CubeSat software called CubedOS being developed in SPARK/Ada to provide a more robust and reusable framework.
This document provides an overview of transfer functions and stability analysis of linear time-invariant (LTI) systems. It discusses how the Laplace transform can be used to represent signals as algebraic functions and calculate transfer functions as the ratio of the Laplace transforms of the output and input. Poles and zeros are introduced as important factors for stability. A system is stable if all its poles reside in the left half of the s-plane and unstable if any pole resides in the right half-plane. Examples are provided to demonstrate calculating transfer functions from differential equations and analyzing stability based on pole locations.
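As a worked instance of the procedure described above (using a mass-spring-damper system chosen purely for illustration), the differential equation, its Laplace transform with zero initial conditions, and the resulting transfer function are:

```latex
% Mass-spring-damper: m\ddot{y}(t) + b\dot{y}(t) + k y(t) = u(t)
% Laplace transform (zero initial conditions):
%   (m s^2 + b s + k)\, Y(s) = U(s)
G(s) = \frac{Y(s)}{U(s)} = \frac{1}{m s^2 + b s + k}
```

For example, with m = 1, b = 3, k = 2 the poles are the roots of s² + 3s + 2 = 0, i.e. s = −1 and s = −2; both lie in the left half of the s-plane, so the system is stable.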
This document provides an overview of the evolution of the java.util.concurrent.ConcurrentHashMap class. It discusses the motivations and improvements between different versions (v5, v6, v8) of the implementation, moving from using segments with reentrant locks in v5, to using unsafe operations and spin locks in v6, and removing segments and using nodes as locks in v8. It also describes new bulk operations introduced in v8 like search, forEach, and reduce.
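The v8 bulk operations mentioned above (search, forEach, reduce) each take a parallelism threshold as their first argument; elements beyond that count may be processed in parallel. A small sketch with illustrative data:

```java
import java.util.concurrent.ConcurrentHashMap;

public class ChmBulkOps {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> scores = new ConcurrentHashMap<>();
        scores.put("alice", 10);
        scores.put("bob", 25);
        scores.put("carol", 17);

        // search: stops at the first non-null result from the function.
        String over20 = scores.search(1, (k, v) -> v > 20 ? k : null);
        System.out.println(over20); // bob

        // reduceValues: folds all values with an associative reducer.
        Integer total = scores.reduceValues(1, Integer::sum);
        System.out.println(total); // 52

        // forEach: applies an action to every entry, possibly in parallel.
        scores.forEach(1, (k, v) -> System.out.println(k + "=" + v));
    }
}
```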
Innovative Solar Array Drive Assembly for CubeSat Satellite (Michele Marino)
The CubeSat is an attractive option for reliable, low-cost space mission development. Growing CubeSat performance is leading to more extensive nanosatellite applications; telecommunication and Earth observation missions are currently under development in both single-satellite and constellation configurations. The main targets for future nanosatellites are accurate attitude pointing, high data-rate transfer, and increased power generation, yet on-board power and energy availability limits CubeSat performance in processing capability, transmission power, and attitude/orbit maneuvers. To address these constraints, IMT has developed an innovative unit, the nano Solar Array Drive Assembly (SADA) for 3U CubeSats, with the aim of increasing photovoltaic energy generation (up to an average of 35 W EOL). It is composed of two independent solar arrays (the Wings Assembly) and a rotary mechanism with logic unit (SAC, Solar Array Control). The aim of SADA is to keep the two solar arrays constantly aligned with the Sun direction about one axis. The rotary system comprises drive gear sets, stepper motors, and slip rings; the high gearhead reduction ratio and two dedicated photodiodes (serving as sun sensors) allow fine pointing accuracy (<5°). Several operating modes are implemented and controlled by the on-board computer through the I2C and CAN buses: autonomous (sun detection and pointing), slave, and cooperative. An advanced and smart control algorithm was developed and implemented in the logic unit. The solar array points along the direction of maximum solar flux, with a maximum output speed of 4°/s (step size 0.004°). A system failure control prevents thermal and power damage in case one or both wings are blocked. SADA is fully compliant with the CubeSat form factor (3U or greater) and bus (CSKB, CubeSat Kit Bus). During the launch phase, the solar wings are stowed beside the CubeSat structure (on opposite side faces); the overall thickness is less than 9 mm, compliant with the ISIPOD dispenser. The logic and drive unit (SAC), small (90 x 90 x 12 mm) and light (185 g), is housed inside the satellite. The wings are electrically connected to the SAC by means of two 16-channel slip rings (1 A per contact), allowing continuous rotation without cable wind-up. The alignment calibration system ensures that the unit runs correctly with up to 10 mm of misalignment between the SAC and the satellite's geometric center along the Z direction. The generated power is not handled by the SAC but by the PDU through standard Molex connectors. The two wings, stowed during launch, are deployed in orbit; to increase system reliability, the deployment is based on two redundant thermal-cutter systems. In the final configuration, the 3U CubeSat has two wings, each 300 x 300 mm with 36 AzurSpace 3J solar cells.
ME-314 Introduction to Control Engineering is a course taught to Mechanical Engineering senior undergrads. The course is taught by Dr. Bilal Siddiqui at DHA Suffa University. This lecture is about block diagram reduction for finding closed loop transfer functions.
PopMAG: Pop Music Accompaniment Generation (ivaderivader)
This paper introduces PopMAG, a model that generates pop music accompaniment across multiple tracks simultaneously. It uses a novel multi-track MIDI representation called MuMIDI that encodes notes from different tracks into a single sequence, allowing the model to explicitly capture dependencies between tracks. The model achieves state-of-the-art results on three datasets based on both subjective listener evaluations and objective metrics. Ablation studies demonstrate the effectiveness of modeling notes as single tokens, using context memory in the encoder/decoder, and including bar/position embeddings.
This document summarizes algorithms for solving the lowest common ancestor (LCA) problem and range minimum query (RMQ) problem in trees. It presents a reduction from LCA to RMQ that allows solving LCA in O(n) time using an O(n) time RMQ algorithm. It describes several RMQ algorithms, including a naive O(n^3) time algorithm, an O(n^2) dynamic programming algorithm, and an O(n log n) sparse table algorithm. For the special case where the values in the array differ by at most 1, it presents an O(n) time and O(1) time RMQ algorithm based on partitioning the array into blocks.
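The O(n log n)-preprocessing, O(1)-query sparse-table algorithm mentioned above can be sketched as follows (this is the simple variant, not the block-partitioned O(n) one):

```java
public class SparseTableRmq {
    // table[j][i] holds the minimum of a[i .. i + 2^j - 1].
    private final int[][] table;
    private final int[] log;

    SparseTableRmq(int[] a) {
        int n = a.length;
        log = new int[n + 1];
        for (int i = 2; i <= n; i++) log[i] = log[i / 2] + 1;
        int levels = log[n] + 1;
        table = new int[levels][n];
        table[0] = a.clone();
        // Each level doubles the block size by combining two half-blocks.
        for (int j = 1; j < levels; j++)
            for (int i = 0; i + (1 << j) <= n; i++)
                table[j][i] = Math.min(table[j - 1][i],
                                       table[j - 1][i + (1 << (j - 1))]);
    }

    // O(1) query: cover [l, r] with two overlapping power-of-two blocks.
    int min(int l, int r) {
        int j = log[r - l + 1];
        return Math.min(table[j][l], table[j][r - (1 << j) + 1]);
    }

    public static void main(String[] args) {
        SparseTableRmq rmq = new SparseTableRmq(new int[]{5, 2, 4, 7, 1, 3});
        System.out.println(rmq.min(0, 2)); // 2
        System.out.println(rmq.min(2, 5)); // 1
    }
}
```

The LCA reduction runs this structure over the Euler tour of the tree, where adjacent depths differ by exactly 1, which is what enables the O(n)/O(1) special case.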
Briefing - The Atlas V Aft Bulkhead Carrier Update - Past Missions, Upcoming... (Dave Callen)
This document summarizes the Atlas V Aft Bulkhead Carrier (ABC) system used to deploy cubesats from the aft end of the Centaur upper stage. It provides an overview of the ABC, details on past and upcoming missions using it, lessons learned, and planned improvements. The ABC allows deployment of small payloads up to 80kg and has supported over 30 cubesats to date without affecting the primary payload. Future enhancements include reducing predicted vibration environments and qualifying additional 6U cubesat deployers.
This document summarizes a numerical analysis project using the Euler method to simulate projectile motion in MATLAB. It contains sections on the Euler method, applying it to physical systems like projectile motion, implementing it theoretically and in MATLAB, and presenting the output and conclusions. The group members are listed and the document contains the equations of motion for a projectile and details on setting up and running the simulation in MATLAB.
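The original project is in MATLAB; the same forward-Euler integration of drag-free projectile motion can be sketched in Java (initial velocity and step size chosen arbitrarily for the example):

```java
public class EulerProjectile {
    // Returns {flightTime, range} for a projectile under gravity only,
    // integrated with forward Euler at step dt.
    static double[] simulate(double vx0, double vy0, double dt) {
        double g = 9.81, x = 0, y = 0, vx = vx0, vy = vy0, t = 0;
        while (y >= 0) {      // stop when the projectile returns to the ground
            x += vx * dt;     // forward Euler: state += derivative * dt
            y += vy * dt;
            vy -= g * dt;
            t += dt;
        }
        return new double[]{t, x};
    }

    public static void main(String[] args) {
        double[] r = simulate(10, 20, 1e-4);
        // Analytic flight time is 2*vy0/g ≈ 4.077 s, range ≈ 40.77 m;
        // Euler with a small step should land close to both.
        System.out.printf("t=%.3f s, range=%.2f m%n", r[0], r[1]);
    }
}
```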
Augmenting Field Data for Testing Systems Subject to Incremental Requirements...Lionel Briand
The document describes a technique for automatically generating test input data for systems subject to incremental requirements changes. It involves capturing new data requirements in an updated data model, generating an incomplete updated model instance, and using slicing-based constraint solving and model transformation to produce a valid updated model instance satisfying all constraints. This allows test cases to be reused when requirements and data fields are changed.
Performance of Java 8 and Beyond - Jeroen Borgers (NLJUG)
We all know that the biggest improvement Java 8 brings is support for lambda expressions, which introduces functional programming to Java. The addition of the Stream API makes this improvement even bigger: iteration can now be handled internally by a library, so you can apply the principle of "tell, don't ask" to collections. You can simply say that a function should be applied to your collection, or that this should happen in parallel, across multiple cores. But what does this mean for the performance of our Java applications? Can we now immediately exploit all our CPU cores to get better response times? How do filter/map/reduce and parallel streams work internally? How is the fork/join framework used here? Are lambdas faster than inner classes? All these questions are answered in this session. In addition, Java 8 introduces more performance improvements: tiered compilation, PermGen removal, java.time, accumulators, adders, and Map improvements. Finally, we will also take a look at the planned performance improvements for Java 9: exploiting GPUs, value types, and Arrays 2.0.
This document provides an overview of algorithm analysis and determining the time complexity of algorithms. It discusses that the time an algorithm takes to run can be estimated by counting the number of basic operations and expressing the runtime using asymptotic notation. Examples are provided to demonstrate how to analyze the runtime of simple algorithms with loops and nested loops. The key growth rates like constant, linear, quadratic, and exponential are defined. Determining the highest order term provides the overall time complexity of an algorithm in Big O notation.
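Counting basic operations directly, as the summary suggests, makes the growth rate visible; a small sketch:

```java
public class OpCounting {
    // Count the basic operations executed by a doubly nested loop over n items.
    static long nestedLoopOps(int n) {
        long count = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                count++;  // one "basic operation" per inner iteration
        return count;
    }

    public static void main(String[] args) {
        // Doubling n quadruples the count: the signature of O(n^2) growth.
        System.out.println(nestedLoopOps(10)); // 100
        System.out.println(nestedLoopOps(20)); // 400
    }
}
```

The highest-order term of the count (here n²) gives the Big O classification.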
Generalized Linear Models in Spark MLlib and SparkR (Databricks)
Generalized linear models (GLMs) unify various statistical models such as linear regression and logistic regression through the specification of a model family and link function. They are widely used in modeling, inference, and prediction with applications in numerous fields. In this talk, we will summarize recent community efforts in supporting GLMs in Spark MLlib and SparkR. We will review supported model families, link functions, and regularization types, as well as their use cases, e.g., logistic regression for classification and log-linear model for survival analysis. Then we discuss the choices of solvers and their pros and cons given training datasets of different sizes, and implementation details in order to match R’s model output and summary statistics. We will also demonstrate the APIs in MLlib and SparkR, including R model formula support, which make building linear models a simple task in Spark. This is a joint work with Eric Liang, Yanbo Liang, and some other Spark contributors.
A Lightweight Infrastructure for Graph Analytics (Donald Nguyen)
Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently. In this paper, we argue that existing DSLs can be implemented on top of a general-purpose infrastructure that (i) supports very fine-grain tasks, (ii) implements autonomous, speculative execution of these tasks, and (iii) allows application-specific control of task scheduling policies. To support this claim, we describe such an implementation called the Galois system.
We demonstrate the capabilities of this infrastructure in three ways. First, we implement more sophisticated algorithms for some of the graph analytics problems tackled by previous DSLs and show that end-to-end performance can be improved by orders of magnitude even on power-law graphs, thanks to the better algorithms facilitated by a more general programming model. Second, we show that, even when an algorithm can be expressed in existing DSLs, the implementation of that algorithm in the more general system can be orders of magnitude faster when the input graphs are road networks and similar graphs with high diameter, thanks to more sophisticated scheduling. Third, we implement the APIs of three existing graph DSLs on top of the common infrastructure in a few hundred lines of code and show that even for power-law graphs, the performance of the resulting implementations often exceeds that of the original DSL systems, thanks to the lightweight infrastructure.
This document outlines the syllabus for a course on data structures and algorithms using Java. It covers topics such as the role of algorithms and data structures, algorithm design techniques, types of data structures including primitive types, arrays, stacks, queues, linked lists, trees, graphs, and algorithm analysis. Specific algorithms and data structures discussed include sorting, searching, priority queues, stacks, queues, linked lists, trees, graphs, hashing, and complexity theory.
Data Structure and Algorithm Using Java (Narayan Sau)
This presentation was created for people who want to go back to the basics of data structures and their implementation. It mostly helps B.Tech and B.Sc. computer science students, as well as any programmer who wants to develop software in core areas.
Here is a recursive function to check if a list contains an element:

(defun contains (element list)
  (cond ((null list) nil)
        ((equal element (car list)) t)
        (t (contains element (cdr list)))))

To check the guest list:

(contains 'robocop guest-list)

This function:
1. Base case: if the list is empty, the element is not contained, so return nil.
2. If the element equals the car of the list, return t.
3. Otherwise, recursively call contains on the element and the cdr of the list.
So it recursively traverses the list until it finds a match or reaches the empty list.
Parallel processing involves executing multiple tasks simultaneously using multiple cores or processors. It can provide performance benefits over serial processing by reducing execution time. When developing parallel applications, developers must identify independent tasks that can be executed concurrently and avoid issues like race conditions and deadlocks. Effective parallelization requires analyzing serial code to find optimization opportunities, designing and implementing concurrent tasks, and testing and tuning to maximize performance gains.
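The decomposition the summary describes, identifying independent tasks, running them concurrently, and combining results, can be sketched with an ExecutorService; the chunked-sum example below is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Split [0, n) into independent chunks, sum each on its own thread,
    // then combine the partial results. Chunks share no mutable state,
    // which is what rules out race conditions here.
    static long parallelSum(int n, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = n / workers;
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                int lo = w * chunk;
                int hi = (w == workers - 1) ? n : lo + chunk;
                tasks.add(() -> {
                    long s = 0;
                    for (int i = lo; i < hi; i++) s += i;
                    return s;
                });
            }
            long total = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(1_000_000, 4)); // 499999500000
    }
}
```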
Deterministic Galois: On-demand, Portable and Parameterless (Donald Nguyen)
Non-determinism in program execution can make program development and debugging difficult. In this paper, we argue that solutions to this problem should be on-demand, portable and parameterless. On-demand means that the programming model should permit the writing of non-deterministic programs since these programs often perform better than deterministic programs for the same problem. Portable means that the program should produce the same answer even if it is run on different machines. Parameterless means that if there are machine-dependent scheduling parameters that must be tuned for good performance, they must not affect the output.
Although many solutions for deterministic program execution have been proposed in the literature, they fall short along one or more of these dimensions. To remedy this, we propose a new approach, based on the Galois programming model, in which (i) the programming model permits the writing of non-deterministic programs and (ii) the runtime system executes these programs deterministically if needed. Evaluation of this approach on a collection of benchmarks from the PARSEC, PBBS, and Lonestar suites shows that it delivers deterministic execution with substantially less overhead than other systems in the literature.
Linear search examines each element of a list sequentially, one by one, checking whether it is the target value. Its time complexity is O(n), since in the worst case it must examine every element. While simple to implement, linear search is inefficient for large lists; alternatives such as binary search (on sorted data) require far fewer comparisons.
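The algorithm just described fits in a few lines:

```java
public class LinearSearch {
    // Examine each element in order; O(n) comparisons in the worst case
    // (target absent or in the last position).
    static int indexOf(int[] a, int target) {
        for (int i = 0; i < a.length; i++)
            if (a[i] == target) return i;
        return -1; // not found
    }

    public static void main(String[] args) {
        int[] a = {7, 3, 9, 3};
        System.out.println(indexOf(a, 9)); // 2
        System.out.println(indexOf(a, 5)); // -1
    }
}
```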
The big language features for Java SE 8 are lambda expressions (a.k.a. closures) and default methods (a.k.a. virtual extension methods). Adding closures to the Java language opens up a host of new expressive opportunities for applications and libraries, but how are they implemented? You might assume that lambda expressions are simply a compact syntax for inner classes, but, in fact, the implementation of lambda expressions is substantially different and builds on the invokedynamic feature added in Java SE 7.
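A minimal comparison of the two forms (behavior is identical; as noted above, the compilation strategies differ: the anonymous class becomes its own class file, while the lambda becomes an invokedynamic call site that materializes the function object at first use):

```java
import java.util.function.IntUnaryOperator;

public class LambdaVsInnerClass {
    public static void main(String[] args) {
        // Pre-Java-8 style: an anonymous inner class.
        IntUnaryOperator incInner = new IntUnaryOperator() {
            @Override public int applyAsInt(int x) { return x + 1; }
        };
        // Java 8 lambda: same behavior, more compact syntax.
        IntUnaryOperator incLambda = x -> x + 1;

        System.out.println(incInner.applyAsInt(41));  // 42
        System.out.println(incLambda.applyAsInt(41)); // 42
    }
}
```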
Tech Days 2015: User Presentation Vermont Technical CollegeAdaCore
The document discusses lessons learned from NASA's ELaNa IV CubeSat launch initiative. It notes that of the 12 university CubeSats launched, only 4 were heard from and most were non-functional due to issues like software errors. The document advocates for using the SPARK/Ada programming language and static analysis tools for CubeSat software to improve reliability, noting their CubeSat was the only ELaNa IV satellite still operating. It also describes new CubeSat software called CubedOS being developed in SPARK/Ada to provide a more robust and reusable framework.
This document provides an overview of transfer functions and stability analysis of linear time-invariant (LTI) systems. It discusses how the Laplace transform can be used to represent signals as algebraic functions and calculate transfer functions as the ratio of the Laplace transforms of the output and input. Poles and zeros are introduced as important factors for stability. A system is stable if all its poles reside in the left half of the s-plane and unstable if any pole resides in the right half-plane. Examples are provided to demonstrate calculating transfer functions from differential equations and analyzing stability based on pole locations.
This document provides an overview of the evolution of the java.util.concurrent.ConcurrentHashMap class. It discusses the motivations and improvements between different versions (v5, v6, v8) of the implementation, moving from using segments with reentrant locks in v5, to using unsafe operations and spin locks in v6, and removing segments and using nodes as locks in v8. It also describes new bulk operations introduced in v8 like search, forEach, and reduce.
Innovative Solar Array Drive Assembly for CubeSat SatelliteMichele Marino
The CubeSat satellite is a smart option for reliable and low cost space mission development. Growing
CubeSat performances lead to more extensive nanosatellite application. Currently,
Telecommunication and Earth Observation missions are under development both in single and
constellation configurations. The main targets for the future nanosatellite are: accurate attitude
pointing, high data rate transfer, increased power generation. The on board power/energy availability
reduces or limits the CubeSat performances in terms of processing capabilities, power transmission
and attitude/orbit maneuvers. Following these constraints, the IMT has developed an innovative unit,
named nano-Solar Array Drive Assembly (SADA) for 3U CubeSat, with the aim of increasing the
photovoltaic energy generation (up to an average 35W EOL). It is composed by two independent
Solar Arrays (Wings Assembly) and Rotatory Mechanisms / Logical Unit (SAC – Solar Array
Control). The aim of SADA is to align constantly the two Solar Arrays to the Sun direction, around
one axis. The rotatory system is composed by drive gear sets, stepper motors and slip rings. The high
value of gearhead reduction ratio and two dedicated photodiodes (as solar sensors) allow a fine
pointing accuracy (<5°). Several operation modes are implemented and controlled by the On Board
Computer through the I2C and CAN buses: autonomous (sun detection and pointing), slave or
cooperative. An advanced and smart control algorithm was developed and implemented in the logic
unit. The Solar Array points along the maximum solar flux direction, maximum output speed up to
4°/s (step size 0.004°). A system failure control avoids the thermal and power damaging in case one
or both wings are blocked. SADA is fully compliant with all CubeSat form factor (3U or greater) and
BUS (CSKB – CubeSat Kit Bus). The Solar Wings, during the launch phase, are stowed beside the
CubeSat structure (opposite side faces). The overall thickness is less than 9 mm, compliant to ISIPOD
dispenser. The Logical and Drive unit (SAC), small (90 x 90 x 12 mm) and light (185 gr), is allocated inside the satellite. The Wings are electrically connected to the SAC, by means of two 16 channels
slip rings (1A per contact) for a continuous rotation, without cable saturation. The Alignment
Calibration System assures that the unit runs correctly up to 10 mm of misalignment between the
SAC and the geometric satellite center, along Z direction. The generated power is not handled by
SAC, but by PDU through Standard Molex Connectors. The two wings, stowed during the launch
phase, are deployed in orbit. In order to increase the system reliability, the deployment is based on
two redundant thermal cutter systems. In the final configuration, the 3U CubeSat has two wings, each
one 300 x 300 mm and 36 AzurSpace 3J solar cells.
Galois: A System for Parallel Execution of Irregular Algorithms
1. Galois: A System for Parallel Execution of Irregular Algorithms
Donald Nguyen
The University of Texas at Austin
2. Parallel Programming is Changing
• Data
– Structured (vector, matrix) versus unstructured (graphs)
• Platforms
– Dedicated clusters versus cloud, mobile
• People
– Small number of highly trained scientists versus large number of self-trained parallel programmers
(In each contrast, the first item is the Old World and the second the New World)
3. The Search for Good Programming Models
• Tension between productivity and performance
– Support large number of application programmers with small number of expert parallel programmers
– Performance comparable to hand-written codes
• Galois project
[Figure labels: Joe, Stephanie]
5. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
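The decomposition above, Parallel Program = Operator + Schedule + Parallel Data Structure, can be sketched as a minimal sequential executor. The function `run` and its arguments are illustrative names, not the Galois API, and a real Galois executor runs the operator speculatively in parallel:

```python
from collections import deque
import heapq

def run(initial_work, operator, schedule="fifo", priority=None):
    """Drive `operator` over a worklist until no work remains.
    `operator` takes one active element and returns an iterable of newly
    activated elements; `schedule` changes only the processing order."""
    if schedule == "priority":             # ordered, Dijkstra-style
        wl = [(priority(x), x) for x in initial_work]
        heapq.heapify(wl)
        pop = lambda: heapq.heappop(wl)[1]
        push = lambda x: heapq.heappush(wl, (priority(x), x))
    else:                                  # unordered FIFO bag
        wl = deque(initial_work)
        pop = wl.popleft
        push = wl.append
    while wl:
        for new in operator(pop()):
            push(new)

# Toy operator: count down to 0, recording the processing order.
order = []
def countdown(n):
    order.append(n)
    return [n - 1] if n > 0 else []

run([3], countdown)                        # order becomes [3, 2, 1, 0]
```

Swapping `schedule` swaps the algorithm while the operator and the worklist interface stay fixed, which is the factoring the outline refers to.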
6. SSSP
• Find the shortest distance from the source node to all other nodes in a graph
– Label nodes with tentative distances
– Assume non-negative edge weights
– BFS is a special case
• Algorithms
– Chaotic relaxation O(2^V)
– Bellman-Ford O(VE)
– Dijkstra's algorithm O(E log V)
• Uses a priority queue
– Δ-stepping
• Uses a sequence of bags to prioritize work
• Δ = 1: Dijkstra
• Δ = ∞: chaotic relaxation
• Different algorithms are different schedules for applying relaxations
– Improved work efficiency
– Input sensitivity
– Locality
[Figure: example weighted graph; the source is labeled 0, all other nodes ∞; operator = relaxation]
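The claim that different SSSP algorithms are different schedules for the same operator can be sketched in plain C++ (this is an illustration, not the Galois API): the relaxation operator below never changes, only the worklist discipline does.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Sketch: the relaxation operator is fixed; swapping the worklist
// discipline swaps the algorithm.
//   FIFO worklist      -> chaotic / Bellman-Ford-style execution
//   min-priority queue -> Dijkstra-style execution
using Graph = std::vector<std::vector<std::pair<int, int>>>; // adj[n] = {(dst, weight)}

std::vector<int> sssp(const Graph& g, int src, bool dijkstra) {
  const int INF = std::numeric_limits<int>::max();
  std::vector<int> dist(g.size(), INF);
  dist[src] = 0;
  if (dijkstra) {
    using Item = std::pair<int, int>; // (tentative distance, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> wl;
    wl.push({0, src});
    while (!wl.empty()) {
      auto [d, n] = wl.top();
      wl.pop();
      if (d != dist[n]) continue;  // stale worklist entry
      for (auto [m, w] : g[n])     // operator: relax the edges of n
        if (d + w < dist[m]) { dist[m] = d + w; wl.push({dist[m], m}); }
    }
  } else {
    std::deque<int> wl{src};
    while (!wl.empty()) {
      int n = wl.front();
      wl.pop_front();
      for (auto [m, w] : g[n])     // the same relaxation operator
        if (dist[n] + w < dist[m]) { dist[m] = dist[n] + w; wl.push_back(m); }
    }
  }
  return dist;
}
```

Both schedules compute the same distances; the Dijkstra-style order is merely more work-efficient, which is exactly the point of treating the schedule as a separate, swappable concern.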
12. SSSP
[Figure: animation frames showing relaxations propagating from the source node through active nodes and their neighborhoods]
Algorithm = Operator + Schedule
13. Stencil Computation
• Finite difference computation
– Active nodes: nodes in A(t+1)
– Operator: 5-point stencil
– Different schedules have different locality
• Regular application
– Grid structure and active nodes are known statically
– Can be parallelized by a compiler
[Figure: grids A(t) and A(t+1)]
14. Abstraction of Algorithms
• Operator formulation
– Active elements: nodes or edges where there is work to be done
– Operator: computation at an active element
• Activity: application of the operator to an active element
• Neighborhood: graph elements read or written by an activity
– Ordering: order in which active elements must appear to have been processed
• Unordered algorithms: any order is fine (e.g., chaotic relaxation), but there may be soft priorities
• Ordered algorithms: algorithm-specific order (e.g., discrete-event simulation)
– Parallelism: process activities in parallel subject to neighborhood and ordering constraints
• Data parallelism is a special case (all nodes active, no conflicts)
[Figure: graph with active nodes and their neighborhoods highlighted]
22. TAO of Parallelism
Pingali, Nguyen, et al. The tao of parallelism in algorithms. PLDI 2011.
[Figure: taxonomy of parallelization strategies — Static, Inspector-Executor, Bulk-Synchronous, Round-based, Optimistic — with Galois positioned among them]
24. Galois Project
Parallel Program = Operator + Schedule + Parallel Data Structure
Stephanie: Parallel Data Structures
Joe: Operator + Schedule
Program = Algorithm + Data Structure
25. My Contribution
Design and implementation of a programming model for high-performance parallelism by factoring programs into operator + schedule + data structure
26. Themes
• Joe
– Program = Operator + Schedule + Data Structure
• Stephanie
– Scalability principle 1: disjoint access
• Accesses to distinct logical entities should be disjoint physically as well
• "Do no harm"
– Scalability principle 2: virtualize tasks
• Execution of tasks should not be eagerly tied to particular physical resources
• "Be flexible"
• Challenges
– Easy to introduce inadvertent sharing if not looking at the entire system stack
– Small task sizes (~100s of instructions)
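The disjoint-access principle can be illustrated with per-thread counters: logically distinct counters should also be physically disjoint, i.e., padded onto separate cache lines so threads never contend on the same line. A sketch (names and the 64-byte line size are my assumptions, not Galois code):

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Sketch: per-thread counters padded to a full (assumed 64-byte) cache line,
// so accesses to distinct logical entities are physically disjoint as well.
struct alignas(64) PaddedCounter {
  long value = 0;
};

long count_in_parallel(int num_threads, long per_thread) {
  std::vector<PaddedCounter> counters(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t)
    threads.emplace_back([&counters, t, per_thread] {
      for (long i = 0; i < per_thread; ++i)
        ++counters[t].value;  // each thread touches only its own line
    });
  for (auto& th : threads) th.join();
  long total = 0;
  for (const auto& c : counters) total += c.value;
  return total;
}
```

Without the `alignas(64)` padding the counters would share cache lines and the threads would false-share, even though no logical entity is shared.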
27. Publications
• M. Amber Hassaan, Donald Nguyen, Keshav Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015.
• Konstantinos Karantasis, Andrew Lenharth, Donald Nguyen, et al. Parallelization of reordering algorithms for bandwidth and wavefront reduction. SC 2014.
• Damian Goik, Konrad Jopek, Maciej Paszyński, Andrew Lenharth, Donald Nguyen, and Keshav Pingali. Graph grammar based multi-thread multi-frontal direct solver with Galois scheduler. ICCS 2014.
• Donald Nguyen, Andrew Lenharth, Keshav Pingali. Deterministic Galois: On-demand, parameterless and
portable. ASPLOS 2014.
• Donald Nguyen, Andrew Lenharth, Keshav Pingali. A lightweight infrastructure for graph analytics. SOSP 2013.
• Keshav Pingali, Donald Nguyen, Milind Kulkarni, et al. The tao of parallelism in algorithms. PLDI 2011.
• Donald Nguyen, Keshav Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
• Milind Kulkarni, Donald Nguyen, Dimitrios Prountzos, et al. Exploiting the commutativity lattice. PLDI 2011.
• Xin Sui, Donald Nguyen, Martin Burtscher, Keshav Pingali. Parallel Graph Partitioning on Multicore
Architectures. LCPC 2010.
• Mario Méndez-Lojo, Donald Nguyen, Dimitrios Prountzos, et al. Structure-driven optimizations for amorphous
data-parallel programs. PPOPP 2010.
• Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, et al. Machine learning-based prefetch optimization for data
center applications. SC 2009.
27
29. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
30. Galois Programming Model
• Galois system (Stephanie)
– Runtime and library of concurrent data structures
• Galois system (Joe)
– Set iterators express parallelism
– Operator: body of the iterator
– ADTs (e.g., graph, set)
• Unordered execution
– Could have conflicts
– Some serialization of iterations
– New iterations can be added
• Ordered execution
– See Hassaan, Nguyen and Pingali. Kinetic Dependence Graphs (to appear). ASPLOS 2015.
Graph g
for_each (Edge e : wl)
  Node n = g.getEdgeDst(e)
  …
31. Hello Graph

#include "Galois/Galois.h"
#include "Galois/Graph/Graph.h"

// Data structure declarations
typedef Galois::Graph::LC_CSR_Graph<int, int> Graph;
typedef Graph::GraphNode GNode;
Graph graph;

// Operator; the neighborhood is defined by the graph API calls it makes
struct P {
  void operator()(GNode& n, Galois::UserContext<GNode>& c) {
    int& d = graph.getData(n);
    if (d++ < 5)
      c.push(n);
  }
};

int main(int argc, char** argv) {
  Galois::Graph::readGraph(graph, argv[1]);
  …
  // Galois iterator; sequential program semantics
  Galois::for_each(initial, P(), Galois::wl<WL>());
  return 0;
}

Scheduling DSL: the worklist WL is chosen by type, e.g. either of

struct TaskIndexer {
  int operator()(GNode& n) {
    return graph.getData(n) >> shift;
  }
};
using namespace Galois::WorkList;
typedef dChunkedFIFO<> WL;                       // chunked FIFO
typedef OrderedByIntegerMetric<TaskIndexer> WL;  // soft priorities via TaskIndexer
43. Galois Operator
• All data accesses go through the API
– Galois data structure implementations maintain marks
– When marks overlap, roll back the activity and try it later
• Cautious
– Commonly, operators read their entire neighborhood before modifying any element
– Failsafe point: the point in the operator before its first write
• Non-deterministic
• Differences from Thread-Level Speculation [Rauchwerger and Padua, PLDI 1995]
– Iterations are unordered
– Operators are cautious
– Conflicts are detected at the ADT level rather than the memory level
[Figure: threads T1 and T2 with overlapping neighborhoods around nodes A and B]
Graph g
for_each (Node n : wl)
  for (Node m : g.edges(n))
    …
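The mark mechanism can be sketched sequentially: an activity tries to acquire a mark on every element in its neighborhood; if some element is already marked by another activity, it rolls back (releases its marks) and can be retried later. Because a cautious operator acquires everything before its first write, rollback never has to undo writes. This is a toy model, not the Galois implementation:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model of ADT-level conflict detection: marks[e] records which activity
// currently owns element e (0 = unowned). A cautious operator acquires all
// marks for its neighborhood before its first write.
struct MarkTable {
  std::vector<int> marks;
  explicit MarkTable(int num_elements) : marks(num_elements, 0) {}

  // Try to acquire every element of `neighborhood` for `activity`.
  // On a conflict, release anything already acquired and report failure.
  bool acquire(int activity, const std::set<int>& neighborhood) {
    std::vector<int> got;
    for (int e : neighborhood) {
      if (marks[e] != 0 && marks[e] != activity) {
        for (int owned : got) marks[owned] = 0;  // rollback: release marks
        return false;
      }
      marks[e] = activity;
      got.push_back(e);
    }
    return true;
  }

  // Called when an activity commits or aborts for good.
  void release(const std::set<int>& neighborhood) {
    for (int e : neighborhood) marks[e] = 0;
  }
};
```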
44. Galois Data Structures
• Graphs
– May require multiple loads to access the data for a node
– Exploit local computation structure
– Simple allocation schemes may be imbalanced in physical memory
45. Galois Data Structures: CSR, Inlining and Placement
• CSR layout: node data[N], edge idx[N], edge data[E], edge dst[E]
– Legend: nd = node data, ed = edge data, dst = edge destination, len = # of edges
• Inlining (Andrew Lenharth): store a node's edge data and destinations adjacent to its node data, reducing the number of loads per node
• Placement: map virtual-memory pages across physical memories (DRAM 0, DRAM 1) so allocation is balanced
[Figure: CSR arrays, progressively inlined layouts (nd, len, ed, dst), and page placement across DRAM 0 / DRAM 1]
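The CSR layout on the slide can be made concrete with a common variant (plain C++; Galois's LC_CSR_Graph differs in details such as storing end offsets): an offset array of N+1 entries delimits each node's slice of the edge-destination and edge-data arrays.

```cpp
#include <cassert>
#include <vector>

// Sketch of a CSR graph. Node n's outgoing edges occupy positions
// [offset[n], offset[n+1]) of edge_dst and edge_data, so iterating a
// node's neighbors needs only a couple of extra loads.
struct CSR {
  std::vector<int> offset;     // size N+1
  std::vector<int> edge_dst;   // size E: destination of each edge
  std::vector<int> edge_data;  // size E: weight of each edge
  std::vector<int> node_data;  // size N

  int degree(int n) const { return offset[n + 1] - offset[n]; }
  int neighbor(int n, int k) const { return edge_dst[offset[n] + k]; }
  int weight(int n, int k) const { return edge_data[offset[n] + k]; }
};
```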
51. Scheduling in Galois
• Users write high-level scheduling policies
– DSL for specifying policies
• Separation of concerns
– Users understand their operator but have only a vague understanding of scheduling
– Schedulers are difficult to implement efficiently
• The Galois system synthesizes a scheduler by composing elements from a scheduler library
Nguyen and Pingali. Synthesizing concurrent schedulers for irregular algorithms. ASPLOS 2011.
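One policy from such a library can be sketched: an ordered-by-integer-metric worklist keeps a bag of tasks per priority level and always serves the lowest non-empty bag. Priorities are soft, since tasks within one bag come out in an unspecified order. This is a plain, sequential C++ sketch, not the concurrent Galois OrderedByIntegerMetric:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Sketch of an ordered-by-integer-metric worklist: Indexer maps each task
// to an integer priority; tasks live in per-priority bags and are popped
// from the lowest non-empty bag.
template <typename Task, typename Indexer>
class ObimWorklist {
  std::map<int, std::vector<Task>> buckets_;  // ordered by priority value
  Indexer index_;

 public:
  explicit ObimWorklist(Indexer index = Indexer()) : index_(index) {}

  void push(const Task& t) { buckets_[index_(t)].push_back(t); }

  // Pops some task from the lowest-priority bag; order within a bag is
  // unspecified (soft priority). Returns false when empty.
  bool pop(Task& out) {
    if (buckets_.empty()) return false;
    auto it = buckets_.begin();
    out = it->second.back();
    it->second.pop_back();
    if (it->second.empty()) buckets_.erase(it);
    return true;
  }
};
```

This mirrors the TaskIndexer idiom from the Hello Graph slide: shifting a task's distance right by a few bits coarsens the metric, trading strictness of the order for less contention on any single bag.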
78. Galois System
• Joe
– Writes sequential code
– Uses Galois data structures and scheduling hints
• Stephanie
– Writes concurrent code
– Provides a variety of data structures and scheduling policies for Joe to choose from
• cf. frameworks that support only one scheduler or one graph implementation
Stephanie: Parallel Data Structures
Joe: Operator + Schedule
79. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
– Operator, data structures, scheduling
• Galois studies
– Intel study
– Deterministic scheduling
– Adding a scheduler: sparse tiling
– Comparison with transactional memory
– Implementing other programming models
80. Intel Study
Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.
[Figure 3 from the paper: runtime (log scale; lower is better) of Native, CombBLAS, GraphLab, SociaLite, Giraph and Galois on (a) PageRank, (b) Breadth-First Search, (c) Collaborative Filtering and (d) Triangle Counting, over the Livejournal, Facebook, Wikipedia, Netflix and synthetic graphs, on inputs small enough to run on a single node.]
87. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
88. Deterministic Interference Graph (DIG) Scheduling
• Traditional Galois execution
  – Asynchronous, non-deterministic
• Deterministic Galois execution
  – Round-based execution
  – Construct an interference graph
  – Execute independent set
  – Repeat

[Figure: tasks A, B, C operating on a shared data structure, and the interference graph built over them.]

Nguyen, Lenharth and Pingali. Deterministic Galois: on-demand, parameterless and portable. ASPLOS 2014.
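The round structure above can be sketched in plain C++. This is an illustrative sequential sketch, not the Galois implementation: `Task` and integer element ids stand in for activities and their neighborhoods.

```cpp
#include <cassert>
#include <set>
#include <vector>

// A task is identified by an integer id; its "neighborhood" is the set of
// data-structure elements it would touch if executed.
struct Task {
    int id;
    std::vector<int> neighborhood;
};

// One deterministic round: two tasks interfere if their neighborhoods
// overlap. A task executes this round only if no earlier task in id order
// has claimed one of its elements, so the independent set chosen each round
// depends only on task ids, never on thread timing.
// Returns the ids executed; interfering tasks are deferred to later rounds.
std::vector<int> runRound(std::vector<Task>& pending) {
    std::vector<Task> deferred;
    std::vector<int> executed;
    std::set<int> locked;  // elements owned by tasks already chosen this round
    for (const Task& t : pending) {  // pending is sorted by id
        bool independent = true;
        for (int elem : t.neighborhood)
            if (locked.count(elem)) { independent = false; break; }
        if (independent) {
            for (int elem : t.neighborhood) locked.insert(elem);
            executed.push_back(t.id);  // the operator would run here
        } else {
            deferred.push_back(t);
        }
    }
    pending = deferred;
    return executed;
}
```

For example, with A = task 0 touching {1,2}, B = task 1 touching {2,3} and C = task 2 touching {4}, round one executes A and C (B interferes with A), and round two executes B.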
95. DIG Scheduling in Practice
• Just find roots in the interference graph
• Sequence of rounds; each round has two phases
  • Phase 1: mark neighborhood
    – Acquire marks by atomically writing the task id until the failsafe point
    – Overwrite an id only when replacing a greater value
    – Final mark values are deterministic
  • Phase 2: execute roots
    – Execute tasks whose neighborhood contains only their own marks

[Figure: tasks A, B, C marking their neighborhoods, with priority order A < B < C.]
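The two phases above can be sketched with C++ atomics. This is illustrative, not the Galois code: integer marks initialized to a sentinel larger than any task id are an assumption of the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Phase 1 of a DIG round: each task writes its id into the marks of its
// neighborhood, overwriting only a greater id (an atomic min). Because min
// is commutative and associative, the final marks are the same regardless
// of how threads interleave -- this is what makes the schedule deterministic.
void markNeighborhood(std::vector<std::atomic<int>>& marks,
                      int taskId, const std::vector<int>& neighborhood) {
    for (int elem : neighborhood) {
        int cur = marks[elem].load();
        // compare_exchange_weak reloads cur on failure; retry while smaller.
        while (taskId < cur &&
               !marks[elem].compare_exchange_weak(cur, taskId)) {
        }
    }
}

// Phase 2: a task is a root (safe to execute this round) iff every element
// of its neighborhood still carries its own id.
bool isRoot(const std::vector<std::atomic<int>>& marks,
            int taskId, const std::vector<int>& neighborhood) {
    for (int elem : neighborhood)
        if (marks[elem].load() != taskId) return false;
    return true;
}
```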
102. Optimizations
• Resuming tasks
– Avoid re-executing tasks
– Suspend and resume execution at failsafe point
– Provide buffers to save local state
• Windowing
– Inspect only a subset of tasks at a time
– Adaptive algorithm varies window size
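The windowing optimization can be sketched as a commit-ratio heuristic: after each round, compare how many inspected tasks actually committed and resize the window. The thresholds and growth factors here are illustrative assumptions, not the adaptive algorithm Galois actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// A low commit ratio means too much interference: shrink the window so less
// speculative inspection work is wasted. A high ratio means there is room
// for more parallelism: grow the window.
std::size_t nextWindow(std::size_t window, std::size_t committed) {
    double ratio = double(committed) / double(window);
    if (ratio < 0.25) return std::max<std::size_t>(1, window / 2);
    if (ratio > 0.75) return window * 2;
    return window;
}
```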
103. Evaluation of Deterministic Scheduling
Platform
• Intel 4x10 core (2.27 GHz)
Deterministic Systems
• RCDC [Devietti11]
  – Deterministic by scheduling
  – Implementation of the Kendo algorithm [Olszewski09]
• PBBS
  – Deterministic by construction
  – Handwritten deterministic implementations
• Deterministic Galois
  – Non-deterministic programs (g-n) automatically made deterministic (g-d)
Applications
• PBBS [Blelloch11]
  – Breadth-first search (bfs)
  – Delaunay mesh refinement (dmr)
  – Delaunay triangulation (dt)
  – Maximal independent set (mis)
111. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
112. Matrix Completion
• Matrix completion problem
  – Netflix problem
• An optimization problem
  – Find an m×k matrix W and a k×n matrix H (k << min(m,n)) such that A ≈ WH
  – Low-rank approximation
• Algorithms
  – Stochastic gradient descent (SGD)
  – …

[Figure: matrix view (A ≈ WH, with entry Aij approximated by the product of row Wi* and column H*j) and graph view (bipartite graph of users and movies, with one edge per rating Aij).]
113. Naïve SGD Implementation
• Row-wise operator

    for (User u : users)
      for (Item m : items(u))
        op(u, m, Aij);

• Poor cache behavior
  – With k = 100, Wi* ≈ 1KB
  – degree(user) > 1000

[Figure: access pattern over W, H and A for the row-wise operator.]
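The operator `op(u, m, Aij)` can be sketched as a standard SGD step on the latent vectors. This is illustrative, not the Galois code: the step size `eps` and dense `std::vector` latent vectors are assumptions of the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One SGD step for a single observed rating Aum: take a gradient step on
// the user's latent vector Wu and the item's latent vector Hm to reduce
// the squared error (Aum - Wu . Hm)^2.
void op(std::vector<double>& Wu, std::vector<double>& Hm,
        double Aum, double eps) {
    double pred = 0.0;
    for (std::size_t i = 0; i < Wu.size(); ++i) pred += Wu[i] * Hm[i];
    double err = Aum - pred;
    for (std::size_t i = 0; i < Wu.size(); ++i) {
        double wu = Wu[i];
        Wu[i] += eps * err * Hm[i];  // step Wu along the error gradient
        Hm[i] += eps * err * wu;     // step Hm using the old Wu value
    }
}
```

With k = 100 these vectors are ~1KB per user, which is why the row-wise traversal on the slide thrashes the cache: each user touches more than a thousand item vectors before any reuse.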
114. 2D Tiled SGD
• Apply operator to small 2D tiles
• Additional concerns
  – Conflict-free scheduling of tiles
• Future optimizations
  – Adaptive, non-uniform tiling

[Figure: A partitioned into small 2D tiles, each touching one block of W rows and one block of H columns.]
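Conflict-free scheduling of tiles can be sketched with a diagonal schedule, one common approach for this problem rather than necessarily the one Galois uses. Tile (i,j) updates user block i of W and item block j of H, so two tiles conflict iff they share a row block or a column block; with a T×T tiling, the T tiles (i, (i+d) mod T) for a fixed shift d are pairwise conflict-free.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Partition the T x T tiles into T phases. Within a phase every tile has a
// distinct row block and a distinct column block, so all tiles of a phase
// can run in parallel without synchronizing on W or H.
std::vector<std::vector<std::pair<int, int>>> phases(int T) {
    std::vector<std::vector<std::pair<int, int>>> out(T);
    for (int d = 0; d < T; ++d)
        for (int i = 0; i < T; ++i)
            out[d].push_back({i, (i + d) % T});  // distinct rows and columns
    return out;
}
```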
115. Evaluation of Tiling
Parameters
• Same SGD parameters and initial latent vectors between implementations
• ε(n) = α / (1 + β · n^1.5)
• Hand-tuned tile size
Implementations
• Galois
  – 2D scheduling
    • 2 weeks to implement
  – Synchronized updates
• GraphLab
  – Standard GraphLab only supports GD
  – Implement 1D SGD with unsynchronized updates
• Nomad
  – From-scratch distributed code
    • ~1 year to implement
  – 2D scheduling
    • Tile size is a function of #threads
  – Synchronized updates
Datasets

            Items   Users   Ratings   Sparsity
  netflix   17K     480K    99M       1e-2
  yahoo     624K    1M      253M      4e-4

Yun, Yu, Hsieh, Vishwanathan, Dhillon. NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB 2013.
118. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
119. Transactional Memory
• Alternative to adopting a new programming model
  – Parallelization reusing existing sequential programs
• Transactional memory
  – Add threads and transactions but keep existing sequential code
  – TM system ensures correct execution of atomic regions
• Software and hardware impl.
  – Speculative execution

Sequential Program:

    while (begin != end)
      begin += 1
      while (head)
        ...
        head = next
      root.size += 1

TM Program:

    fork()
    while (begin != end)
      atomic
        begin += 1
      atomic
        while (head)
          ...
          head = next
        root.size += 1
    join()
122. Stampede
• STAMP [CaoMinh08]
  – Set of real-world applications using TM
  – Historically, performance has been poor
• Stampede benchmarks
  – Re-implemented STAMP in Galois
  – Compared to TinySTM
  – Median improvement ratio over STAMP of 4.4X at 64 threads (max: 37X, geomean: 3.9X)

[Figure: speedup vs. threads (up to 64) for the Galois and STM versions of genome, kmeans-low, ssca2, vacation-low and yada.]
• Blue Gene/Q
• Median of 5 runs

Nguyen and Pingali. What scalable transactional programs need from transactional memory. In submission to PLDI 2015.
124. Stampede Performance
• Controlling communication is paramount
  – Chief performance cost is moving bits around
• Scalable principle 1: disjoint access
  – Recall implementation of Bag and OrderedByIntegerMetric scheduler
• Scalable principle 2: virtualize tasks
  – Be as flexible as possible to dynamic conditions
• These changes are beyond the capability of TM
  – Use Galois data structures (disjoint access)
  – Use Galois scheduler (virtualized tasks)

TM Program:

    fork()
    while (begin != end)
      atomic
        begin += 1
      atomic
        while (head)
          ...
          head = next
        root.size += 1
    join()
125. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
126. DSLs
• Layer other DSLs on top of Galois
  – Other programming models are more restrictive than the operator formulation
• Bulk-Synchronous Vertex Programs
  – Nodes are active
  – Operator is over a node and its neighbors
  – Execution is in rounds
    • Read values taken from previous round
    • Overlapping writes reduced to a single value
  – Ex: Pregel [Malewicz10], Ligra [Shun13]
• Gather-Apply-Scatter
  – Refinement of bulk-synchronous into subrounds
  – Ex: GraphLab [Low10] / PowerGraph [Gonzalez12]

[Figure: example graph with nodes a–e.]
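A bulk-synchronous vertex-program round can be sketched on top of an ordinary operator-style loop over nodes. This is an illustrative sketch, not the actual DSL layered on Galois; summation is assumed as the reduction for overlapping writes.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency list of out-edges

// One bulk-synchronous round: reads come from the previous round's values
// (cur), writes from different neighbors to the same node are reduced by
// summation into the next round's values (next), then the buffers swap.
void round(const Graph& g, std::vector<double>& cur,
           std::vector<double>& next) {
    std::fill(next.begin(), next.end(), 0.0);
    for (std::size_t v = 0; v < g.size(); ++v)   // every node is active
        for (int w : g[v])                       // operator: node + neighbors
            next[w] += cur[v] / g[v].size();     // reduce overlapping writes
    std::swap(cur, next);
}
```

Double buffering is what enforces the "read values taken from previous round" rule: no write in a round is visible until the round ends.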
129. Insufficiency of Vertex Programs
• Connected components problem
  – Construct a labeling function such that color(x) = color(neighbor(x))
• Algorithms
  – Union-Find-based
  – Label propagation
  – Hybrids

[Figure: pointer jumping on a chain a → b → c.]
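The union-find structure with pointer jumping can be sketched as follows (the standard textbook version, not the Galois implementation). It illustrates why this algorithm falls outside the vertex-program model: find() chases parent pointers across arbitrarily distant nodes, not just immediate neighbors.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);  // each node its own root
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // pointer jumping: halve the path
            x = parent[x];
        }
        return x;
    }
    // Merge the components containing a and b.
    void merge(int a, int b) { parent[find(a)] = find(b); }
};
```

Two nodes end up in the same connected component exactly when find() returns the same root for both, which serves as the color(x) labeling from the slide.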
136. [Figure: ratio of PowerGraph runtime to Galois runtime (log scale, up to 10000x) and of PowerGraph runtime to PowerGraph-on-Galois runtime (up to 6x) for bfs, cc, dia, pr and sssp.]
• The best algorithm may require application-specific scheduling
  – Priority scheduling for SSSP
• The best algorithm may not be expressible as a vertex program
  – Connected components with union-find
• Speculative execution required for high-diameter graphs
  – Bulk-synchronous scheduling uses many rounds and has too much overhead
139. Outline
• Parallel Program = Operator + Schedule + Parallel Data Structure
• Galois system implementation
  – Operator, data structures, scheduling
• Galois studies
  – Intel study
  – Deterministic scheduling
  – Adding a scheduler: sparse tiling
  – Comparison with transactional memory
  – Implementing other programming models
140. Third-party Evaluations

"Galois performs better than other frameworks, and is close to native performance"
Nadathur et al. Navigating the maze of graph analytics frameworks. SIGMOD 2014.

"This paper demonstrates that speculative parallelization and the operator formalism, central to Galois' programming model and philosophy, is the best choice for irregular CAD algorithms that operate on graph-based data structures."
Moctar and Brisk. Parallel FPGA Routing based on the Operator Formulation. DAC 2014.